..

Permuterm Index

  • B-tree can handle prefix queries such as mon*
  • They can also handle suffix queries like *ed. We can do this by creating a new inverted index but with all the words backwards
  • The part that they struggle with is queries like an*l. The naive way is to process the query as an* AND *l
  • But this is very expensive
  • The solution is the Permuterm Index
  • It is a separate index that we create with all the permutations of the each of the term in the regular inverted index

Procedure to create all the permutations of a term

  • Suppose we have a term hello
  • First add the special character $ to end to make it hello$
  • The perform Left Circular Shift until the $ character is the first

[!example]

hello$
ello$h
llo$he
lo$hel
o$hell
$hello

Creation of the permuterm index

[!example] Consider a very small vocabulary of [hello, world] The permutations of hello are hello$, ello$h ... $hello The permutations of world are world$, orld$w, rld$wo ... $world The permuterm index will have [hello$, ello$h, llo$he, lo$hel, o$hell, $hello, world$, orld$w, rld$wo, ld$wor, d$worl, $world] The associated posting list of these will be the original word that they were derived from So the final index will look like

{

“hello$”: “hello”, “ello$h”: “hello”, “llo$he”: “hello”, “lo$hel”: “hello”, “o$hell”: “hello”, “$hello”: “hello”, “world$”: “world”, “orld$w”: “world”, “rld$wo”: “world”, “ld$wor”: “world”, “d$worl”: “world”, “$world”: “world” }

We store these as a *B-tree* - the same way an inverted index is stored.

How to process wild card queries

Suppose we have a wild card query like X*Y where X and Y are some strings. How do we process them? First we add $ to the end of the query to make it X*Y$ then we do circular left shift till the * is at the end. So X*Y$ -> *Y$X -> Y$X*. Then we perform a prefix search on Y$X.

Why does this work?

Remember that prefix search works well on B-trees. If we can turn any arbitrary wild card into a prefix search then we are golden.