Permuterm Index
- B-tree can handle prefix queries such as
mon*
- They can also handle suffix queries like
*ed
. We can do this by creating a new inverted index but with all the words backwards - The part that they struggle with is queries like
an*l
. The naive way is to process the query asan* AND *l
- But this is very expensive
- The solution is the Permuterm Index
- It is a separate index that we create with all the permutations of the each of the term in the regular inverted index
Procedure to create all the permutations of a term
- Suppose we have a term
hello
- First add the special character
$
to end to make ithello$
- The perform Left Circular Shift until the
$
character is the first
[!example]
hello$ ello$h llo$he lo$hel o$hell $hello
Creation of the permuterm index
[!example] Consider a very small vocabulary of
[hello, world]
The permutations ofhello
arehello$, ello$h ... $hello
The permutations ofworld
areworld$, orld$w, rld$wo ... $world
The permuterm index will have[hello$, ello$h, llo$he, lo$hel, o$hell, $hello, world$, orld$w, rld$wo, ld$wor, d$worl, $world]
The associated posting list of these will be the original word that they were derived from So the final index will look like{
“hello$”: “hello”, “ello$h”: “hello”, “llo$he”: “hello”, “lo$hel”: “hello”, “o$hell”: “hello”, “$hello”: “hello”, “world$”: “world”, “orld$w”: “world”, “rld$wo”: “world”, “ld$wor”: “world”, “d$worl”: “world”, “$world”: “world” }
We store these as a *B-tree* - the same way an inverted index is stored.
How to process wild card queries
Suppose we have a wild card query like X*Y
where X
and Y
are some strings. How do we process them? First we add $
to the end of the query to make it X*Y$
then we do circular left shift till the *
is at the end. So X*Y$ -> *Y$X -> Y$X*
. Then we perform a prefix search on Y$X
.
Why does this work?
Remember that prefix search works well on B-trees. If we can turn any arbitrary wild card into a prefix search then we are golden.