Phrase Queries

[!question]- How can phrase queries be answered using bigrams?

Instead of indexing only single words, index every pair of consecutive words (a bi-word index). A two-word phrase query then becomes a trivial dictionary lookup.
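
As a rough illustration of the idea, here is a minimal Python sketch of a bi-word index. The function name `build_biword_index`, the whitespace tokenisation and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each pair of consecutive words to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()  # naive whitespace tokenisation (assumption)
        for first, second in zip(words, words[1:]):
            index[(first, second)].add(doc_id)
    return index

# Toy collection, invented for illustration.
docs = {
    0: "new york university",
    1: "york is a new city",
}
biword_index = build_biword_index(docs)

# A two-word phrase query is a single lookup.
print(sorted(biword_index[("new", "york")]))  # [0]
```

Document 1 contains both words but never as the consecutive pair, so it is correctly excluded.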

[!question]- Suppose you search for the phrase "foo bar foobar", but the index only contains bigrams. How can this work without rebuilding the index?

Split the query into overlapping bigrams and combine them with a conjunction: "foo bar" AND "bar foobar".

[!question]- What is the problem with using conjunction to handle longer phrase queries with just a bigram index?

The method can produce false positives. A bigram index records only which documents contain each word pair, not where the pair occurs, so the AND can be satisfied even when the pairs sit at opposite ends of the document. Every result therefore has to be checked again to verify that the document actually contains the full phrase.
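
The false-positive problem can be reproduced with a small Python sketch. The helper names (`biword_candidates`, `contains_phrase`) and the two toy documents are hypothetical. The post-filter shown here simply rescans the document text, which is one possible way to do the re-check.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each pair of consecutive words to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pair in zip(words, words[1:]):
            index[pair].add(doc_id)
    return index

def biword_candidates(index, phrase):
    """AND together the postings of every overlapping bigram in the phrase."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

def contains_phrase(text, phrase):
    """Post-filter: verify the words really occur as one contiguous run."""
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

docs = {
    0: "foo bar foobar",
    1: "foo bar baz qux bar foobar",  # both bigrams present, phrase absent
}
index = build_biword_index(docs)
candidates = biword_candidates(index, "foo bar foobar")
verified = {d for d in candidates if contains_phrase(docs[d], "foo bar foobar")}

print(sorted(candidates))  # [0, 1]  (doc 1 is a false positive)
print(sorted(verified))    # [0]
```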

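[!question]- What are the issues with using bi-word indices?

  • False positives for phrases longer than two words
  • Dictionary size: the vocabulary of word pairs is far larger than the vocabulary of single words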

[!question]- What is a better alternative to a bi-word index?

A positional index.

[!question]- Give the structure of a positional index

type positional_index dict(term, positional_posting_list)
type positional_posting_list dict(doc_id, positions)
type positions position[]
type position int
type doc_id int

[!example] {"foo": {0: [1, 4], 1: [5, 6], 3: [4, 5]}}
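
A minimal Python sketch of building this structure, assuming whitespace tokenisation and zero-based word offsets. Both are illustrative choices, and the name `build_positional_index` is made up for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id -> ascending list of word offsets}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            # Offsets are appended in reading order, so each list stays sorted.
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {0: "foo bar foo", 1: "bar foo"}
index = build_positional_index(docs)
print(dict(index))
# {'foo': {0: [0, 2], 1: [1]}, 'bar': {0: [1], 1: [0]}}
```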

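[!question]- How is a phrase query processed with a positional index?

Apply the merge() algorithm recursively. First merge the posting lists at the document level to find the documents that contain every term. Then, within each of those documents, merge the position lists and keep the document only if the terms occur at consecutive positions.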
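
The two levels can be sketched in Python as follows. This is a simplified version that uses set intersection and set membership instead of a linear merge over sorted lists, so it shows the logic rather than the efficient algorithm. The function name `phrase_query` and the sample index are assumptions.

```python
def phrase_query(index, phrase):
    """Return sorted doc ids in which the words of `phrase` occur consecutively.

    `index` has the shape term -> {doc_id -> list of positions}.
    """
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Level 1: document-level intersection of the postings.
    common_docs = set(index[terms[0]])
    for term in terms[1:]:
        common_docs &= set(index[term])
    # Level 2: inside each candidate document, look for a start position p
    # such that the i-th following term occurs at position p + i.
    hits = []
    for doc_id in sorted(common_docs):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        if any(all(p + i + 1 in positions for i, positions in enumerate(later))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

# Doc 0 is "foo bar foobar"; doc 1 is "foo bar baz qux bar foobar".
index = {
    "foo":    {0: [0], 1: [0]},
    "bar":    {0: [1], 1: [1, 4]},
    "foobar": {0: [2], 1: [5]},
}
print(phrase_query(index, "foo bar foobar"))  # [0]
```

Unlike the bigram conjunction, document 1 is rejected here without rescanning its text, because the positions show that the three words never line up.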

[!question]- What is a proximity query?

It is similar to a phrase query, but the words do not have to be adjacent. They only need to occur within a distance of k words of each other, where k is specified by the user.

[!question]- As a rule of thumb, what is the size difference between a positional and a non-positional index?

A positional index is roughly 2-4 times as large.

[!question]- Estimate the size of a positional index relative to the original text size.

Roughly 35-50% of the original text size.

[!question]- How would you combine a bi-word index and a positional index?

The drawback of a positional index shows up when the query words have very large posting lists: a simple query like "Michael Jordan" can take a long time, because two long lists must be merged position by position. The same query is a single lookup in a bi-word index. By keeping both and using them in a complementary way (bi-words for such frequent pairs, positions for everything else), we get most of the benefits of both, at the cost of the extra space for the second index.
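
A possible routing rule for such a combined setup is sketched below in Python. It assumes that only selected frequent pairs are stored as bi-words and that everything else falls back to the positional index. The function `hybrid_phrase_query`, the stub `positional_lookup` and the data are all hypothetical.

```python
def hybrid_phrase_query(phrase, biword_index, positional_lookup):
    """Answer from the bi-word index when possible, else fall back."""
    terms = tuple(phrase.lower().split())
    if len(terms) == 2 and terms in biword_index:
        return sorted(biword_index[terms])   # direct lookup, no position merge
    return positional_lookup(phrase)         # slower, but handles any phrase

# Hypothetical data: only selected frequent pairs are stored as bi-words.
biword_index = {("michael", "jordan"): {3, 7}}

def positional_lookup(phrase):
    """Stand-in for a real positional phrase search (returns a fixed answer)."""
    return [7]

print(hybrid_phrase_query("Michael Jordan", biword_index, positional_lookup))  # [3, 7]
print(hybrid_phrase_query("jordan rules", biword_index, positional_lookup))    # [7]
```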

[!question]- Give the algorithm for proximity query using positional index
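
One way to answer this is a positional intersection of the two posting lists: walk both lists in doc id order, and for each shared document slide a window over the second term's positions. The Python sketch below follows that idea. The dict-based postings layout, the name `positional_intersect` and the sample data are assumptions for the example.

```python
from collections import deque

def positional_intersect(postings1, postings2, k):
    """Return (doc_id, pos1, pos2) triples with |pos1 - pos2| <= k.

    postings1 / postings2: {doc_id: ascending positions} for the two terms.
    """
    answer = []
    docs1, docs2 = sorted(postings1), sorted(postings2)
    i = j = 0
    while i < len(docs1) and j < len(docs2):
        if docs1[i] == docs2[j]:
            doc_id = docs1[i]
            positions2 = postings2[doc_id]
            window = deque()  # positions of term 2 within k of the current pos1
            nxt = 0
            for pos1 in postings1[doc_id]:
                # Pull in every position of term 2 up to pos1 + k.
                while nxt < len(positions2) and positions2[nxt] <= pos1 + k:
                    window.append(positions2[nxt])
                    nxt += 1
                # Drop positions that have fallen more than k behind pos1.
                while window and window[0] < pos1 - k:
                    window.popleft()
                answer.extend((doc_id, pos1, pos2) for pos2 in window)
            i += 1
            j += 1
        elif docs1[i] < docs2[j]:
            i += 1
        else:
            j += 1
    return answer

foo = {0: [1, 4], 1: [5, 6], 3: [4, 5]}
bar = {0: [2, 9], 3: [20]}
print(positional_intersect(foo, bar, 1))  # [(0, 1, 2)]
print(positional_intersect(foo, bar, 5))  # [(0, 1, 2), (0, 4, 2), (0, 4, 9)]
```

With k = 1 and a check that pos2 comes directly after pos1, the same routine answers two-word phrase queries.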