Chapter 1 - Boolean Retrieval
Questions
-
What is Information Retrieval?
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
-
What is Unstructured Data?
- The term Unstructured data refers to data that does not have clear, semantically overt, easy-for-a-computer structure
-
What are the three types of Information Retrieval?
- Web
- Personal
- Enterprise
-
Mention three reasons why
grep
won’t scale to IR?- Dataset that are in order of billions
- Can’t perform queries like
x NEAR y
whereNEAR
can be defined as being within 3 words - Can’t perform Ranked Retrieval. We need to have some sort of ranking of the documents that match the query
-
How do you avoid linearly scanning text?
- By indexing before hand
-
What are terms and give examples for when they aren’t the same as words
- Terms are indexed units
- They are the same as a word most of the time
- But terms like I-9 or Hong Kong aren’t thought of as words in the traditional sense
-
What is the structure of a term-document incidence matrix
Terms\Documents Document 1 Document 2 … Document n Term 1 1 0 1 Term 2 1 1 0 Term 3 1 1 1 … Term n 0 0 1 mat[t][d]
is 1 ifd
has the termt
in it else 0 -
What is a boolean retrieval model?
- The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators and, or, and not. The model views each document as just a set of words.
-
Suppose you are given a term-document incidence matrix. Consider a query
brutus AND caesar AND NOT calpurnia
. How would you go about processing this query?- Get the vectors corresponding to each of the terms
- Perform bitwise logical operations between them as given in the query
- The resultant vector gives the documents that satisfy the query
- Eg:
brutus = [1, 0, 1, 0, 0, 1, 1, ..., 0]
caesar = [0, 1, 0, 1, 0, 0, 1, ..., 1]
calpurnia = [1, 1, 0, 0, 1, 1, 0, ..., 0]
brutus AND caesar AND NOT calpurnia
[1, 0, 1, ..., 0] & [0, 1, 0, ..., 1] & ~[1, 1, 0, ..., 0]
-
What is a document?
- Document is the unit of information that we want to returned when we query a IR system
- They can be memos, chapters, files etc …
-
What is a collection and what is its other name?
- Collection is the set of documents over which a query is run
- It is also known as a corpus
-
Differentiate precision and recall
- Precision: What proportion of positive identifications was actually correct?
- $\text{Precision} = \frac{TP}{TP+FP}$
- Recall: What proportion of actual positives was identified correctly?
- $\text{Recall} = \frac{TP}{TP+FN}$
- Precision: What proportion of positive identifications was actually correct?
-
Which deficiency of the Term-Document Incidence Matrix does Inverted Index means to solve?
- In any given corpus, each document only has a very small percent of the entire terms of the corpus
- Suppose we have a corpus of 1000 documents with each having 100 words
- Assume that 25% of words in a document is common to all. This means that there are 75 new words per document. Totally we would have 75025 unique words. Since a document only has 100 words, only 100 of its cells will have a 1. The rest (74925) will be 0s
- This makes the matrix very sparse. It would be a waste of memory to store that many 0s which don’t even make any useful contribution to the search
- This is the main deficiency Inverted Index solves
- The main idea is to somehow only store those 1s
-
What is the structure of an inverted index
- An inverted index can be thought of as map from terms to list of documents containing them
{"Brutus": [69, 420]}
-
What are other names for the structure associated with an inverted index
- The data structure is known as a dictionary
- The set of terms is known as a vocabulary
-
Define posting, posting list, postings
- Posting is a document-id that contains a given terms
- Posting list is a list of posting
- Postings is the set of all posting lists
type Posting int
type PositingList Posting[]
type Postings PostingList[]
- Comment on the order in which the terms and the posting occur in a dictionary
- Terms are sorted alphabetically
- Posting List is sorted by document ID
- Illustrate the different stages in building an inverted index with the following corpus
- new home sales top forecasts
- home sales rise in july
- increase in home sales in july
- july new home sales rise
('new', 0)
('home', 0)
('sales', 0)
('top', 0)
('forecasts', 0)
('home', 1)
('sales', 1)
('rise', 1)
('in', 1)
('july', 1)
('increase', 2)
('in', 2)
('home', 2)
('sales', 2)
('in', 2)
('july', 2)
('july', 3)
('new', 3)
('home', 3)
('sales', 3)
('rise', 3)
Sorting
('forecasts', 0)
('home', 0)
('home', 1)
('home', 2)
('home', 3)
('in', 1)
('in', 2)
('in', 2)
('increase', 2)
('july', 1)
('july', 2)
('july', 3)
('new', 0)
('new', 3)
('rise', 1)
('rise', 3)
('sales', 0)
('sales', 1)
('sales', 2)
('sales', 3)
('top', 0)
Merging
forecasts [0]
home [0, 1, 2, 3]
in [1, 2, 2]
increase [2]
july [1, 2, 3]
new [0, 3]
rise [1, 3]
sales [0, 1, 2, 3]
top [0]
- How to process boolean
AND
query when using inverted indices?- To process
AND
operations find the posting list corresponding to the terms - Perform the
merge()
operation on them to get the final list
- To process
- Give pseudocode for
merge()
?
def merge(p1, p2):
ret = []
p1_ptr = 0
p2_ptr = 0
# Check if we are actually inside the list
while p1_ptr < len(p1) and p2_ptr < len(p2):
if p1[p1_ptr] == p2[p2_ptr]:
ret.append(p1[p1_ptr])
# Increment smaller on until it becomes equal
elif p1[p1_ptr] < p2[p2_ptr]:
p1_ptr += 1
else:
p2_ptr += 1
return ret
- What is the complexity of the
merge()
operation?- $O(x+y)$ where $x$ and $y$ are the lengths of the posting lists
- How would you go about optimizing a conjunctive query? And give the rationale
- Process the smallest lists first
- The complexity of the
merge()
operation is $O(x+y)$ - This is the worst case complexity when we have to traverse both the lists fully
- The algorithm ends if one of the pointers goes out of the list
- A key observation of
merge()
is thatlen(merge(a, b)) <= min(len(a), len(b))
- The resultant list always smaller or equal in size to the smaller list
- By processing the smallest list first we are putting a bound on the list size
- Suppose you have a query of the form
(t1 OR t2) AND (t3 or t4) AND (t5 OR t6)
. Which disjunction would you perform first to reduce the overall complexity- In the case of disjunctions we can calculate their frequency as the sum of the individual frequencies. This is the worst case estimate when both the list are totally different from one another
- Then we can use the regular
AND
optimization of processing the smaller lists first