2023-10-06

Chapter 1 - Boolean Retrieval

Questions

What is Information Retrieval?
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
What is Unstructured Data?
- The term Unstructured data refers to data that does not have clear, semantically overt, easy-for-a-computer structure
What are the three types of Information Retrieval?
- Web
- Personal
- Enterprise
Mention three reasons why grep won’t scale to IR?
- Dataset that are in order of billions
- Can’t perform queries like x NEAR y where NEAR can be defined as being within 3 words
- Can’t perform Ranked Retrieval. We need to have some sort of ranking of the documents that match the query
How do you avoid linearly scanning text?
- By indexing before hand
What are terms and give examples for when they aren’t the same as words
- Terms are indexed units
- They are the same as a word most of the time
- But terms like I-9 or Hong Kong aren’t thought of as words in the traditional sense

What is the structure of a term-document incidence matrix

Terms\Documents	Document 1	Document 2	Document n
Term 1	1	0	1
Term 2	1	1	0
Term 3	1	1	1
…
Term n	0	0	1

mat[t][d] is 1 if d has the term t in it else 0

What is a boolean retrieval model?
- The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators and, or, and not. The model views each document as just a set of words.
Suppose you are given a term-document incidence matrix. Consider a query brutus AND caesar AND NOT calpurnia. How would you go about processing this query?
- Get the vectors corresponding to each of the terms
- Perform bitwise logical operations between them as given in the query
- The resultant vector gives the documents that satisfy the query
- Eg:
  - brutus = [1, 0, 1, 0, 0, 1, 1, ..., 0]
  - caesar = [0, 1, 0, 1, 0, 0, 1, ..., 1]
  - calpurnia = [1, 1, 0, 0, 1, 1, 0, ..., 0]
  - brutus AND caesar AND NOT calpurnia
  - [1, 0, 1, ..., 0] & [0, 1, 0, ..., 1] & ~[1, 1, 0, ..., 0]
What is a document?
- Document is the unit of information that we want to returned when we query a IR system
- They can be memos, chapters, files etc …
What is a collection and what is its other name?
- Collection is the set of documents over which a query is run
- It is also known as a corpus
Differentiate precision and recall
- Precision: What proportion of positive identifications was actually correct?
  - $\text{Precision} = \frac{TP}{TP+FP}$
- Recall: What proportion of actual positives was identified correctly?
  - $\text{Recall} = \frac{TP}{TP+FN}$
Which deficiency of the Term-Document Incidence Matrix does Inverted Index means to solve?
- In any given corpus, each document only has a very small percent of the entire terms of the corpus
- Suppose we have a corpus of 1000 documents with each having 100 words
- Assume that 25% of words in a document is common to all. This means that there are 75 new words per document. Totally we would have 75025 unique words. Since a document only has 100 words, only 100 of its cells will have a 1. The rest (74925) will be 0s
- This makes the matrix very sparse. It would be a waste of memory to store that many 0s which don’t even make any useful contribution to the search
- This is the main deficiency Inverted Index solves
- The main idea is to somehow only store those 1s
What is the structure of an inverted index
- An inverted index can be thought of as map from terms to list of documents containing them
- {"Brutus": [69, 420]}
What are other names for the structure associated with an inverted index
- The data structure is known as a dictionary
- The set of terms is known as a vocabulary
Define posting, posting list, postings
- Posting is a document-id that contains a given terms
- Posting list is a list of posting
- Postings is the set of all posting lists

type Posting int
type PositingList Posting[]
type Postings PostingList[]

Comment on the order in which the terms and the posting occur in a dictionary
- Terms are sorted alphabetically
- Posting List is sorted by document ID
Illustrate the different stages in building an inverted index with the following corpus
- new home sales top forecasts
- home sales rise in july
- increase in home sales in july
- july new home sales rise

('new', 0)
('home', 0)
('sales', 0)
('top', 0)
('forecasts', 0)
('home', 1)
('sales', 1)
('rise', 1)
('in', 1)
('july', 1)
('increase', 2)
('in', 2)
('home', 2)
('sales', 2)
('in', 2)
('july', 2)
('july', 3)
('new', 3)
('home', 3)
('sales', 3)
('rise', 3)

Sorting

('forecasts', 0)
('home', 0)
('home', 1)
('home', 2)
('home', 3)
('in', 1)
('in', 2)
('in', 2)
('increase', 2)
('july', 1)
('july', 2)
('july', 3)
('new', 0)
('new', 3)
('rise', 1)
('rise', 3)
('sales', 0)
('sales', 1)
('sales', 2)
('sales', 3)
('top', 0)

Merging

forecasts [0]
home [0, 1, 2, 3]
in [1, 2, 2]
increase [2]
july [1, 2, 3]
new [0, 3]
rise [1, 3]
sales [0, 1, 2, 3]
top [0]

How to process boolean AND query when using inverted indices?
- To process AND operations find the posting list corresponding to the terms
- Perform the merge() operation on them to get the final list
Give pseudocode for merge() ?

def merge(p1, p2):
	ret = []
	p1_ptr = 0
	p2_ptr = 0
	# Check if we are actually inside the list
	while p1_ptr < len(p1) and p2_ptr < len(p2):
		if p1[p1_ptr] == p2[p2_ptr]:
			ret.append(p1[p1_ptr])
		# Increment smaller on until it becomes equal
		elif p1[p1_ptr] < p2[p2_ptr]:
			p1_ptr += 1
		else:
			p2_ptr += 1
	return ret

What is the complexity of the merge() operation?
- $O(x+y)$ where $x$ and $y$ are the lengths of the posting lists
How would you go about optimizing a conjunctive query? And give the rationale
- Process the smallest lists first
- The complexity of the merge() operation is $O(x+y)$
- This is the worst case complexity when we have to traverse both the lists fully
- The algorithm ends if one of the pointers goes out of the list
- A key observation of merge() is that len(merge(a, b)) <= min(len(a), len(b))
- The resultant list always smaller or equal in size to the smaller list
- By processing the smallest list first we are putting a bound on the list size
Suppose you have a query of the form (t1 OR t2) AND (t3 or t4) AND (t5 OR t6). Which disjunction would you perform first to reduce the overall complexity
- In the case of disjunctions we can calculate their frequency as the sum of the individual frequencies. This is the worst case estimate when both the list are totally different from one another
- Then we can use the regular AND optimization of processing the smaller lists first