..

Tolerant Retrieval

Dictionary Storing Mechanism

[!question]- What are the two main data structures to store dictionaries

[!question]- What are the pros of using a hashtable

[!question]- What are the cons of using a hashtable

[!question]- What are the pros of using a Tree

[!question]- What are the cons of using a Tree

[!question]- Draw a sample tree

Wild Card Queries

[!question]- Using tree how would you retrieve queries like mon*

[!question]- Using tree how would you retrieve queries like *ed

[!question]- What is the flaw with using trees to process wildcard queries

[!question]- What is the solution for handling arbitrary wildcard queries using b-trees

Refer Permuterm for concepts

[!question]- What is the problem with permuterm index

Bi-gram index

[!question]- How would you construct a bi-gram index

[!question]- Are bigram index and biword index the same?

[!question]- How would you process a wild card query using n-gram indices

[!question]- Give the pros and cons of bi-gram and Permuterm

Spelling Correction

[!question]- What are the two main uses of spelling correction

[!question]- What are the two types of spelling correction

[!question]- Give one practical use case of spelling correction

[!question]- In isolated word correction what are the choices for the source of lexicon?

[!question]- What is a subtle drawback of using the indexed corpus as a source of lexicons

[!question]- What are the three ways to do isolated spelling correction?

[!question]- Why do we need weighted edit distance?

[!question]- Give the Levenshtein edit distance algorithm

def edit_distance(a:str, b:str):
 	mat = [[0 for _ in range(len(b)+1)] for _ in range(len(a)+1)] 
 	mat[0][0] = 0;
 	for i in range(len(a)+1):
 		mat[i][0] = i
 	for i in range(len(b)+1):
 		mat[0][i] = i
 	for i in range(1, len(a)+1):
 		for j in range(1, len(b)+1):
 			mat[i][j] = min(
	 			mat[i-1][j-1] + (0 if a[i-1] == b[j-1] else 1),
		 		mat[i-1][j]+1,
	 			mat[i][j-1]+1 )
 	return mat[-1][-1]

[!question] Give the heuristic for filling each of the cells in computing edit distance

Diag (0 if equal or 1 if not equal) Above +1
Left +1 Min of all the other three

[!question]- Give the structure of a k-gram index dict(k-gram, term[]) It is a mapping from a k-gram to the terms from the original index that contains that k-gram

[!question]- What is the formula for Jaccard Coefficient

[!question]- Give pseudo code for computing JC using only one of the kgrams

def jaccard_coefficient(query:str, term:str, k:int)->float:
  kgram = kgram_create(query, k)
  ins = 0
  for q in kgram:
  	if q in term:
  		ins += 1
  return ins/(kgram_num(term, k) + len(kgram) - ins)

[!question]- How to do context sensitive correction

[!question]- Give the mathematical formula for ranking the alternative spellings

Soundex

[!question]- Give the format of a soundex reduced term

[!question]- Give the soundex algorithm

  • Retain the first word
  • Replace letters using the following rules
    • aeiou why - 0
    • bfpv - 1
    • other letters - 2
    • dt - 3
    • l - 4
    • mn- 5
    • r - 6
  • Remove all the numbers that repeat
  • Remove all the 0s
  • Pad with trailing zeros until you get a 4 character sequence
  • Return the 4 character sequence

[!question]- What should you do first - remove consecutive digits or remove zeros? Remove consecutive digits