2023-10-07

Dictionary

Tokens

[!question]- What is a token? A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

[!question]- What is a type? The class of all tokens containing the same sequence of characters.

[!question]- What is a term? A type that is indexed.
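
To make the token / type / term distinction concrete, here is a small sketch. The regex tokenizer and the one-word stop list are assumptions chosen for illustration, not a recommended pipeline.

```python
import re

# Hypothetical stop list: these types are simply not indexed.
STOP_WORDS = {"to"}

def tokenize(text):
    # Naive tokenizer (an assumption): every run of letters or digits is a token.
    return re.findall(r"[A-Za-z0-9]+", text)

text = "to sleep perchance to dream"

tokens = tokenize(text)        # every occurrence counts, so "to" appears twice
types = set(tokens)            # distinct character sequences
terms = types - STOP_WORDS     # the types that actually get indexed

print(len(tokens), len(types), len(terms))
```

Here the second "to" is a separate token but not a new type, and dropping "to" as a stop word leaves fewer terms than types.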

[!question]- What are some of the problems in tokenization?

[!question]- Why would you eliminate stop words?

[!question]- Why is the trend of eliminating stop words dying out?

[!question]- What is the need for normalization?

[!question]- What is an important criterion when doing normalization?

[!question]- Is the normalization for the indexed text and the query the same?

[!question]- The general convention is to convert everything to lowercase. How would you then differentiate between FED and fed? Give an example to support the claim that it is acceptable to convert everything to lowercase.
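
A rough illustration of case folding, assuming a toy in-memory index: the same hypothetical `normalize` function is applied to document tokens and to query tokens. It shows both the benefit (a lowercase query still matches) and the cost (FED and fed become the same term).

```python
def normalize(token):
    # Case folding only; real systems may do more (an assumption for this sketch).
    return token.lower()

def build_index(docs):
    # Map each normalized term to the set of document IDs containing it.
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

def search(index, query):
    # AND query: intersect the postings of every normalized query token.
    result = None
    for token in query.split():
        postings = index.get(normalize(token), set())
        result = postings if result is None else result & postings
    return result or set()

# Made-up documents for illustration.
docs = {1: "the FED raised rates", 2: "she fed the cat"}
index = build_index(docs)

print(search(index, "fed"))        # both documents match after case folding
print(search(index, "Fed rates"))  # extra query words narrow the result
```

The second query hints at why lowercasing is often tolerable in practice: other words in the query can disambiguate the sense that case folding erased.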

[!question]- What is the alternative to normalization?

[!question]- How would you handle synonyms and homonyms?

[!question]- How would you deal with spelling mistakes?

Stemming and Lemmatization

[!question]- What is the difference between stemming and lemmatization?
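
A toy contrast between the two ideas, under loud assumptions: `crude_stem` is a made-up suffix stripper (not the Porter algorithm or any real stemmer), and `toy_lemmatize` uses a tiny hand-written dictionary in place of real vocabulary and morphological analysis.

```python
# Hypothetical suffix list; real stemmers use far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    # Chop the first matching suffix, keeping at least 3 characters of stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping inflected forms to their lemma.
LEMMAS = {"saw": "see", "mice": "mouse", "running": "run"}

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for word in ["cats", "running", "mice", "saw"]:
    print(word, crude_stem(word), toy_lemmatize(word))
```

The stemmer only manipulates characters, so it can produce non-words and cannot relate irregular forms, while the dictionary-based lookup returns actual base forms but only for words it knows.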

[!question]- What is the most common stemming algorithm for English?

[!question]- Does stemming help?

Skip Pointers

[!question]- What is the tradeoff in choosing the number of skip pointers?

[!question]- What is the general heuristic for skip distance?

[!question]- What is the problem with the $\sqrt{L}$ heuristic?

[!question]- Suppose the skip distance is $n$. How many postings lie between a skip pointer and its destination?
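
For reference, a minimal sketch of intersecting two sorted postings lists with skip pointers. It assumes plain Python lists, with a skip pointer implied at every index that is a multiple of the skip distance, and uses the $\sqrt{L}$ heuristic for that distance. The function name and layout are illustrative, not a standard API.

```python
import math

def intersect_with_skips(p1, p2):
    # Skip distance per list from the sqrt(L) heuristic (at least 1).
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if one starts here and its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

# Made-up postings lists of document IDs, sorted ascending.
a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))
```

A skip is safe because the lists are sorted: if the skip target is still no larger than the other list's current posting, nothing that was jumped over could have matched.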