TF-IDF
Notes
TF-IDF
Questions
[!question]- Define TF
def term_frequency(term: str, doc: list[str]) -> float:
    # Share of the document's tokens that are equal to the term.
    return doc.count(term) / len(doc)
[!question]- Define IDF
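import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    # Count the documents that contain the term at least once.
    containing = sum(1 for doc in corpus if term in doc)
    # Assumes the term occurs in at least one document; otherwise this divides by zero.
    return math.log10(len(corpus) / containing)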
[!question]- Define TF-IDF
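def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    # High when the term is frequent in this document but rare across the corpus.
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)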
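
As a worked illustration of the three definitions, here is a minimal self-contained sketch on a toy corpus. The corpus sentences and the whitespace tokenization are assumptions made up for this example, and the functions are restated so the block runs on its own.

```python
import math


def term_frequency(term: str, doc: list[str]) -> float:
    # Share of the document's tokens that are equal to the term.
    return doc.count(term) / len(doc)


def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    # Number of documents containing the term at least once.
    containing = sum(1 for doc in corpus if term in doc)
    # Assumes the term occurs somewhere in the corpus (containing > 0).
    return math.log10(len(corpus) / containing)


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus, tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
doc = corpus[0]

# Score every distinct word of the first document.
scores = {term: tf_idf(term, doc, corpus) for term in set(doc)}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:>4}  {score:.4f}")
```

In this corpus "the" appears in every document, so its IDF is $\log_{10}(3/3) = 0$ and its TF-IDF score is zero even though it is the most frequent word. "mat" appears only in the first document, so it gets the highest score there.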
[!question]- What is cosine similarity in the mathematical context?
- $\vec{p} \cdot \vec{q}=\left | \vec{p} \right |\left | \vec{q} \right |\cos\theta$
- $\cos\theta = \frac{\vec{p}\cdot\vec{q}}{\left | \vec{p} \right |\left | \vec{q} \right |}$
[!question]- How would you calculate the cosine similarity for two given documents?
- Split each document into words
- Build a vector for each document that holds the frequency of every word, using the same word order for both
- Apply the cosine formula to the two vectors
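
The three steps above can be sketched as follows. This is one possible implementation, not the only one: the function name `cosine_similarity`, the lowercase whitespace tokenization, and the choice to return 0.0 for an empty document are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    # Step 1: split each document into words (naive whitespace tokenization).
    words_a = doc_a.lower().split()
    words_b = doc_b.lower().split()

    # Step 2: word -> frequency maps act as sparse vectors over a shared vocabulary.
    freq_a = Counter(words_a)
    freq_b = Counter(words_b)

    # Step 3: apply cos(theta) = (p . q) / (|p| |q|).
    # Only words present in both documents contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))

    # Assumption: an empty document is treated as similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the cat sat"))
print(cosine_similarity("the cat", "the dog"))
print(cosine_similarity("red apple", "blue sky"))
```

Using `Counter` avoids building explicit dense vectors: a word missing from one document simply counts as zero there.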
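[!question]- What do the values 0, 1 and -1 mean in cosine similarity?
- They are the $\cos$ of the angle between the two vectors.
- 0: the vectors are perpendicular. For word-frequency vectors this means the documents share no words.
- 1: the angle is $0^\circ$, so the vectors point in the same direction and the documents are very similar.
- -1: the angle is $180^\circ$, so the vectors point in opposite directions. Word-frequency vectors have no negative components, so this value cannot occur for them.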
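
The three boundary values can be checked on small numeric vectors. This is a minimal sketch; the helper name `cosine` and the example vectors are chosen only for illustration.

```python
import math


def cosine(p: list[float], q: list[float]) -> float:
    # cos(theta) = (p . q) / (|p| |q|); assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


same_direction = cosine([1, 2], [2, 4])    # angle of 0 degrees, close to 1
perpendicular = cosine([1, 0], [0, 1])     # angle of 90 degrees, 0
opposite = cosine([1, 2], [-1, -2])        # angle of 180 degrees, close to -1

print(same_direction, perpendicular, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last decimal places, so compare them with a tolerance rather than with `==`.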