TF-IDF

Notes

TF-IDF

Notebook

Questions

[!question]- Define TF

def term_frequency(term: str, doc: list[str]) -> float:
  # Fraction of the tokens in `doc` that are equal to `term`
  return doc.count(term) / len(doc)

[!question]- Define IDF

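import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
  # Count the documents that contain the term at least once
  doc_count = 0
  for doc in corpus:
    if term in doc:
      doc_count += 1
  if doc_count == 0:
    return 0.0  # convention chosen here: term absent from corpus, avoids division by zero
  return math.log10(len(corpus) / doc_count)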

[!question]- Define TF-IDF

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
  return term_frequency(term, doc) * inverse_document_frequency(term, corpus)
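
As a quick sanity check of the definition, here is a minimal, self-contained sketch that scores every term of one document at once. The helper name `tf_idf_scores` and the toy corpus are made up for illustration, and the sketch assumes the document is itself part of the corpus (so every term appears in at least one document).

```python
import math
from collections import Counter

def tf_idf_scores(doc: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Return a {term: tf-idf} mapping for every distinct term in `doc`."""
    counts = Counter(doc)
    scores = {}
    for term, count in counts.items():
        # Document frequency: number of documents containing the term.
        # Assumption: `doc` is in `corpus`, so this is at least 1.
        df = sum(1 for d in corpus if term in d)
        tf = count / len(doc)
        idf = math.log10(len(corpus) / df)
        scores[term] = tf * idf
    return scores

# Hypothetical toy corpus: three tokenised documents
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]
scores = tf_idf_scores(corpus[1], corpus)
print(scores["the"])  # 0.0, because "the" occurs in every document
```

A term that occurs in every document gets an IDF of $\log_{10}(1) = 0$, so its TF-IDF is 0 no matter how often it appears, while a rarer term such as "dog" receives a positive score.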

[!question]- What is cosine similarity in the mathematical context?

  • $\vec{p} \cdot \vec{q}=\left | \vec{p} \right |\left | \vec{q} \right |\cos\theta$
  • $\cos\theta = \frac{\vec{p}\cdot\vec{q}}{\left | \vec{p} \right |\left | \vec{q} \right |}$
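
Written out in components, the same formula may be easier to apply. The component names $p_i$ and $q_i$ are introduced here for illustration, assuming $\vec{p}=(p_1,\dots,p_n)$ and $\vec{q}=(q_1,\dots,q_n)$:

$$\cos\theta = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$$

For example, with $\vec{p}=(1,1,0)$ and $\vec{q}=(1,0,1)$ the dot product is $1$ and both lengths are $\sqrt{2}$, so $\cos\theta = \frac{1}{\sqrt{2}\sqrt{2}} = \frac{1}{2}$, i.e. $\theta = 60^\circ$.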

[!question]- How would you calculate the cosine similarity for two given documents?

  • Split each document into words
  • Represent each document as a vector of word frequencies over a shared vocabulary
  • Apply the cosine formula to the two vectors
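
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: it assumes naive lowercase whitespace tokenisation, uses raw word counts as vector components, and returns 0.0 for an empty document (a convention chosen here to avoid dividing by zero).

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    # Step 1: split each document into words (assumed: lowercase + whitespace split)
    # Step 2: build word-frequency vectors, stored sparsely as Counters
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())

    # Step 3: apply the formula. Only words present in both documents
    # contribute to the dot product; all other products are zero.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat", "the dog"))
```

Here "the cat" and "the dog" share one of their two words, so the result is approximately 0.5; documents with no words in common give 0.0.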

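[!question]- What are the meanings of the values 0, 1 and -1 in cosine similarity?

  • They correspond to the $\cos$ of the angle between the vectors
  • A value of 0 means the vectors are perpendicular, i.e. the documents share no words
  • A value of 1 means the angle is $0^\circ$, hence the documents are very similar
  • A value of -1 means the angle is $180^\circ$, so the vectors point in opposite directions (this cannot occur with raw word counts, which are never negative)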