
Expert NLP Talk

Word Embedding to transformers

  • The central problem is sequence-to-sequence learning
    • How to learn and represent sequences

History

  • Markov models
    • Only need the last few elements to predict the next element in a sequence
    • Don’t need the whole history
  • Shannon’s theory
  • Alan Turing
  • Georgetown experiment
  • John McCarthy coins the term AI
  • CNN
  • RNN
  • Transformers

Language Model

  • Probabilistic model to predict the next word given some history
  • Also gives the probability that a whole sequence occurs
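
A minimal sketch, assuming a toy corpus and add-one smoothing (both hypothetical), of a bigram language model in Python. It shows the two roles above: predicting the next word given history, and scoring a whole sequence via the chain rule.

```python
from collections import Counter

# Hypothetical toy corpus; a real LM would be trained on far more text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, history):
    """P(word | previous word) with add-one smoothing (assumed)."""
    prev = history[-1]
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def p_sequence(words):
    """Chain rule: P(w1..wn) ~= product of P(w_i | w_{i-1})."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_next(word, [prev])
    return prob

print(p_next("sat", ["cat"]))             # next-word probability
print(p_sequence("the cat sat".split()))  # probability of a sequence
```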

Neural LM

  • CNNs and MLPs are not suitable for learning sequences
    • They can’t capture dependencies across a sequence
  • RNNs are used instead
    • But they suffer from the exploding and vanishing gradient problems
  • LSTM and GRU were introduced as solutions
    • Selective reading of history
    • Can be thought of as gates or filters
    • LSTM - 3 gates (input, forget, output)
    • GRU - 2 gates (update, reset)
    • Selective read, write and forget
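
A minimal numpy sketch of a single LSTM step, with untrained random weights and biases omitted (both assumptions for brevity), just to show the three gates acting as selective filters over the history:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden, inputs = 4, 3               # hypothetical sizes
rng = np.random.default_rng(0)
# One weight matrix per gate plus the candidate cell state (untrained placeholders).
W = {k: rng.normal(size=(hidden, hidden + inputs)) for k in ("f", "i", "o", "c")}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z)          # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z)          # input gate: what new information to write
    o = sigmoid(W["o"] @ z)          # output gate: what to read out
    c_tilde = np.tanh(W["c"] @ z)    # candidate cell state
    c = f * c_prev + i * c_tilde     # selective forget + selective write
    h = o * np.tanh(c)               # selective read
    return h, c

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):   # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c)
print(h)
```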

Encoder-Decoder Model

  • Input -> intermediate representation (IR) -> Output: the encoder maps the input to an IR, and the decoder generates the output from it
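
A minimal sketch of the Input -> IR -> Output idea, assuming GRU-based encoder and decoder in PyTorch with hypothetical vocabulary and layer sizes; the encoder's final hidden state plays the role of the intermediate representation:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder compresses the input into an IR; decoder generates output from it."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):  # hypothetical sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encode: the final hidden state is the intermediate representation (IR).
        _, ir = self.encoder(self.embed(src_tokens))
        # Decode: condition the decoder on the IR and produce output-token scores.
        dec_out, _ = self.decoder(self.embed(tgt_tokens), ir)
        return self.project(dec_out)

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sequences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # batch of 2 target sequences, length 5
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```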

Attention

  • Captures context
  • Uses a context vector to focus more on the relevant parts of the input
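
A minimal numpy sketch of attention with dot-product scoring (an assumption; the talk did not name a scoring function): encoder states are scored against the current decoder state, softmaxed into weights, and summed into a context vector that focuses on the most relevant input positions.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 input positions, hidden size 8 (hypothetical)
decoder_state = rng.normal(size=(8,))      # current decoder hidden state

# Score each input position against the decoder state (dot-product scoring assumed).
scores = encoder_states @ decoder_state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over positions

# Context vector: a weighted sum that focuses on the most relevant positions.
context = weights @ encoder_states
print(weights.round(2), context.shape)
```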

Transformers

  • Self-attention instead of global attention
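
A minimal numpy sketch of single-head scaled dot-product self-attention with untrained random projections: queries, keys and values all come from the same sequence, so every position attends to every other position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                      # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))       # one sequence of embeddings

# Project the same sequence into queries, keys, and values (untrained placeholders).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: each position attends to all positions.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V
print(out.shape)   # (5, 16): one contextualised vector per position
```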

Encoder Stack

  • Convert each word to an embedding
  • Add positional encoding
    • Use sin() and cos() to do this
    • Each word is represented as a vector of length 512
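
A minimal numpy sketch of the sinusoidal positional encoding, using d_model = 512 as in the note above; the encoding is added to the word embeddings before they enter the encoder stack.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.normal(size=(10, 512))    # 10 hypothetical word vectors
x = embeddings + positional_encoding(10)         # input to the encoder stack
print(x.shape)   # (10, 512)
```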