
Expert NLP Talk

Word Embedding to transformers

  • The central problem is sequence-to-sequence learning
    • How to learn and represent sequences

History

  • Markov models
    • Only need the last few elements to predict the next element in a sequence
    • Don’t need the whole history
  • Shannon’s theory
  • Alan Turing
  • Georgetown experiment
  • John McCarthy coins the term AI
  • CNN
  • RNN
  • Transformers

Language Model

  • Probabilistic model to predict the next word given some history
  • Also gives the probability that a whole sequence occurs
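
A minimal sketch, assuming a toy corpus and add-one smoothing (both hypothetical), of a bigram language model in Python. It shows the two roles above: predicting the next word given history, and scoring a whole sequence via the chain rule.

```python
from collections import Counter

# Hypothetical toy corpus; a real LM would be trained on far more text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, history):
    """P(word | previous word) with add-one smoothing (assumed)."""
    prev = history[-1]
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def p_sequence(words):
    """Chain rule: P(w1..wn) ~= product of P(w_i | w_{i-1})."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_next(word, [prev])
    return prob

print(p_next("sat", ["cat"]))             # next-word probability
print(p_sequence("the cat sat".split()))  # probability of a sequence
```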

Neural LM

  • CNNs and MLPs are not suitable for learning sequences
    • They can’t capture dependencies across a sequence
  • RNNs are used instead
    • But they suffer from the exploding and vanishing gradient problems
  • LSTM and GRU were introduced as solutions
    • Selective reading of history
    • Can be thought of as gates or filters
    • LSTM - 3 gates (input, forget, output)
    • GRU - 2 gates (update, reset)
    • Selective read, write and forget
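
A minimal numpy sketch of a single LSTM step, with untrained random weights and biases omitted (both assumptions for brevity), just to show the three gates acting as selective filters over the history:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden, inputs = 4, 3               # hypothetical sizes
rng = np.random.default_rng(0)
# One weight matrix per gate plus the candidate cell state (untrained placeholders).
W = {k: rng.normal(size=(hidden, hidden + inputs)) for k in ("f", "i", "o", "c")}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z)          # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z)          # input gate: what new information to write
    o = sigmoid(W["o"] @ z)          # output gate: what to read out
    c_tilde = np.tanh(W["c"] @ z)    # candidate cell state
    c = f * c_prev + i * c_tilde     # selective forget + selective write
    h = o * np.tanh(c)               # selective read
    return h, c

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):   # a toy sequence of 5 steps
    h, c = lstm_step(x, h, c)
print(h)
```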

Encoder-Decoder Model

  • Input -> intermediate representation (IR) -> Output: the encoder maps the input to an IR, and the decoder generates the output from it
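
A minimal sketch of the Input -> IR -> Output idea, assuming GRU-based encoder and decoder in PyTorch with hypothetical vocabulary and layer sizes; the encoder's final hidden state plays the role of the intermediate representation:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder compresses the input into an IR; decoder generates output from it."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):  # hypothetical sizes
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encode: the final hidden state is the intermediate representation (IR).
        _, ir = self.encoder(self.embed(src_tokens))
        # Decode: condition the decoder on the IR and produce output-token scores.
        dec_out, _ = self.decoder(self.embed(tgt_tokens), ir)
        return self.project(dec_out)

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sequences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # batch of 2 target sequences, length 5
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```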

Attention

  • Captures context
  • Uses a context vector to focus more on the relevant parts of the input
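
A minimal numpy sketch of attention with dot-product scoring (an assumption; the talk did not name a scoring function): encoder states are scored against the current decoder state, softmaxed into weights, and summed into a context vector that focuses on the most relevant input positions.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 input positions, hidden size 8 (hypothetical)
decoder_state = rng.normal(size=(8,))      # current decoder hidden state

# Score each input position against the decoder state (dot-product scoring assumed).
scores = encoder_states @ decoder_state
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over positions

# Context vector: a weighted sum that focuses on the most relevant positions.
context = weights @ encoder_states
print(weights.round(2), context.shape)
```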

Transformers

  • Self-attention instead of global attention
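
A minimal numpy sketch of single-head scaled dot-product self-attention with untrained random projections: queries, keys and values all come from the same sequence, so every position attends to every other position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                      # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))       # one sequence of embeddings

# Project the same sequence into queries, keys, and values (untrained placeholders).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: each position attends to all positions.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ V
print(out.shape)   # (5, 16): one contextualised vector per position
```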

Encoder Stack

  • Convert each word to an embedding
  • Add positional encoding
    • Use sin() and cos() to do this
    • Each word is represented as a vector of length 512
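
A minimal numpy sketch of the sinusoidal positional encoding, using d_model = 512 as in the note above; the encoding is added to the word embeddings before they enter the encoder stack.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.normal(size=(10, 512))    # 10 hypothetical word vectors
x = embeddings + positional_encoding(10)         # input to the encoder stack
print(x.shape)   # (10, 512)
```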