https://youtu.be/9vM4p9NN0Ts?si=H3hK7m7wrETW7N0R
Language Modeling
- LM: a probability distribution over sequences of tokens/words, $p(x_1, \dots, x_n)$
- Assigns a probability to a sentence occurring in natural language (everyday speech or text online).
- Ex: p(hello, world) = 0.02, p(cheese, ate, mouse) = 0.001
LMs are generative models: once we have a model of the data distribution, we can sample from it to generate new data.
Autoregressive (AR) language models (chain rule of probability):
$p(x_1, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots = \prod_i p(x_i \mid x_{1:i-1})$
- A model that predicts the next token given the past context (see the chain-rule sketch after these bullets)
- Downside: generation runs a for loop, producing one token per forward pass, so inference takes longer.
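A minimal sketch of the chain-rule factorization, assuming a hypothetical `next_token_probs(prefix)` function that stands in for any autoregressive model (it returns a distribution over the next token given the prefix); summing in log space avoids underflow for long sequences:

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """log p(x_1, ..., x_n) = sum_i log p(x_i | x_{1:i-1})."""
    log_p = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])  # distribution over the next token given x_{1:i-1}
        log_p += math.log(probs[token])       # add log p(x_i | x_{1:i-1})
    return log_p

# Toy uniform "model" over a 3-word vocabulary, just to make the sketch runnable.
vocab = ["the", "mouse", "ate"]
uniform_model = lambda prefix: {w: 1.0 / len(vocab) for w in vocab}

print(sequence_log_prob(["the", "mouse", "ate"], uniform_model))  # = 3 * log(1/3)
```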
Steps in the generation loop (see the full sketch after this list):
- Tokenize (map the text to a sequence of token IDs)
- Forward (pass the token sequence through the model architecture)
- Predict the probability of the next token (the forward pass yields a probability distribution over the next token)
- Sample (draw a new token from that distribution)
- Detokenize (map the new token ID back to text)
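A minimal sketch of these steps, assuming the Hugging Face `transformers` library and the public "gpt2" checkpoint (neither is named in the notes; any autoregressive LM exposes the same loop):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The mouse ate", return_tensors="pt").input_ids   # 1. tokenize -> token IDs

with torch.no_grad():
    for _ in range(20):                                       # generate 20 new tokens
        logits = model(input_ids).logits                      # 2. forward pass
        probs = torch.softmax(logits[0, -1], dim=-1)          # 3. distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)     # 4. sample a token ID
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))                         # 5. detokenize IDs back to text
```

Note the for loop: each new token requires a full forward pass, which is exactly the slowness mentioned above.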