https://youtu.be/9vM4p9NN0Ts?si=H3hK7m7wrETW7N0R
Language Modeling
- LM: a probability distribution over sequences of tokens/words, $p(x_1, \dots, x_n)$
- Assigns a probability to a sentence occurring in natural language (everyday speech or text online).
- Ex: p(hello, world) = 0.02, p(cheese, ate, mouse) = 0.001
LMs are generative models: once we have a model of the data distribution, we can sample from it to generate new data.
Autoregressive (AR) language models (chain rule of probability):
$p(x_1, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots = \prod_i p(x_i \mid x_{1:i-1})$
- A model that predicts the next token given the past context (see the chain-rule sketch after these bullets)
- Downside: generation runs a for loop, producing one token per forward pass, so inference takes longer.
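A minimal sketch of the chain-rule factorization, assuming a hypothetical `next_token_probs(prefix)` function that stands in for any autoregressive model (it returns a distribution over the next token given the prefix); summing in log space avoids underflow for long sequences:

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """log p(x_1, ..., x_n) = sum_i log p(x_i | x_{1:i-1})."""
    log_p = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])  # distribution over the next token given x_{1:i-1}
        log_p += math.log(probs[token])       # add log p(x_i | x_{1:i-1})
    return log_p

# Toy uniform "model" over a 3-word vocabulary, just to make the sketch runnable.
vocab = ["the", "mouse", "ate"]
uniform_model = lambda prefix: {w: 1.0 / len(vocab) for w in vocab}

print(sequence_log_prob(["the", "mouse", "ate"], uniform_model))  # = 3 * log(1/3)
```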
Steps in the generation loop (see the full sketch after this list):
- Tokenize (map the text to a sequence of token IDs)
- Forward (pass the token sequence through the model architecture)
- Predict the probability of the next token (the forward pass yields a probability distribution over the next token)
- Sample (draw a new token from that distribution)
- Detokenize (map the new token ID back to text)
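A minimal sketch of these steps, assuming the Hugging Face `transformers` library and the public "gpt2" checkpoint (neither is named in the notes; any autoregressive LM exposes the same loop):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The mouse ate", return_tensors="pt").input_ids   # 1. tokenize -> token IDs

with torch.no_grad():
    for _ in range(20):                                       # generate 20 new tokens
        logits = model(input_ids).logits                      # 2. forward pass
        probs = torch.softmax(logits[0, -1], dim=-1)          # 3. distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)     # 4. sample a token ID
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))                         # 5. detokenize IDs back to text
```

Note the for loop: each new token requires a full forward pass, which is exactly the slowness mentioned above.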