Building a GPT From Scratch summary
Karpathy’s gpt_dev.ipynb, summarized by Gemini and reviewed by me.
Building a GPT From Scratch: A Character-Level Approach
Understanding the Core Concepts (Based on A. Karpathy’s NanoGPT)
Objective
- Goal: To understand the fundamental steps and components involved in building and training a simple Generative Pre-trained Transformer (GPT) model.
- Approach: Character-level text generation on the Tiny Shakespeare dataset.
- Focus: Key concepts: Tokenization, Embedding, Self-Attention, Transformer Blocks, Training.
The Task - Language Modeling
- Core Idea: Predict the next character in a sequence given the preceding characters.
- Example: Given "Hello", predict " ". Given "Hello ", predict "w".
- Why Characters? Simplifies the process, no complex tokenizers needed initially. Demonstrates core concepts effectively.
- Dataset: Tiny Shakespeare (~1 million characters).
Data Preparation - Step 1: Loading & Inspection
- Load the text data (`input.txt`).
- Inspect basic properties:
  - Total number of characters.
  - A sample of the text content.
- Identify the set of unique characters (the vocabulary).
  - `vocab_size`: Number of unique characters (e.g., 65 in the notebook).
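A minimal sketch of this step (it assumes the Tiny Shakespeare text has been downloaded locally as `input.txt`):

```python
# Load the raw text (assumes Tiny Shakespeare is saved as input.txt)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"total characters: {len(text)}")   # roughly 1 million characters
print(text[:200])                          # peek at the beginning of the text

# The vocabulary is simply the sorted set of unique characters
chars = sorted(set(text))
vocab_size = len(chars)
print(''.join(chars))
print(f"vocab_size: {vocab_size}")         # 65 in the notebook
```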
Data Preparation - Step 2: Tokenization
- Concept: Convert characters into numerical representations that the model can process.
- Mapping:
  - `stoi` (string-to-integer): Assign a unique integer to each character in the vocabulary.
  - `itos` (integer-to-string): The reverse mapping.
- Encoding/Decoding: Functions to convert text to sequences of integers and back.
- Entire Dataset: Convert the full text into a single long sequence of integers (a PyTorch tensor).
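A sketch of the character-level tokenizer, following the `stoi`/`itos` naming above and reusing `text`/`chars` from the previous step:

```python
import torch

# Character-level tokenizer: one integer per unique character
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> string

encode = lambda s: [stoi[c] for c in s]             # text -> list of ints
decode = lambda ids: ''.join(itos[i] for i in ids)  # list of ints -> text

print(encode("hii there"))
print(decode(encode("hii there")))

# Encode the whole dataset as one long tensor of token ids
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
```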
Data Preparation - Step 3: Creating Batches for Training
- Goal: Feed the model chunks of data efficiently.
- `block_size` (Context Length): How many preceding characters the model looks at to predict the next one (e.g., 8, then 32).
- Input (`x`) / Target (`y`):
  - For a sequence `[18, 47, 56, 57, 58, 1, 15, 47, 58]` (block_size=8):
    - `x` (input context): `[18, 47, 56, 57, 58, 1, 15, 47]`
    - `y` (target next char): `[47, 56, 57, 58, 1, 15, 47, 58]`
  - The model learns: when the input is `[18]`, the target is `47`; when the input is `[18, 47]`, the target is `56`, and so on.
- `batch_size`: Number of independent sequences processed in parallel for efficiency (e.g., 4, then 16/32).
- `get_batch` function: Randomly samples starting points to create batches of `x` and `y` (see the sketch below).
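A batching sketch along these lines; the 90/10 train/validation split and the exact hyperparameter values are assumptions:

```python
import torch

torch.manual_seed(1337)
block_size = 8   # context length
batch_size = 4   # independent sequences per batch

# Assumed 90/10 train/validation split of the encoded dataset
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Sample batch_size random chunks of length block_size (+1 for the shifted targets)."""
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))       # random start offsets
    x = torch.stack([d[i:i + block_size] for i in ix])           # inputs  (B, T)
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])   # targets (B, T), shifted by one
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([4, 8]) torch.Size([4, 8])
```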
A Simple Start: Bigram Language Model
- Idea: The simplest possible model. Predicts the next character based only on the immediately preceding character.
- Mechanism: Uses an Embedding Table.
  - Size: `vocab_size` x `vocab_size`.
  - Row `i` contains the model’s predicted scores (logits) for the next character when the input character is `i`.
- Limitation: Ignores all context beyond the last character. Output often looks random/nonsensical (as seen in the early `generate` example).
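A minimal bigram sketch in this spirit (the reshaping is needed because `F.cross_entropy` expects logits of shape `(N, C)`):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for the next character when the current character is i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)          # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLanguageModel(vocab_size)
logits, loss = model(xb, yb)
print(logits.shape, loss.item())   # expect roughly -ln(1/65) ≈ 4.17 at initialization
```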
The Key Idea: Self-Attention
- Problem: How can tokens (characters) aggregate information from earlier tokens in the sequence in a data-dependent way?
- Intuition: For a given token, we want it to “look” at previous tokens and decide which ones are most relevant for predicting the next token. It then aggregates information from those relevant tokens.
- Mechanism: Weighted aggregation based on token similarity.
Self-Attention: Query, Key, Value
- For each token (represented as a vector after embedding):
  - Query (Q): What I’m looking for.
  - Key (K): What information I contain.
  - Value (V): The actual information I’ll provide if attended to.
- Process:
  - Calculate attention scores: How much does my Query (Q) match each previous Key (K)? (`Q @ K^T`)
  - Scale the scores (divide by `sqrt(head_size)`).
  - Mask: Prevent tokens from attending to future tokens (use a `tril` mask; set future scores to `float('-inf')`). Crucial for autoregressive generation.
  - Softmax: Convert scores into probabilities (weights) that sum to 1.
  - Aggregate values: Compute a weighted sum of the Values (V) using the softmax weights (`softmax(scores) @ V`).
- Result: An output vector for each token that incorporates information from relevant preceding tokens.
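A sketch of a single causal self-attention head implementing these steps (the module layout mirrors the notebook’s approach, but the exact argument names are assumptions):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                               # (B, T, head_size)
        q = self.query(x)                                             # (B, T, head_size)
        # Attention scores, scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # no peeking at the future
        wei = F.softmax(wei, dim=-1)                                  # each row sums to 1
        v = self.value(x)                                             # (B, T, head_size)
        return wei @ v                                                # (B, T, head_size)
```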
Enhancements: Multi-Head Attention & Scaling
- Multi-Head Attention:
- Run the self-attention mechanism multiple times in parallel (“heads”) with different Q, K, V projections.
- Allows the model to focus on different types of relationships/information simultaneously.
- Concatenate results from all heads and project back to the original dimension.
- Scaled Attention: Dividing the scores by `sqrt(head_size)` prevents them from becoming too large, keeping softmax from producing overly sharp distributions and aiding training stability.
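A multi-head wrapper sketch, assuming the `Head` class above and that `n_head * head_size == n_embd`:

```python
class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated and projected back."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)   # project concatenated heads back to n_embd

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)
        return self.proj(out)
```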
The Transformer Block
- The core building block of the GPT model. Combines:
  - Communication (Self-Attention): Tokens gather information (`MultiHeadAttention`), preceded by Layer Normalization.
  - Computation (Feed-Forward): Processes the aggregated information independently at each position (`FeedForward` network, a simple MLP), also preceded by Layer Normalization.
- Residual Connections: Add the input `x` to the output of both the attention and feed-forward sub-layers (`output = x + SubLayer(LayerNorm(x))`, the pre-norm formulation). Helps with training deep networks (gradient flow).
- Layer Normalization: Normalizes features across the embedding dimension for each token independently. Stabilizes training.
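A `Block` sketch with pre-norm residual connections, assuming the `MultiHeadAttention` above and the usual 4x expansion inside the feed-forward layer:

```python
class FeedForward(nn.Module):
    """Per-position MLP (computation), applied identically at every time step."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication (attention) then computation (MLP)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual around attention (pre-norm)
        x = x + self.ffwd(self.ln2(x))    # residual around feed-forward (pre-norm)
        return x
```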
Positional Encoding
- Problem: Self-attention itself doesn’t know the order of tokens (it just sees a set of vectors). “A B C” looks the same as “C B A” to attention alone.
- Solution: Add information about the token’s position in the sequence.
- Method (in the notebook): Learnable positional embeddings. Create an embedding table (`position_embedding_table`) of size `block_size` x `embedding_dim`. Add the corresponding position embedding to the token embedding.
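Continuing the earlier sketches, the embedding sum might look like this (the value of `n_embd` is an assumption):

```python
n_embd = 32   # embedding dimension (assumed value)
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

idx = xb                                             # (B, T) token indices
B, T = idx.shape
tok_emb = token_embedding_table(idx)                 # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(T))  # (T, n_embd)
x = tok_emb + pos_emb                                # broadcast over batch -> (B, T, n_embd)
```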
Full GPT Model Architecture
- Input: Sequence of token indices, shape `(B, T)`.
- 1. Embeddings:
  - Token embeddings (`token_embedding_table`): `(B, T) -> (B, T, C)`
  - Positional embeddings (`position_embedding_table`): `(T) -> (T, C)`
  - Sum them: `x = tok_emb + pos_emb`
- 2. Transformer Blocks: Pass `x` through multiple `Block` layers (`n_layer` of them). Each block contains Multi-Head Attention and Feed-Forward layers with residuals and LayerNorm. `x = blocks(x)`
- 3. Final Layer Norm: `x = ln_f(x)`
- 4. Linear Head: Project the final embeddings to vocabulary size (`lm_head`). `logits = lm_head(x)` -> `(B, T, vocab_size)`
- Output: Logits (raw scores) for the next-token prediction at each time step.
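A sketch of the forward pass tying these four steps together, reusing the `Block` sketch above (the class name `GPTLanguageModel` and constructor signature are assumptions; the notebook builds its model class up incrementally):

```python
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size):
        super().__init__()
        self.block_size = block_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)               # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)   # project to vocabulary logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                                     # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))   # (T, C)
        x = tok_emb + pos_emb              # 1. embeddings
        x = self.blocks(x)                 # 2. transformer blocks
        x = self.ln_f(x)                   # 3. final layer norm
        logits = self.lm_head(x)           # 4. linear head -> (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss
```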
Training Loop
- Objective: Adjust model parameters (weights) to minimize prediction error.
- Loss Function: Cross-Entropy Loss (compares the predicted logits against the actual targets `y`).
- Steps (repeated `max_iters` times):
  - `get_batch`: Sample a batch of inputs (`xb`) and targets (`yb`).
  - Forward pass: Calculate `logits` and `loss` using the model (`logits, loss = model(xb, yb)`).
  - `optimizer.zero_grad()`: Clear old gradients.
  - `loss.backward()`: Compute gradients (how much each parameter contributed to the loss).
  - `optimizer.step()`: Update parameters using an optimizer (e.g., AdamW) based on the gradients and learning rate.
- Evaluation: Periodically check the loss on a validation set (`estimate_loss`) to monitor overfitting.
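A minimal sketch of this loop, reusing `get_batch` and the model sketch above (the hyperparameter values and this `estimate_loss` helper are assumptions in the spirit of the notebook):

```python
@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    """Average the loss over several batches from each split to reduce noise."""
    model.eval()
    out = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

model = GPTLanguageModel(vocab_size, n_embd=64, n_head=4, n_layer=4, block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for it in range(5000):                        # max_iters
    if it % 500 == 0:                         # eval_interval
        print(it, estimate_loss(model))
    xb, yb = get_batch('train')               # sample a batch
    logits, loss = model(xb, yb)              # forward pass
    optimizer.zero_grad(set_to_none=True)     # clear old gradients
    loss.backward()                           # compute gradients
    optimizer.step()                          # AdamW parameter update
```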
Note: Understanding the Loss Function: Cross-Entropy
1. The Goal: Our model needs to predict the correct next character out of all possible characters in the vocabulary (e.g., 65 options in Tiny Shakespeare).
2. Model’s Prediction:
   - For a given context, the model outputs logits: raw scores for each possible next character.
   - Example logits: `{'a': 0.1, 'b': -2.0, 'c': 1.5, ...}`
   - These logits are converted into probabilities using the Softmax function. Softmax makes sure the probabilities are between 0 and 1 and that they all add up to 1.
   - Example probabilities: `{'a': 0.20, 'b': 0.02, 'c': 0.78, ...}`
   - This probability distribution represents the model’s belief about what the next character will be.
3. The Target (Ground Truth):
   - We know the actual next character from the training data.
   - Think of the target as a “perfect” probability distribution: 100% probability for the correct character and 0% for all others.
   - Example target (if ‘c’ is the correct next char): `{'a': 0, 'b': 0, 'c': 1.0, ...}`
4. What is Cross-Entropy Loss?
   - It measures the difference (or “distance”) between the model’s predicted probability distribution and the target distribution.
   - Essentially, it asks: “How well do the model’s predicted probabilities match the actual outcome?”
5. How it Works (Intuition):
   - Cross-Entropy focuses heavily on the probability the model assigned to the correct character.
   - For this single-correct-answer task it simplifies to: `Loss = -log(predicted_probability_of_the_correct_character)`
   - Why `-log`?
     - If the prediction for the correct character is high (e.g., 0.9), `log(0.9)` is slightly negative, so `-log(0.9)` is a small positive loss. (Good!)
     - If the prediction for the correct character is low (e.g., 0.01), `log(0.01)` is very negative, so `-log(0.01)` is a large positive loss. (Bad!)
   - It penalizes the model heavily for being wrong, especially when it was confident in a wrong answer (i.e., assigned low probability to the correct one).
6. Training Goal: Minimize the Cross-Entropy loss. This forces the model to learn parameters that assign higher probabilities to the correct next characters across the training data.
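A tiny numeric check of this intuition, comparing the manual `-log(p_correct)` against PyTorch’s `F.cross_entropy` (the three-character vocabulary is made up for illustration):

```python
import torch
from torch.nn import functional as F

# Toy vocabulary of 3 characters; the model emits one logit per character
logits = torch.tensor([[0.1, -2.0, 1.5]])       # scores for 'a', 'b', 'c'
target = torch.tensor([2])                       # the correct next character is 'c'

probs = F.softmax(logits, dim=-1)
print(probs)                                     # ≈ [0.19, 0.02, 0.78]

manual_loss = -torch.log(probs[0, target])       # -log(probability assigned to 'c')
library_loss = F.cross_entropy(logits, target)   # same value, computed directly from logits
print(manual_loss.item(), library_loss.item())
```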
Generation (Inference)
- Goal: Produce new text.
- Process (autoregressive), see the sketch below:
  1. Start with an initial context (e.g., a single token such as a newline, via `torch.zeros`).
  2. Feed the current context (up to the last `block_size` tokens) into the model to get `logits` for the next token.
  3. Focus on the `logits` for the very last time step.
  4. Apply `softmax` to get probabilities.
  5. Sample the next token index from these probabilities (`torch.multinomial`).
  6. Append the sampled token index to the context.
  7. Repeat steps 2-6 for `max_new_tokens` iterations.
  8. `decode` the final sequence of indices back into text.
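A sketch of this loop, written here as a standalone function for brevity (in the notebook it lives as a `generate` method on the model); it assumes the model, `decode`, and imports from the earlier sketches:

```python
@torch.no_grad()
def generate(model, idx, max_new_tokens):
    """Autoregressively extend idx of shape (B, T) by max_new_tokens sampled tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]                # crop to the last block_size tokens
        logits, _ = model(idx_cond)                          # (B, T, vocab_size)
        logits = logits[:, -1, :]                            # keep only the last time step
        probs = F.softmax(logits, dim=-1)                    # logits -> probabilities
        idx_next = torch.multinomial(probs, num_samples=1)   # sample one token id
        idx = torch.cat((idx, idx_next), dim=1)              # append to the running context
    return idx

context = torch.zeros((1, 1), dtype=torch.long)   # start from token 0 (the newline character)
print(decode(generate(model, context, max_new_tokens=300)[0].tolist()))
```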
Key Takeaways & Next Steps
- Recap: Built a character-level GPT using core components: Embeddings (Token+Position), Self-Attention (Q,K,V), Transformer Blocks (Attention + FeedForward + Residuals + LayerNorm), and a standard training loop.
- Self-Attention: The key mechanism allowing tokens to communicate across the sequence.
- Transformer Blocks: The repeatable unit combining communication and computation.
- Next Steps: Scaling up (more data, larger `n_embd`, `n_head`, `n_layer`), using sub-word tokenization (like BPE), exploring different architectures, fine-tuning on specific tasks.