This project aims to provide a comprehensive understanding of Large Language Models (LLMs) and implement one from scratch. It serves as both a detailed technical reference and a practical implementation guide. The repository contains:
- Detailed Documentation: A thorough explanation of transformer architecture, tokenization, embeddings, attention mechanisms, and training methodologies
- Implementation Code: From-scratch implementations of all core LLM components
- Visualization Tools: Interactive web interface for exploring model internals and understanding how transformers process text
- Training Pipeline: Tools for training custom models on specific datasets
- Benchmarking Suite: Standardized evaluation frameworks to assess model capabilities
This document explains the theoretical foundations of LLMs in detail, with diagrams illustrating the architecture and data flow. For implementation details, see the corresponding code in the src directory. If you encounter unfamiliar terms, please refer to our Glossary for definitions.
graph TD
classDef theory fill:#f9d5e5,stroke:#333,stroke-width:1px
classDef implementation fill:#e06377,stroke:#333,stroke-width:1px
classDef evaluation fill:#c83349,stroke:#333,stroke-width:1px
Theory[Theoretical Understanding] --> Implement[Implementation]
Implement --> Train[Training]
Train --> Eval[Evaluation]
Eval --> Deploy[Deployment]
subgraph "Project Components"
Theory -.- Doc[Documentation]
Theory -.- Visual[Visualizations]
Implement -.- Code[Source Code]
Implement -.- Pipe[Pipelines]
Train -.- Data[Datasets]
Train -.- Infra[Infrastructure]
Eval -.- Bench[Benchmarks]
Eval -.- Metrics[Metrics]
Deploy -.- API[API Layer]
Deploy -.- UI[User Interface]
end
class Theory,Doc,Visual theory
class Implement,Code,Pipe,Train,Data,Infra implementation
class Eval,Bench,Metrics,Deploy,API,UI evaluation
In 2017, researchers at Google Brain published a groundbreaking paper titled "Attention is All You Need," which introduced the Transformer architecture. Before this, most sequence-to-sequence models relied on recurrent neural networks (RNNs) like LSTMs or GRUs, which processed text sequentially, word by word. This sequential nature created bottlenecks:
- Sequential processing made parallelization difficult
- Long-range dependencies were hard to capture due to the vanishing gradient problem
- Training was slow and computationally expensive
The key innovation of the paper was replacing recurrence entirely with a mechanism called "attention," allowing the model to directly connect any words in a sequence, regardless of their distance from each other. This enabled parallel processing and better handling of long-range dependencies.
The original Transformer is an encoder-decoder architecture designed for sequence-to-sequence tasks like translation. Here's a detailed breakdown:
graph TD
classDef embedding fill:#f9d5e5,stroke:#333,stroke-width:1px
classDef attention fill:#eeac99,stroke:#333,stroke-width:1px
classDef feedforward fill:#e06377,stroke:#333,stroke-width:1px
classDef normalization fill:#c83349,stroke:#333,stroke-width:1px
Input[Input Sequence] --> TokenEmbed[Token Embeddings]
TokenEmbed --> PosEnc[Positional Encoding]
subgraph "Encoder Stack (Nx)"
PosEnc --> EncStack[Encoder Stack]
subgraph "Single Encoder Block"
EncIn[Input] --> EncMHA[Multi-Head Attention]
EncMHA --> EncAddNorm1[Add & Normalize]
EncAddNorm1 --> EncFF[Feed Forward Network]
EncFF --> EncAddNorm2[Add & Normalize]
EncAddNorm2 --> EncOut[Output]
end
end
subgraph "Decoder Stack (Nx)"
DecoderInput[Decoder Input] --> DecStack[Decoder Stack]
EncOut --> DecStack
subgraph "Single Decoder Block"
DecIn[Input] --> DecMSA[Masked Multi-Head Attention]
DecMSA --> DecAddNorm1[Add & Normalize]
DecAddNorm1 --> DecCrossAttn[Cross-Attention with Encoder]
DecCrossAttn --> DecAddNorm2[Add & Normalize]
DecAddNorm2 --> DecFF[Feed Forward Network]
DecFF --> DecAddNorm3[Add & Normalize]
DecAddNorm3 --> DecOut[Output]
end
end
DecOut --> Linear[Linear Layer]
Linear --> Softmax[Softmax]
Softmax --> Output[Output Probabilities]
class TokenEmbed,PosEnc embedding
class EncMHA,DecMSA,DecCrossAttn attention
class EncFF,DecFF feedforward
class EncAddNorm1,EncAddNorm2,DecAddNorm1,DecAddNorm2,DecAddNorm3 normalization
Before any sequence enters the Transformer architecture, it undergoes three key processes:
- Tokenization: Breaking text into tokens (more on this in section 2)
- Token Embedding: Converting tokens to vectors (dimension d_model, typically 512-1024)
- Positional Encoding: Adding positional information
The positional encoding is crucial because, unlike RNNs, Transformers process all tokens simultaneously, losing the natural order of the sequence. Positional encoding adds this information back:
graph LR
subgraph "Positional Encoding Formula"
PE["PE(pos, 2i) = sin(pos/10000^(2i/d_model))"]
PE2["PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))"]
end
Where:
- pos is the position of the token in the sequence
- i is the dimension within the embedding vector
- d_model is the embedding dimension (e.g., 512)
This creates a unique pattern for each position that the model can learn to interpret.
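As a concrete illustration, here is a short PyTorch sketch that builds this encoding table; the function name and tensor shapes are our own choices for the example, not part of any reference implementation.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(max_len).unsqueeze(1)                         # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```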
Each encoder block contains two main sublayers:
This is the heart of the Transformer. Attention allows the model to focus on relevant parts of the input sequence when producing an output.
graph TD
classDef process fill:#f9d5e5,stroke:#333,stroke-width:1px
classDef calculation fill:#eeac99,stroke:#333,stroke-width:1px
Input[Input: Sequence of Vectors] --> QKV[Linear Projections to Q, K, V]
subgraph "Multi-Head Attention"
QKV --> Head1[Attention Head 1]
QKV --> Head2[Attention Head 2]
QKV --> HeadDots["..."]
QKV --> HeadN[Attention Head N]
subgraph "Single Attention Head"
Q[Query Matrix Q] --> QK[Matrix Multiplication Q×K^T]
K[Key Matrix K] --> QK
QK --> Scale[Scale by 1/√dk]
Scale --> Mask[Optional Masking]
Mask --> Softmax[Softmax]
Softmax --> SoftV[Matrix Multiplication with V]
V[Value Matrix V] --> SoftV
SoftV --> HeadOutput[Head Output]
end
Head1 --> Concat[Concatenate Heads]
Head2 --> Concat
HeadDots --> Concat
HeadN --> Concat
Concat --> Linear[Final Linear Projection]
end
Linear --> Output[Output: Contextualized Vectors]
class QKV,Head1,Head2,HeadN,Concat,Linear process
class Q,K,V,QK,Scale,Mask,Softmax,SoftV calculation
The attention mechanism is defined by the formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Where:
- Q (Query): Represents what we're looking for
- K (Key): Represents what we have (potential matches)
- V (Value): Represents what we want to retrieve
- √d_k: Scaling factor (the square root of the key dimension) that prevents vanishing gradients
In a self-attention layer, Q, K, and V all come from the same source (the previous layer).
The "multi-head" part means running multiple attention operations in parallel, with each head potentially focusing on different aspects of the relationship between tokens. This is analogous to having multiple "representation subspaces."
After attention, each position is processed by a simple feed-forward network:

FFN(x) = max(0, xW1 + b1)W2 + b2

This is a two-layer neural network with a ReLU activation in between, applied identically to each position.
Both sublayers employ:
- Residual connection: Adding the input to the sublayer output (helps with gradient flow)
- Layer normalization: Normalizing the outputs to have mean 0 and variance 1 (stabilizes training)
The complete formula for each sublayer is:

output = LayerNorm(x + Sublayer(x))
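A minimal sketch of the position-wise FFN and the residual-plus-normalization wrapper, following the post-LN arrangement of the original paper (module names and defaults are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two linear layers with a ReLU in between."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class ResidualNorm(nn.Module):
    """Post-LN sublayer wrapper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))

x = torch.randn(2, 10, 512)
block = ResidualNorm(512)
y = block(x, FeedForward(512, 2048))   # FFN sublayer with residual connection + normalization
print(y.shape)                         # torch.Size([2, 10, 512])
```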
The decoder has three sublayers:
- Masked Multi-Head Attention: Like the encoder's attention, but with masking to prevent attending to future tokens (critical for autoregressive generation)
- Cross-Attention: Attends to the encoder's output, allowing the decoder to focus on relevant parts of the input when generating each output token
- Feed-Forward Network: Identical to the encoder's FFN
The masking in the first sublayer is crucial for training. Since the decoder predicts the next token based on previous ones, we must prevent it from "cheating" by looking at future tokens. This is done by setting future positions to negative infinity before the softmax, effectively zeroing them out.
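For example, the causal mask can be built from a lower-triangular matrix; this is a generic sketch of the idea rather than the repository's exact code:

```python
import torch

seq_len = 5
# 1 where attention is allowed (current and past positions), 0 for future positions
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)

# Applied to attention scores: future positions get -inf, so softmax assigns them ~0 weight
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights[0])   # the first token can only attend to itself
```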
The original Transformer architecture has evolved into three main variants:
graph TD
subgraph "Transformer Variants"
Original[Original Transformer]
Original --> EncoderOnly[Encoder-Only]
Original --> DecoderOnly[Decoder-Only]
Original --> EncoderDecoder[Encoder-Decoder]
subgraph "Examples"
EncoderOnly --> BERT[BERT, RoBERTa]
DecoderOnly --> GPT[GPT, Claude, LLaMA]
EncoderDecoder --> T5[T5, BART]
end
subgraph "Use Cases"
BERT --> Classification[Classification, NER, QA]
GPT --> Generation[Text Generation, Chatbots]
T5 --> Seq2Seq[Translation, Summarization]
end
end
Encoder-only models are bidirectional, meaning they can see the entire input context at once. They excel at understanding tasks:
- Text classification
- Named entity recognition
- Question answering
- Semantic understanding
Decoder-only models are unidirectional/autoregressive, generating one token at a time based on previous tokens. They excel at generation tasks:
- Text completion
- Chatbots
- Creative writing
- Code generation
Modern LLMs like GPT-4, Claude-3.7, DeepSeek R1, and o1 are all decoder-only architectures, though with various proprietary modifications.
graph TD
classDef embedding fill:#f9d5e5,stroke:#333,stroke-width:1px
classDef attention fill:#eeac99,stroke:#333,stroke-width:1px
classDef feedforward fill:#e06377,stroke:#333,stroke-width:1px
classDef normalization fill:#c83349,stroke:#333,stroke-width:1px
Input[Input Tokens] --> TokenEmbed[Token Embeddings]
TokenEmbed --> PosEmbed[Positional Embeddings]
PosEmbed --> FirstBlock[First Decoder Block]
subgraph "Decoder Stack (Nx)"
FirstBlock --> INT1[Intermediate Blocks]
INT1 --> LastBlock[Last Decoder Block]
subgraph "Single Decoder Block"
In[Block Input] --> LN1[Layer Normalization]
LN1 --> Attn[Masked Self-Attention]
Attn --> Residual1[Residual Connection]
Residual1 --> LN2[Layer Normalization]
LN2 --> FFN[Feed-Forward Network]
FFN --> Residual2[Residual Connection]
Residual2 --> Out[Block Output]
end
end
LastBlock --> FinalLN[Final Layer Norm]
FinalLN --> LM[Linear Layer to Vocabulary Size]
LM --> SM[Softmax]
SM --> Output[Output Token Probabilities]
class TokenEmbed,PosEmbed embedding
class Attn attention
class FFN feedforward
class LN1,LN2,FinalLN normalization
These most closely resemble the original Transformer and are designed for sequence-to-sequence tasks:
- Machine translation
- Summarization
- Generative question answering
Tokenization is the first step in processing text for an LLM. It involves breaking text into smaller units called tokens.
graph TD
classDef token fill:#f9d5e5,stroke:#333,stroke-width:1px
Text["The quick brown fox jumps."] --> Tokenization
subgraph "Tokenization Methods"
Tokenization --> Character[Character-Level]
Tokenization --> Word[Word-Level]
Tokenization --> Subword[Subword-Level]
Character --> CharTokens["'T', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ..."]
Word --> WordTokens["'The', 'quick', 'brown', 'fox', 'jumps', '.'"]
Subword --> SubwordTokens["'The', 'quick', 'bro', '##wn', 'fox', 'jump', '##s', '.'"]
end
class CharTokens,WordTokens,SubwordTokens token
- Pros: No out-of-vocabulary (OOV) issues, small vocabulary
- Cons: Long sequences, loses semantic meaning of words
- Example: "hello" → ['h', 'e', 'l', 'l', 'o']
- Pros: Preserves semantic units, shorter sequences
- Cons: Large vocabulary, OOV issues, can't handle spelling errors
- Example: "The quick brown fox" → ['The', 'quick', 'brown', 'fox']
- Pros: Balance between character and word levels, handles rare words and morphology
- Cons: More complex algorithms, segmentation errors
- Example: "unbelievable" → ['un', 'believe', 'able']
BPE is an iterative algorithm that starts with individual characters and merges the most common pairs:
graph TD
classDef step fill:#f9d5e5,stroke:#333,stroke-width:1px
Raw["Original: 'low lower lowest'"] --> S1["Step 1: {'l':3, 'o':3, 'w':3, 'e':2, 'r':2, ' ':2, 's':1, 't':1}"]
S1 --> S2["Step 2: Merge 'l' and 'o' → 'lo'"]
S2 --> S3["Result: {'lo':3, 'w':3, 'e':2, 'r':2, ' ':2, 's':1, 't':1}"]
S3 --> S4["Step 3: Merge 'lo' and 'w' → 'low'"]
S4 --> Final["Final Vocab (after iterations): {'low':3, 'e':2, 'r':2, ' ':2, 's':1, 't':1, 'er':1, etc.}"]
class S1,S2,S3,S4 step
BPE is used in GPT models and creates a vocabulary of subwords that balances compression and semantic meaning.
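A toy version of the merge loop in plain Python, heavily simplified compared to production BPE trainers (the function names are ours):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each a list of symbols. Returns the most common adjacent pair."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for word in corpus:
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged.append(new_word)
    return merged

corpus = [list("low"), list("lower"), list("lowest")]
for _ in range(3):                        # three merge iterations
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, corpus)
```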
Similar to BPE, but uses a different merging criterion based on likelihood rather than frequency.
Treats the text as a sequence of Unicode characters and applies BPE or unigram language modeling. Important because it's completely language-agnostic and works for any language.
flowchart LR
Input["Input text: 'The transformer architecture is revolutionary.'"] --> Normalization["Text Normalization (lowercase, Unicode normalization)"]
Normalization --> PreTokenization["Pre-tokenization (splitting by whitespace)"]
PreTokenization --> SubwordAlgorithm["Apply Subword Algorithm (BPE, WordPiece, etc.)"]
SubwordAlgorithm --> TokenIDs["Convert to Token IDs using Vocabulary"]
TokenIDs --> Result["Result: [1923, 29871, 1658, 297, 13849, 4]"]
Modern tokenizers include special tokens:
- `[CLS]`: Start of sequence token
- `[SEP]`: Separator token
- `[PAD]`: Padding token
- `[MASK]`: Masked token (for BERT-style models)
- `[BOS]`, `[EOS]`: Beginning/end of sequence
- `[UNK]`: Unknown token for OOV words
Embeddings convert discrete tokens into continuous vector representations that capture semantic meaning.
timeline
title Evolution of Word Embedding Methods
One-Hot Encoding : Simple binary vectors, no semantics
Word2Vec (2013) : CBOW and Skip-gram models
GloVe (2014) : Global vectors from co-occurrence statistics
FastText (2016) : Character n-gram embeddings
Contextual Embeddings (2018+) : ELMo, BERT embeddings
In the context of Transformers, each token ID is mapped to a vector of dimension d_model (typically 768 in BERT-base or 1024 in larger models). These embeddings are learned during training and encode semantic relationships between words.
graph TD
classDef embedding fill:#f9d5e5,stroke:#333,stroke-width:1px
Token1["Token: 'cat'"] --> ID1["Token ID: 892"]
Token2["Token: 'dog'"] --> ID2["Token ID: 1437"]
Token3["Token: 'kitten'"] --> ID3["Token ID: 4291"]
ID1 --> Lookup[Embedding Lookup Table]
ID2 --> Lookup
ID3 --> Lookup
Lookup --> Vec1["Vector: [0.2, -0.1, 0.5, ..., 0.3]"]
Lookup --> Vec2["Vector: [0.1, -0.2, 0.4, ..., 0.2]"]
Lookup --> Vec3["Vector: [0.25, -0.15, 0.55, ..., 0.35]"]
Vec1 --> Space[Embedding Space]
Vec2 --> Space
Vec3 --> Space
class Vec1,Vec2,Vec3 embedding
After embedding, semantically similar words are close to each other in the vector space. For example, 'cat' and 'kitten' would be closer to each other than 'cat' and 'democracy'.
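In PyTorch this lookup is just an nn.Embedding layer; the token IDs below mirror the ones in the diagram and are purely illustrative:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # the learnable lookup table

token_ids = torch.tensor([[892, 1437, 4291]])   # e.g. 'cat', 'dog', 'kitten' (illustrative IDs)
vectors = embedding(token_ids)                  # retrieves one row per token ID
print(vectors.shape)                            # torch.Size([1, 3, 768])

# Cosine similarity between two vectors; after training, related words tend to score higher
sim = torch.cosine_similarity(vectors[0, 0], vectors[0, 2], dim=0)
print(sim.item())
```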
In the original Transformer, positional information was added using sine and cosine functions. Modern models often use learned positional embeddings instead, which are added directly to the token embeddings:
graph LR
TokenEmb[Token Embedding] --> Add["+"]
PosEmb[Positional Embedding] --> Add
Add --> FinalEmb[Final Embedding Input]
A unique vector for each position up to a maximum sequence length.
Encodes the relative distance between tokens rather than absolute positions. Used in models such as T5.
Rotary position embeddings (RoPE), used in models like GPT-NeoX and LLaMA, apply a rotation to the query and key vectors based on position. This handles extrapolation to longer sequences better.
Parameters in a language model are the learnable values that are adjusted during training to optimize the model's performance. These include:
- Weights: Values in dense or convolutional layers that determine how inputs are transformed
- Biases: Offset values added to the output of a layer
- Embedding matrices: Special weight matrices that convert tokens to vectors
When we refer to a model's "size" (e.g., "GPT-3 has 175 billion parameters"), we're counting the total number of these learnable values. The parameter count is a key factor in a model's capability and computational requirements.
graph LR
subgraph "LLM Parameters Overview"
Parameters["Total Parameters"] --> EmbedMatrix["Embedding Matrix<br>(~15-30% of parameters)"]
Parameters --> AttnParams["Attention Parameters<br>(Q, K, V projections)"]
Parameters --> FFNParams["Feed-Forward Network<br>(Usually largest portion)"]
Parameters --> LNParams["Layer Normalization<br>(Small portion)"]
end
While often used interchangeably, there's a subtle distinction:
- Parameters is the broader term encompassing all learnable values
- Weights specifically refers to the multiplicative parameters (as opposed to biases)
In practical discussions, "weights" often means the entire set of model parameters, particularly when discussing saving or loading model checkpoints (often called "weights files").
The embedding matrix is one of the largest parameter collections in a model:
graph TD
subgraph "Embedding Matrix Structure"
Vocab["Vocabulary<br>(e.g., 50,000 tokens)"] --> Matrix["Embedding Matrix<br>(50,000 × d_model)"]
Matrix --> V1["Vector for token 1<br>(d_model dimensions)"]
Matrix --> V2["Vector for token 2<br>(d_model dimensions)"]
Matrix --> V3["..."]
Matrix --> Vn["Vector for token n<br>(d_model dimensions)"]
end
The embedding matrix:
- Has dimensions [vocab_size × embedding_dimension]
- Contains one vector per token in the vocabulary
- Is initialized randomly and learned during training
- Encodes semantic relationships between tokens in a continuous vector space
- Is not just a one-hot encoding of words, but dense vectors learned to capture meaning
A model uses this matrix as a lookup table: when processing token ID 234, it retrieves row 234 from the embedding matrix. The resulting embedding vector represents that token's meaning in a form the model can process.
In most models, both the input embedding matrix and the output projection (which converts final representations back to vocabulary logits) often share the same weights, reducing the total parameter count.
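A back-of-the-envelope parameter count for a GPT-style decoder, under the simplifying assumptions stated in the comments (it ignores biases and positional embeddings, so treat the result as an estimate):

```python
def estimate_parameters(vocab_size, d_model, n_layers, d_ff, tie_embeddings=True):
    """Rough parameter count for a decoder-only model, ignoring biases and positional embeddings."""
    embedding = vocab_size * d_model                 # token embedding matrix
    attention = 4 * d_model * d_model                # Q, K, V and output projections per layer
    ffn = 2 * d_model * d_ff                         # two linear layers per layer
    layer_norms = 2 * 2 * d_model                    # two LayerNorms (gain + bias) per layer
    per_layer = attention + ffn + layer_norms
    output_head = 0 if tie_embeddings else vocab_size * d_model   # 0 when weights are shared
    return embedding + n_layers * per_layer + output_head

# Example: a GPT-2-small-like configuration
print(estimate_parameters(vocab_size=50_257, d_model=768, n_layers=12, d_ff=3_072))
# ~124M, roughly in line with GPT-2 small's published size
```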
The relationship between parameters and model performance follows scaling laws:
- Increasing parameters generally improves performance (with sufficient training data)
- Performance scales as a power law with parameter count
- Doubling parameters typically gives diminishing (but predictable) improvements
This relationship has driven the trend toward increasingly large models, from millions to trillions of parameters, though efficient architectures can sometimes achieve better performance with fewer parameters.
Before attention, models struggled with long-range dependencies in text. For example, in the sentence: "The trophy didn't fit in the suitcase because it was too big", what does "it" refer to? A human knows "it" refers to "trophy", but models without attention struggle with this.
Self-attention computes relationships between all words in a sequence by:
1. Creating three projections from each token's embedding:
   - Query (Q): What the token is looking for
   - Key (K): What the token advertises about itself
   - Value (V): The information the token carries
2. Computing attention scores between all tokens:
   - For token i, compute a score with every token j using the dot product of Q_i and K_j
   - Scale the scores by 1/√d_k to prevent vanishing gradients with large dimensions
   - Apply softmax to get a probability distribution
3. Computing the weighted sum of values:
   - Multiply each value vector by its attention score
   - Sum these weighted values to get the output for token i
graph TD
classDef embedding fill:#f9d5e5,stroke:#333,stroke-width:1px
classDef matrix fill:#eeac99,stroke:#333,stroke-width:1px
classDef operation fill:#e06377,stroke:#333,stroke-width:1px
InputText["'The cat sits on the mat'"] --> Embeddings[Token Embeddings]
Embeddings --> Q[Query Projection]
Embeddings --> K[Key Projection]
Embeddings --> V[Value Projection]
subgraph "Self-Attention Calculation"
Q --> QK[Q×K^T Matrix]
K --> QK
QK --> Scale["Scale by 1/√dk"]
Scale --> Mask[Optional Masking]
Mask --> Softmax[Softmax]
Softmax --> AV[Attention Weights × Value Matrix]
V --> AV
AV --> Output[Output: Contextualized Embeddings]
end
subgraph "Example for word 'cat'"
Ex1["'cat' Query"] --> ExQK["Scores with all words:"]
ExQK --> Scores["'The': 0.1, 'cat': 0.7, 'sits': 0.15, 'on': 0.01, 'the': 0.02, 'mat': 0.02"]
Scores --> ExSM["After Softmax:"]
ExSM --> Probs["'The': 0.08, 'cat': 0.78, 'sits': 0.11, 'on': 0.01, 'the': 0.01, 'mat': 0.01"]
Probs --> ExFinal["Final embedding: 0.08×V_The + 0.78×V_cat + 0.11×V_sits + ..."]
end
class Embeddings embedding
class QK,Scale,Softmax,AV matrix
class Q,K,V operation
Instead of performing a single attention function, transformers use multiple attention heads in parallel:
graph TD
InputEmb[Input Embeddings] --> Split[Split Into Multiple Heads]
Split --> H1[Head 1]
Split --> H2[Head 2]
Split --> HD[...]
Split --> HN[Head N]
H1 --> A1[Self-Attention 1]
H2 --> A2[Self-Attention 2]
HD --> AD[...]
HN --> AN[Self-Attention N]
A1 --> Concat[Concatenate]
A2 --> Concat
AD --> Concat
AN --> Concat
Concat --> Linear[Linear Projection]
Linear --> Output[Output]
Each head can focus on different aspects of relationships between tokens:
- One head might focus on subject-verb relationships
- Another might track pronoun references
- Others might focus on semantic similarity
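Putting the pieces together, a compact multi-head self-attention module might look like the following sketch (no masking or dropout; names and defaults are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention over inputs of shape (batch, seq, d_model)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)       # final linear projection

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_k) so each head attends independently
        q, k, v = (z.view(b, t, self.n_heads, self.d_k).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        weights = scores.softmax(dim=-1)
        context = weights @ v                                          # (batch, heads, seq, d_k)
        context = context.transpose(1, 2).contiguous().view(b, t, d)   # concatenate heads
        return self.out(context)

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x).shape)   # torch.Size([2, 10, 512])
```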
Used in decoder blocks to prevent tokens from attending to future positions during training.
Only computes attention for a subset of token pairs, reducing computational complexity.
Restricts attention to a local window around each token.
Used in models like Claude-3 and Mistral, only attends to the most recent tokens within a sliding window.
Used in DeepSeek R1 and PaLM, reduces computation by sharing key and value projections across multiple query heads.
A more efficient implementation that reduces memory usage and increases speed.
graph TD
subgraph "Main Pre-training Objectives"
CLM[Causal Language Modeling] --> NextToken[Predict Next Token]
MLM[Masked Language Modeling] --> PredictMasked[Predict Masked Tokens]
end
NextToken --> GPT[Used in GPT, Claude, LLaMA]
PredictMasked --> BERT[Used in BERT, RoBERTa]
The model predicts the next token given all previous tokens. This is the primary objective for decoder-only models like GPT and Claude.
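In code, causal language modeling reduces to shifting the targets by one position and computing cross-entropy over the vocabulary; the sketch below uses random tensors as a stand-in for a real model's logits:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 8, 100
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# Stand-in for model(input_ids) -> logits of shape (batch, seq_len, vocab_size)
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)

# Predict token t+1 from tokens <= t: drop the last logit, drop the first target
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels)
loss.backward()
print(loss.item())
```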
The model predicts masked tokens given the surrounding context. This is used in encoder models like BERT.
Training modern LLMs requires significant computational resources:
- Thousands of GPUs/TPUs running for weeks or months
- Distributed training across multiple machines
- Mixed precision training to save memory
- Gradient checkpointing to handle large models
- Optimized implementations of attention and other operations
flowchart TD
Data[Raw Text Data] --> Preprocess[Preprocessing & Cleaning]
Preprocess --> Tokenize[Tokenization]
Tokenize --> Batching[Batching & Sampling]
subgraph "Training Loop"
Batching --> Forward[Forward Pass]
Forward --> Loss[Loss Calculation]
Loss --> Backward[Backward Pass]
Backward --> Optimize[Optimizer Step]
Optimize --> Forward
end
Optimize --> Checkpoint[Model Checkpoint]
Checkpoint --> Evaluate[Evaluation]
Evaluate --> Repeat[Repeat or Finish]
After pre-training, LLMs are fine-tuned for specific use cases:
Training on task-specific examples with explicit supervision.
Uses human preferences to guide model behavior:
- Train a preference model on human comparisons of outputs
- Use reinforcement learning to optimize the model based on the preference model
A more efficient alternative to RLHF that directly optimizes for human preferences without a separate reward model.
Methods that update only a small subset of model parameters:
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
- Adapters
- Prompt tuning
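To make the LoRA idea concrete, here is a simplified sketch of a low-rank adapter around a frozen linear layer; it illustrates the technique only and is not the PEFT library's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12288 trainable values instead of ~590k for the full layer
```

Only the two small matrices A and B are updated during fine-tuning, which is why LoRA needs a fraction of the memory of full fine-tuning.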
LLM performance doesn't scale linearly with model size or computation. Several important scaling laws have been discovered:
graph LR
subgraph "Scaling Laws"
direction TB
Loss["Loss (L)"] --> SL["L ∝ N⁻ᵃ"]
SL --> SL2["a ≈ 0.076"]
Parameters["Model Size (N)"] --> SL
CompLoss["Loss (L)"] --> SC["L ∝ C⁻ᵇ"]
SC --> SC2["b ≈ 0.050"]
Compute["Compute (C)"] --> SC
DataLoss["Loss (L)"] --> SD["L ∝ D⁻ᶜ"]
SD --> SD2["c ≈ 0.095"]
Data["Dataset Size (D)"] --> SD
end
Research by DeepMind showed that models were typically under-trained relative to their size. The Chinchilla paper suggested that for optimal performance:
- Models should be trained on ~20x more tokens than they have parameters
- Computing budget is better spent on training a smaller model on more data than a larger model on less data
graph TD
subgraph "Chinchilla Optimal Scaling"
direction LR
Budget[Compute Budget] --> Split[Split Budget]
Split --> ModelSize[Model Size]
Split --> DataSize[Data Size]
ModelSize --> Relationship["N ∝ C¹/²"]
DataSize --> Relationship2["D ∝ C¹/²"]
Relationship --> OptimalAllocation["Optimal Allocation"]
Relationship2 --> OptimalAllocation
end
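As a quick rule-of-thumb calculation based on the ~20 tokens-per-parameter guideline above (an approximation, not the paper's exact fit):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rule of thumb from the Chinchilla paper: roughly 20 training tokens per parameter."""
    return 20 * n_params

for n in (1e9, 7e9, 70e9):
    print(f"{n/1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n)/1e9:.0f}B tokens")
# 1B -> ~20B tokens, 7B -> ~140B tokens, 70B -> ~1400B tokens
```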
Modern models implement various techniques to improve efficiency:
flowchart TD
subgraph "Efficiency Techniques"
MoE[Mixture of Experts] --> SparseActivation["Sparse Activation of Parameters"]
GQA[Grouped-Query Attention] --> SharedKV["Shared Key-Value Projections"]
FA[Flash Attention] --> EfficientMM["Efficient Matrix Multiplication"]
QT[Quantization] --> LowerPrecision["Lower Precision Computation"]
end
subgraph "Memory vs Speed Tradeoffs"
KVCache["KV Caching"] --> FasterInference["Faster Inference"]
GC["Gradient Checkpointing"] --> ReducedMemory["Reduced Memory Usage"]
GC --> SlowerTraining["Slower Training"]
end
LLMs are evaluated on various benchmarks to measure their capabilities:
- MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects
- ARC (AI2 Reasoning Challenge): Elementary to college-level scientific questions
- GSM8K: Grade school math problems
- MATH: High school/college-level math problems
- BBH (Big Bench Hard): Collection of reasoning tasks
- HumanEval: Python programming problems
- MBPP (Mostly Basic Python Programming): Basic Python functions
For specific domains and use cases, custom benchmarks can be created to test:
- Domain-specific knowledge
- Industry-specific reasoning
- Task-specific generation capabilities
Building a commercial LLM service involves:
graph TD
subgraph "LLM Production Pipeline"
Custom[Customer Requirements] --> Data[Data Collection & Preparation]
Data --> FT[Fine-tuning Infrastructure]
FT --> Eval[Evaluation & Benchmarking]
Eval --> Deploy[Deployment]
Deploy --> Monitor[Monitoring & Feedback]
Monitor --> Iterate[Continuous Improvement]
Iterate --> Custom
end
Key components include:
- Model serving infrastructure
- API design
- Customization pipelines
- Monitoring and evaluation
- Security and data governance
- Vaswani et al. (2017). "Attention Is All You Need"
- Brown et al. (2020). "Language Models are Few-Shot Learners"
- Ouyang et al. (2022). "Training Language Models to Follow Instructions"
- Touvron et al. (2023). "LLaMA: Open and Efficient Foundation Language Models"
- Anthropic (2023). "Constitutional AI: Harmlessness from AI Feedback"
- DeepSeek-AI (2023). "DeepSeek: Scaling Language Models through Modular Design"
Building transformers from scratch requires careful attention to numerical stability:
flowchart TD
subgraph "Numerical Stability Issues"
Softmax["Softmax Saturation"] --> ScaleDotProduct["Scale Dot Product (÷√d)"]
GradientVanishing["Gradient Vanishing"] --> SkipConnections["Add Skip Connections"]
TrainingInstability["Training Instability"] --> LayerNorm["Layer Normalization"]
ExplodingGradients["Exploding Gradients"] --> GradClip["Gradient Clipping"]
end
When the dot products become too large, softmax becomes saturated, causing gradients to vanish. Scaling by 1/√d_k helps maintain variance.
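A quick numerical check of this effect, using random vectors: without scaling, the standard deviation of the dot products grows like √d_k, and softmax over large scores collapses toward a one-hot distribution:

```python
import torch

torch.manual_seed(0)
d_k = 512
q, k = torch.randn(1000, d_k), torch.randn(1000, d_k)

raw = (q * k).sum(-1)                 # unscaled dot products, variance grows with d_k
scaled = raw / d_k ** 0.5             # scaled dot products, variance stays near 1
print(raw.std().item(), scaled.std().item())   # ~22.6 vs ~1.0

# Softmax over large-magnitude scores is nearly one-hot, so gradients through it vanish
scores = torch.tensor([10.0, 0.0, -10.0])
print(torch.softmax(scores, dim=0))            # ~[1.0, 4.5e-5, 2.1e-9]
print(torch.softmax(scores / 10, dim=0))       # a much softer distribution
```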
There are different approaches to where layer normalization should be applied:
- Pre-LN: Apply normalization before attention and feed-forward blocks (more stable training)
- Post-LN: Apply normalization after attention and feed-forward blocks (original Transformer design)
graph TD
subgraph "Pre-LayerNorm (More Stable)"
Inp1[Input] --> LN1[LayerNorm]
LN1 --> Attn1[Self-Attention]
Attn1 --> Add1[+ Input]
Add1 --> LN2[LayerNorm]
LN2 --> FF1[Feed-Forward]
FF1 --> Add2[+ Add1]
end
subgraph "Post-LayerNorm (Original)"
Inp2[Input] --> Attn2[Self-Attention]
Attn2 --> Add3[+ Input]
Add3 --> LN3[LayerNorm]
LN3 --> FF2[Feed-Forward]
FF2 --> Add4[+ LN3]
Add4 --> LN4[LayerNorm]
end
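A sketch of the Pre-LN arrangement in PyTorch, using simple stand-ins for the attention and feed-forward sublayers (names and defaults are ours, not the project's code):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LayerNorm block: normalize *before* each sublayer, then add the residual."""
    def __init__(self, d_model, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around (LN -> attention)
        x = x + self.ffn(self.ln2(x))    # residual around (LN -> feed-forward)
        return x

d_model = 512
block = PreLNBlock(
    d_model,
    attn=nn.Linear(d_model, d_model),   # stand-in for a masked self-attention module
    ffn=nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model)),
)
x = torch.randn(2, 10, d_model)
print(block(x).shape)                   # torch.Size([2, 10, 512])
```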
When implementing transformers from scratch, these optimizations improve performance:
flowchart LR
subgraph "Memory Optimizations"
GC["Gradient Checkpointing"] --> MemTradeoff["Memory↓ Speed↓"]
MP["Mixed Precision"] --> MP16["FP16/BF16 + FP32"]
ZeRO["ZeRO Optimizer"] --> SharedParams["Partitioned Parameters"]
end
subgraph "Speed Optimizations"
FA["Flash Attention"] --> TileMatmul["Tiled Matrix Multiply"]
Kernels["Custom CUDA Kernels"] --> MHA["Optimized Multi-Head Attention"]
TP["Tensor Parallelism"] --> DistAttn["Distributed Attention Heads"]
end
Implements attention with significantly lower memory footprint:
- Recomputes attention on the fly during backward pass
- Uses block-wise processing to optimize cache usage
- Avoids materializing the full attention matrix
For autoregressive generation, properly implementing KV-cache is crucial:
- Store Key and Value states from previous tokens
- Only compute new tokens' K/V and attention with cached values
- Manage memory growth for long sequences
graph TD
subgraph "KV-Cache for Efficient Inference"
Start["Initial Prompt"] --> GenT1["Generate Token 1"]
GenT1 --> KV1["Cache K/V States"]
KV1 --> GenT2["Generate Token 2"]
GenT2 --> KV2["Append to K/V Cache"]
KV2 --> GenT3["Generate Token 3"]
GenT3 --> KV3["Append to K/V Cache"]
KV3 --> GenNext["..."]
end
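A stripped-down sketch of the caching pattern for a single attention head (no batching, random projection weights); it shows only where the cache is read and appended, not the project's actual decoding loop:

```python
import torch

d_model = 64
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))

k_cache, v_cache = [], []

def attend_next(x_t):
    """x_t: (1, d_model) embedding of the newest token. Reuses cached K/V from earlier steps."""
    q = x_t @ wq
    k_cache.append(x_t @ wk)         # only the new token's K/V are computed
    v_cache.append(x_t @ wv)
    K = torch.cat(k_cache, dim=0)    # (t, d_model)
    V = torch.cat(v_cache, dim=0)
    weights = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return weights @ V               # (1, d_model) contextualized output

for step in range(5):                # pretend we generate 5 tokens
    out = attend_next(torch.randn(1, d_model))
print(len(k_cache), out.shape)       # 5 torch.Size([1, 64])
```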
When building transformers, these are common failure modes and solutions:
- Exploding Loss: Check normalization layers, initialization, and learning rates
- Memorization Instead of Generalization: Verify dataset quality and diversity
- Slow Convergence: Consider warmup schedules and proper optimizer settings
- Attention Pattern Problems: Visualize attention weights to debug misbehavior
- GPU Out-of-Memory: Implement gradient accumulation and efficient attention
flowchart TD
subgraph "Debugging Decision Tree"
Problem["Training Problem"] --> ExpLoss["Exploding Loss?"]
ExpLoss -- Yes --> NormCheck["Check Normalizations"]
ExpLoss -- No --> Converge["Slow Convergence?"]
Converge -- Yes --> WarmupLR["Use Warmup & Adjust LR"]
Converge -- No --> Memory["OOM Errors?"]
Memory -- Yes --> AccumBatch["Use Gradient Accumulation"]
end
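For the OOM and exploding-loss branches, gradient accumulation and gradient clipping are the usual first fixes; here is a generic PyTorch sketch of both (the model and loss are stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                       # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                   # effective batch = micro-batch * accum_steps

for step in range(32):
    x = torch.randn(4, 512)                       # micro-batch
    loss = model(x).pow(2).mean()                 # stand-in loss
    (loss / accum_steps).backward()               # scale so accumulated gradients average correctly

    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # tame exploding gradients
        optimizer.step()
        optimizer.zero_grad()
```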
Follow the tasks in TODO.md to begin working through this project.
This section provides a practical, step-by-step guide to the entire LLM process, from raw data to a deployed model capable of generating text via prompts.
flowchart TD
subgraph "1. Data Preparation"
Data[Raw Text Data] --> Clean[Text Cleaning]
Clean --> Split[Train/Validation Split]
Split --> Format[Data Formatting]
end
subgraph "2. Tokenization"
Format --> Train_Tokenizer[Train Tokenizer]
Train_Tokenizer --> Encode_Data[Encode Dataset]
end
subgraph "3. Model Architecture"
Model_Def[Define Model Architecture] --> Init[Initialize Weights]
end
subgraph "4. Pre-training"
Encode_Data --> Batch[Create Batches]
Init --> Forward[Forward Pass]
Batch --> Forward
Forward --> Loss[Compute Loss]
Loss --> Backward[Backward Pass]
Backward --> Optimize[Optimizer Step]
Optimize --> Checkpoint[Save Checkpoint]
Checkpoint --> Evaluate[Evaluate on Validation]
Evaluate --> Continue{Continue?}
Continue -->|Yes| Forward
Continue -->|No| Pretrained[Pre-trained Model]
end
subgraph "5. Fine-tuning"
Pretrained --> SFT[Supervised Fine-Tuning]
SFT --> RLHF[Reinforcement Learning from Human Feedback]
RLHF --> Final[Production-Ready Model]
end
subgraph "6. Deployment"
Final --> Quantize[Quantize Model]
Quantize --> Serve[Model Serving]
Serve --> API[API Layer]
API --> Monitor[Monitoring & Logging]
end
- Python 3.8+
- Data processing libraries: pandas, numpy
- Text processing libraries: nltk, spaCy
- Dataset libraries: datasets (HuggingFace)
- Data Collection

  ```bash
  # Download datasets from HuggingFace
  python -m datasets download --save-path ./data [dataset_name]

  # Or for custom data
  python src/data/collect_data.py --source [source] --output ./data/raw
  ```
- Data Cleaning

  ```bash
  # Run cleaning and preprocessing script
  python src/data/preprocess.py --input ./data/raw --output ./data/processed
  ```
This involves:
- Removing HTML tags, special characters
- Deduplication of texts
- Language filtering
- Quality filtering (length, perplexity)
- Train/Validation Split

  ```bash
  # Create data splits
  python src/data/create_splits.py --input ./data/processed --output ./data/splits --train-ratio 0.9
  ```
- Custom tokenizer (src/tokenizer/bpe.py) or
- HuggingFace tokenizers library
- SentencePiece library (for production systems)
- Train a Tokenizer

  ```bash
  # Using our custom BPE implementation
  python src/tokenizer/train.py --input ./data/splits/train.txt --vocab-size 50000 --output ./models/tokenizer

  # Or using HuggingFace tokenizers
  python -c "from tokenizers import Tokenizer, models, trainers; tokenizer = Tokenizer(models.BPE()); trainer = trainers.BpeTrainer(vocab_size=50000); tokenizer.train_from_iterator([open('./data/splits/train.txt').read()], trainer); tokenizer.save('./models/tokenizer/tokenizer.json')"
  ```
- Encode the Dataset

  ```bash
  python src/data/encode.py --tokenizer ./models/tokenizer --input ./data/splits --output ./data/tokenized
  ```
This process:
- Loads the trained tokenizer
- Converts each text into token IDs
- Saves encoded datasets for efficient loading during training
- PyTorch 1.9+
- Optional: HuggingFace Transformers (for reference implementations)
- Define the Model

  ```python
  # In src/model/transformer.py
  import torch
  import torch.nn as nn

  class TransformerConfig:
      def __init__(self, vocab_size=50000, d_model=768, n_layers=12,
                   n_heads=12, d_ff=3072, max_seq_len=2048):
          self.vocab_size = vocab_size
          self.d_model = d_model
          self.n_layers = n_layers
          self.n_heads = n_heads
          self.d_ff = d_ff
          self.max_seq_len = max_seq_len

  class Transformer(nn.Module):
      def __init__(self, config):
          super().__init__()
          # Token embeddings
          self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)
          # Positional embeddings
          self.position_embedding = nn.Embedding(config.max_seq_len, config.d_model)
          # Transformer layers (decoder-only for simplicity);
          # TransformerLayer is the decoder block defined alongside this class
          self.layers = nn.ModuleList([
              TransformerLayer(config) for _ in range(config.n_layers)
          ])
          # Output head
          self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
          # Layer norm
          self.ln_f = nn.LayerNorm(config.d_model)

      def forward(self, input_ids):
          # Get embeddings
          x = self.token_embedding(input_ids)
          # Add positional embeddings
          positions = torch.arange(0, input_ids.size(1), device=input_ids.device).unsqueeze(0)
          x = x + self.position_embedding(positions)
          # Apply transformer layers
          for layer in self.layers:
              x = layer(x)
          # Apply final layer norm
          x = self.ln_f(x)
          # Get logits
          logits = self.lm_head(x)
          return logits
  ```
- Initialize the Model

  ```bash
  python src/model/init_model.py --config ./configs/model_config.json --output ./models/initial
  ```
Key architectural choices:
- Number of layers and attention heads
- Hidden dimension size
- Feed-forward dimension
- Attention type (vanilla, multi-query, grouped-query)
- Context window length
- PyTorch 1.9+
- Training utilities: torch.distributed for multi-GPU
- Optional: DeepSpeed, Accelerate for large-scale training
- Monitoring: wandb, tensorboard
- Set Up Training Configuration

  ```bash
  # Create training config
  cat > ./configs/training_config.json << EOL
  {
    "batch_size": 32,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "warmup_steps": 10000,
    "max_steps": 1000000,
    "weight_decay": 0.01,
    "lr_scheduler": "cosine",
    "eval_interval": 5000
  }
  EOL
  ```
- Run Pre-training

  ```bash
  # Single GPU training
  python src/training/pretrain.py --model ./models/initial --data ./data/tokenized --output ./models/pretrained --config ./configs/training_config.json

  # Multi-GPU training with torch.distributed
  python -m torch.distributed.launch --nproc_per_node=8 src/training/pretrain.py --model ./models/initial --data ./data/tokenized --output ./models/pretrained --config ./configs/training_config.json
  ```
The training process:
- Loads batches of encoded text
- Performs causal language modeling (predicts next token)
- Computes loss function (cross-entropy)
- Performs backward pass to compute gradients
- Updates model weights using optimizer (AdamW)
- Periodically evaluates on validation set
- Saves checkpoints
- Monitor Training

  ```bash
  # Launch TensorBoard to track progress
  tensorboard --logdir ./models/pretrained/logs
  ```
Metrics to monitor:
- Training loss
- Validation loss
- Learning rate
- Gradient norm
- Validation perplexity
- Same as pre-training
- PEFT library for parameter-efficient fine-tuning
- Datasets formatted for instruction following
- Supervised Fine-Tuning (SFT)

  ```bash
  # Prepare instruction dataset
  python src/data/prepare_instructions.py --input ./data/raw_instructions --output ./data/sft_data

  # Run SFT
  python src/training/finetune.py --model ./models/pretrained --data ./data/sft_data --output ./models/sft --method full

  # Or with LoRA for parameter-efficient fine-tuning
  python src/training/finetune.py --model ./models/pretrained --data ./data/sft_data --output ./models/sft_lora --method lora --lora-r 8 --lora-alpha 16
  ```
- Reinforcement Learning from Human Feedback (RLHF)

  ```bash
  # Train reward model on human preferences
  python src/training/train_reward_model.py --model ./models/sft --data ./data/preferences --output ./models/reward_model

  # Run RLHF using PPO
  python src/training/rlhf.py --policy-model ./models/sft --reward-model ./models/reward_model --output ./models/rlhf
  ```
Or alternatively, use Direct Preference Optimization (DPO):
```bash
python src/training/dpo.py --model ./models/sft --data ./data/preferences --output ./models/dpo
```
- Evaluation libraries: lm-eval-harness
- Custom benchmarking tools
- Run Standard Benchmarks

  ```bash
  # Run MMLU benchmark
  python src/benchmarks/run_benchmark.py --model ./models/rlhf --benchmark mmlu --output ./results/mmlu

  # Run HumanEval for code
  python src/benchmarks/run_benchmark.py --model ./models/rlhf --benchmark humaneval --output ./results/humaneval
  ```
- Custom Task Evaluation

  ```bash
  # Run on domain-specific tasks
  python src/benchmarks/evaluate_custom.py --model ./models/rlhf --tasks ./data/custom_tasks --output ./results/custom
  ```
- Analyze Results

  ```bash
  python src/benchmarks/analyze_results.py --input ./results --output ./reports
  ```
- Model serving: ONNX Runtime, TensorRT, TorchServe, or vLLM
- API framework: FastAPI, Flask
- Containerization: Docker
- Orchestration: Kubernetes
- Model Optimization

  ```bash
  # Quantize model to int8
  python src/serving/quantize.py --model ./models/rlhf --output ./models/production --quantization int8
  ```
- Set Up Model Server

  ```bash
  # Export for TorchServe
  torch-model-archiver --model-name binary_llm --version 1.0 --model-file ./src/serving/model.py --serialized-file ./models/production/pytorch_model.bin --handler ./src/serving/handler.py --export-path ./models/serve

  # Launch model server
  torchserve --start --model-store ./models/serve --models binary_llm=binary_llm.mar --ncs
  ```
- Deploy API Server

  ```bash
  # Run FastAPI server
  uvicorn src.api.server:app --host 0.0.0.0 --port 8000

  # Or with Docker
  docker build -t binary_llm:latest .
  docker run -p 8000:8000 binary_llm:latest
  ```
- Set Up Monitoring

  ```bash
  # Deploy Prometheus and Grafana for metrics
  docker-compose -f monitoring/docker-compose.yml up -d
  ```
Here's a complete list of software needed for the entire pipeline:
| Stage | Core Software | Optional Tools |
|---|---|---|
| Data Preparation | Python, pandas, nltk, spaCy | HuggingFace datasets |
| Tokenization | Custom BPE, SentencePiece | HuggingFace tokenizers |
| Model Architecture | PyTorch | HuggingFace transformers |
| Pre-training | PyTorch, torch.distributed | DeepSpeed, Accelerate, wandb |
| Fine-tuning | PyTorch, PEFT | TRL (for RLHF) |
| Evaluation | lm-eval-harness | Custom benchmarking tools |
| Deployment | ONNX Runtime, FastAPI | TensorRT, Docker, Kubernetes |
- Hardware Requirements:
  - Pre-training: Multiple high-end GPUs (8× NVIDIA A100s or better)
  - Fine-tuning: 1-4 GPUs depending on model size
  - Inference: GPU with sufficient VRAM for model size, or CPU for quantized models
- Time Estimates:
  - Tokenizer training: 1-4 hours
  - Pre-training: Days to months depending on dataset size and computing resources
  - Fine-tuning: Several hours to a few days
  - Deployment setup: Several hours
- Cost Estimates:
  - Cloud GPU costs: $1-10 per hour per GPU
  - Pre-training a mid-sized model (1-10B parameters): $10,000-$100,000
  - Fine-tuning: $100-$1,000
- Practical Alternatives:
  - Use pre-trained open-source models (LLaMA, Mistral) instead of pre-training
  - Use parameter-efficient fine-tuning (LoRA) to reduce costs
  - Start with smaller models for experimentation
By following this pipeline, you can build, train, and deploy your own LLM from raw data to a production-ready system.