01 Generative AI & Large Language Models
Generative AI — machines capable of creating content that mimics human ability — is a subset of traditional machine learning. The models learn by finding statistical patterns in massive datasets of human-created content.
Large Language Models (LLMs) are trained on billions of words from the internet. Base models include GPT, FLAN-T5, BERT, LLaMA, BLOOM, and PaLM. Models are differentiated by parameter count: the more parameters a model has, the more it can retain from its training data and the more complex the tasks it can tackle. Smaller models can be fine-tuned for narrow, focused tasks.
A model's apparent understanding of language grows with its parameter count, and it is this learned representation that processes and solves the task in the user's prompt.
Key Terminology
LLM Use Cases & Tasks
Predicting the next word is the basis of all LLM output. From this single objective, a vast range of tasks emerge:
- Essay and long-form writing
- Summarise long text into short output
- Machine translation between languages
- Generate code from natural language
- Information retrieval — named entity extraction
- Augmenting LLMs by interacting with external APIs and databases — providing information not available at pre-training time
02 Transformer Architecture
Before Transformers — RNNs
Recurrent Neural Networks (RNNs) were powerful for their time but limited by computing and memory requirements. Even after scaling, they struggled to look beyond immediately previous words to understand language in the context of an entire sentence, paragraph, or document.
The breakthrough came in 2017 with "Attention is All You Need", published by Google and the University of Toronto. Transformers allowed efficient scaling, parallel processing of input data, and the ability to learn which words to pay attention to.
The power of the Transformer lies in its ability to learn the mapping between every word in a sentence and every other word — simultaneously, not sequentially. This is Self-Attention.
Step-by-step: How data flows
1. Tokeniser: converts text to integer token IDs. Must use the same tokeniser for training and inference.
2. Embedding layer: maps each token ID to a high-dimensional vector. Vectors encode meaning and context. Original Transformer uses 512-dimensional vectors.
3. Positional encoding: added to embeddings to preserve word order, since input is processed in parallel (not sequentially).
4. Self-Attention: the model analyses relationships between all tokens simultaneously. Attention weights identify how much each word depends on every other word.
5. Multi-Head Attention: multiple sets of attention weights learned in parallel (12–100 heads are common). Each head learns a different linguistic aspect, for example one head for named entities and another for activities. Weights are randomly initialised and emerge from training.
6. Feed-forward network: processes the attended representations into a vector of logits, one per token in the vocabulary.
7. Softmax layer: converts logits into probabilities over the full vocabulary, one probability per possible next token.
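The self-attention and softmax steps above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, assuming toy 2-dimensional vectors and skipping the learned query, key, and value projections that a real Transformer applies; the function names are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Assumption: the same toy vectors serve as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how strongly one token attends to every other token. Multi-head attention repeats this computation with separately learned projections per head.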
03 Transformer Model Variants
The encoder and decoder components can be used independently or together, giving rise to three distinct model families:
Encoder-only (autoencoding) models
- Masked language modeling: predict masked tokens (denoising)
- Builds bidirectional context representations
- Sentiment analysis, NER, word classification
Encoder-decoder (sequence-to-sequence) models
- Input and output sequences can have different lengths
- T5 uses span corruption: mask spans, reconstruct via decoder
- Translation, summarisation, Q&A
Decoder-only (autoregressive) models
- Causal language modeling: predict next token from previous tokens
- Unidirectional context only
- Text generation, zero-shot capabilities
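The three training objectives listed above (masked language modeling, span corruption, and causal language modeling) differ mainly in how inputs and targets are built from the same text. The sketch below illustrates that with a toy sentence; the mask token `<MASK>` and the sentinel `<X>` are placeholder names chosen for this example, not the exact tokens any particular model uses.

```python
def masked_lm_example(tokens, mask_positions, mask_token="<MASK>"):
    """Encoder-only objective: hide some tokens, predict them from both sides."""
    inputs = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return inputs, targets

def span_corruption_example(tokens, start, length, sentinel="<X>"):
    """Encoder-decoder objective: replace a span with a sentinel, decode the span."""
    inputs = tokens[:start] + [sentinel] + tokens[start + length:]
    targets = [sentinel] + tokens[start:start + length]
    return inputs, targets

def causal_lm_examples(tokens):
    """Decoder-only objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["the", "teacher", "teaches", "the", "student"]
mlm_inputs, mlm_targets = masked_lm_example(sentence, {2})
span_inputs, span_targets = span_corruption_example(sentence, 1, 2)
clm_pairs = causal_lm_examples(sentence)
```

Note that the causal examples only ever see the left context, while the masked example sees tokens on both sides of the gap.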
04 Prompting & In-context Learning
Developing and improving a prompt is Prompt Engineering. Providing examples inside the prompt is In-context Learning (ICL) — it helps the model learn the task at hand without updating any weights.
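In-context learning comes down to how the prompt text is assembled. The sketch below builds zero-shot and few-shot prompts for a sentiment task; the `build_prompt` helper, the `Sentiment:` prefix, and the sample reviews are all assumptions made for illustration.

```python
def build_prompt(instruction, examples, query, answer_prefix="Sentiment:"):
    """Assemble a zero-, one-, or few-shot prompt as plain text."""
    blocks = []
    for text, answer in examples:
        # Each solved example repeats the instruction and shows the desired completion.
        blocks.append(f"{instruction}\n{text}\n{answer_prefix} {answer}")
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"{instruction}\n{query}\n{answer_prefix}")
    return "\n\n".join(blocks)

instruction = "Classify this review:"
examples = [
    ("I loved this movie!", "Positive"),
    ("I don't like this chair.", "Negative"),
]
query = "Who would use this product?"

zero_shot = build_prompt(instruction, [], query)
few_shot = build_prompt(instruction, examples, query)
```

No weights change in either case: the examples only occupy part of the context window, so longer few-shot prompts leave less room for the actual input.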
05 Generative Configuration
Configuration parameters invoked at inference time to influence next-token selection:
Max New Tokens
Limits the number of tokens the model generates. Does not guarantee the model will reach that limit — it may emit an <EOS> token earlier.
Greedy vs Random Sampling
- Greedy (default): always selects the token with the highest probability. Susceptible to repetitive sequences.
- Random sampling: selects a token at random, weighted by the probability distribution, so output sounds more natural. The model can wander and lose coherence.
- Top-K sampling: randomly sample from the K highest-probability tokens. Keeps responses relevant while reducing repetition.
- Top-p sampling: sample from the smallest set of highest-probability tokens whose cumulative probability reaches p. Adapts the candidate pool dynamically.
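The decoding strategies above can be compared on a toy next-token distribution. This is a minimal sketch, assuming the common nucleus definition in which tokens are added in descending probability until their cumulative probability reaches p; the vocabulary and probabilities are made up for the example.

```python
import random

def renormalise(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalise(dict(ranked[:k]))

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalise(kept)

def sample(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

next_token_probs = {"cake": 0.5, "donut": 0.3, "banana": 0.15, "apple": 0.05}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
choice = sample(top_p_filter(next_token_probs, 0.75), rng)
```

With these numbers, top-k with k=2 and top-p with p=0.75 both keep "cake" and "donut", but top-p would widen the pool automatically if the distribution were flatter.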
Temperature
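A scaling factor applied inside the final softmax layer (the logits are divided by it) that shapes the probability distribution: a temperature below 1 sharpens the distribution towards the most likely tokens, while a temperature above 1 flattens it and increases randomness.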
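The effect of temperature is easiest to see numerically. The sketch below divides the logits by the temperature before applying softmax, which is the usual formulation; the logit values are arbitrary examples.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.5)  # more peaked
base = softmax_with_temperature(logits, temperature=1.0)  # unchanged softmax
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter
```

The most likely token gets a larger share under `cool` than under `base`, and a smaller share under `hot`, while the ranking of tokens stays the same.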
06 Generative AI Project Lifecycle
- Scope: decide the function — single task models may only require small models
- Select: in general, starting with an existing foundation model is better than training from scratch
- Adapt & Align: start with in-context learning; add fine-tuning and RLHF if performance isn't satisfactory
- Application integration: optimise for inference, build additional infrastructure as required
07 LLM Pre-training
LLMs encode deep statistical representations of language during pre-training. Model weights are updated to minimise the loss function. LLMs require massive amounts of unstructured text — web scrapes require extensive cleaning, resulting in only 1–3% of original tokens actually being used for training.
Model Cards — read them. They document how the model was trained, what it's good for, its known limitations, and its architecture and pre-training objectives.
Three Pre-training Architectures
| Architecture | Training Objective | Context | Best For | Examples |
|---|---|---|---|---|
| Encoder-only | Masked Language Modeling (MLM) — predict masked tokens | Bidirectional | Classification, NER, sentiment | BERT, RoBERTa |
| Decoder-only | Causal Language Modeling (CLM) — predict next token | Unidirectional | Text generation, zero-shot | GPT, BLOOM, LLaMA |
| Encoder-Decoder | Span corruption — mask spans, reconstruct | Both | Translation, summarisation, Q&A | T5, BART |
08 Computation Challenges & Quantization
Training a 1B parameter model at full 32-bit precision requires at minimum ~24GB of GPU memory. Each parameter needs 4 bytes for its weight plus roughly 20 extra bytes during training (8 for Adam optimizer states, 4 for gradients, and 8 for activations and temporary memory), or about 24 bytes in total.
Quantization reduces memory by lowering numerical precision: FP32 stores each value in 4 bytes, FP16 and BFloat16 in 2 bytes, and INT8 in 1 byte.
BFloat16 (Google Brain's format) is a hybrid of FP32 and FP16 — it keeps the full 8-bit exponent of FP32 (preserving range) but truncates the fraction from 23 to 7 bits. This leads to training stability while halving memory. Most modern models are trained with BFloat16.
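The relationship between FP32 and BFloat16 can be shown by bit manipulation: dropping the lower 16 bits of a float32 leaves the sign bit, the 8 exponent bits, and the top 7 fraction bits. The sketch below uses plain truncation as a simplifying assumption, whereas real hardware typically rounds to nearest; the function names are illustrative.

```python
import struct

def to_bfloat16_bits(x):
    """Return the 16-bit pattern obtained by truncating a float32 to bfloat16."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keep sign (1) + exponent (8) + top 7 fraction bits

def from_bfloat16_bits(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

def bfloat16_round_trip(x):
    """Show what value survives storage in (truncated) bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))

pi_bf16 = bfloat16_round_trip(3.141592653589793)  # precision is lost
huge_bf16 = bfloat16_round_trip(1e38)             # range is preserved
```

The second value is far beyond what FP16 can represent, yet it survives in BFloat16 with only a small relative error, which is the range-versus-precision trade the format makes.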
Impact: FP16/BF16 reduces GPU memory from 24GB to 12GB for a 1B model. INT8 reduces it to 6GB. But 100B+ parameter models still require hundreds of GPUs, which is where distributed training comes in.
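The memory figures in this section follow from simple arithmetic. The sketch below reproduces them under the notes' own simplification that every stored value (weights, Adam states, gradients, activations) shrinks in proportion to the chosen precision; in practice some of these are often kept at higher precision, so treat the output as a rough lower bound.

```python
# Values stored per parameter during training, counted in units of one number each.
# At FP32 (4 bytes per value) this gives 4 + 8 + 4 + 8 = 24 bytes per parameter.
VALUES_PER_PARAM = {"weights": 1, "adam_states": 2, "gradients": 1, "activations": 2}
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def training_memory_gb(n_params, precision="fp32"):
    """Approximate GPU memory for training, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = sum(VALUES_PER_PARAM.values()) * BYTES_PER_VALUE[precision]
    return n_params * bytes_per_param / 1e9

one_billion = 10**9
estimates = {p: training_memory_gb(one_billion, p) for p in BYTES_PER_VALUE}
```

For a 1B parameter model this yields 24 GB at FP32, 12 GB at FP16 or BF16, and 6 GB at INT8, matching the figures quoted above.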
09 Multi-GPU Training Strategies
Distributed Data Parallel (DDP)
PyTorch's DDP copies the full model onto each GPU and sends different data batches in parallel. Each GPU processes its batch independently, then a synchronisation step combines gradients and updates all GPU copies identically.
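The synchronisation step can be simulated without any GPUs. The sketch below is a plain-Python stand-in for the idea, not PyTorch's DDP API: each "replica" holds a copy of a one-parameter model y = w * x, computes a gradient on its own shard, and then all replicas apply the same averaged gradient.

```python
def local_gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across GPUs."""
    return sum(values) / len(values)

def ddp_step(replica_weights, shards, lr):
    """One training step: local backward pass, gradient sync, identical update."""
    grads = [local_gradient(w, shard) for w, shard in zip(replica_weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * synced for w in replica_weights]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[:2], data[2:]]   # each simulated GPU sees a different batch
replicas = [0.0, 0.0]           # identical copies of the model weight
replicas = ddp_step(replicas, shards, lr=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full batch, which is why every copy stays identical after the update.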
Fully Sharded Data Parallel (FSDP) & ZeRO
When the model doesn't fit on a single GPU, use model sharding. PyTorch's FSDP is motivated by Microsoft's ZeRO (Zero Redundancy Optimizer) paper (2019), which distributes model states across GPUs with zero data overlap.
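In DDP, weights, optimizer states, and gradients are all stored redundantly on every GPU. ZeRO eliminates this redundancy in three stages: Stage 1 shards only the optimizer states across GPUs, Stage 2 also shards the gradients, and Stage 3 shards the model parameters as well.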
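A rough per-GPU memory estimate shows what each ZeRO stage saves. The sketch assumes the usual staging (stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds the weights) and reuses the FP32 byte counts from the quantization section: 4 bytes for weights, 4 for gradients, 8 for Adam states. Activations and communication buffers are ignored, so the numbers are illustrative only.

```python
BYTES_PER_PARAM = {"weights": 4, "gradients": 4, "optimizer_states": 8}

# Which model states are split across GPUs at each stage (0 = plain DDP).
SHARDED_AT_STAGE = {
    0: set(),
    1: {"optimizer_states"},
    2: {"optimizer_states", "gradients"},
    3: {"optimizer_states", "gradients", "weights"},
}

def per_gpu_memory_gb(n_params, n_gpus, stage):
    """Approximate model-state memory held by each GPU, in GB (1e9 bytes)."""
    bytes_per_param = 0.0
    for state, size in BYTES_PER_PARAM.items():
        # Sharded states are divided evenly; everything else is fully replicated.
        bytes_per_param += size / n_gpus if state in SHARDED_AT_STAGE[stage] else size
    return n_params * bytes_per_param / 1e9

by_stage = {stage: per_gpu_memory_gb(10**9, 8, stage) for stage in SHARDED_AT_STAGE}
```

For a 1B parameter model on 8 GPUs this gives 16 GB per GPU with full replication, then 9 GB, 5.5 GB, and 2 GB for stages 1 to 3.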
10 Scaling Laws & Chinchilla
Performance (minimising loss) can be improved by increasing dataset size (tokens) or model size (parameters), subject to a compute budget (GPUs × time × cost). Both follow power-law relationships with compute.
1 petaFLOP/s-day is the number of floating point operations performed at 1 petaFLOP per second for one full day. That is roughly what 8 NVIDIA V100 GPUs, or 2 NVIDIA A100s, deliver when running at full efficiency for 24 hours.
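The unit is easier to reason about once converted to a raw operation count. A short sketch of the arithmetic, with helper names chosen for this example:

```python
PETA = 10**15                    # operations per second at 1 petaFLOP/s
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

def petaflop_s_days_to_flops(days):
    """Convert a compute budget in petaFLOP/s-days to total floating point operations."""
    return days * PETA * SECONDS_PER_DAY

one_day = petaflop_s_days_to_flops(1)
```

One petaFLOP/s-day therefore corresponds to 8.64 × 10^19 floating point operations.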
The Chinchilla paper (DeepMind, 2022) found that most large models are over-parameterised and under-trained. The compute-optimal training dataset size is approximately 20× the number of parameters. GPT-3 (175B params) would need ~3.5T tokens — it was only trained on 300B.
| Model | # Parameters | Compute-optimal tokens (~20×) | Actual tokens |
|---|---|---|---|
| Chinchilla | 70B | ~1.4T | 1.4T ✓ |
| LLaMA-65B | 65B | ~1.3T | 1.4T ≈ ✓ |
| GPT-3 | 175B | ~3.5T | 300B — under-trained |
| OPT-175B | 175B | ~3.5T | 180B — under-trained |
| BLOOM | 176B | ~3.5T | 350B — under-trained |
Compute-optimal Chinchilla models outperform larger but under-trained models on a wide range of downstream evaluation tasks. Smaller, well-trained models can outperform larger ones.
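The 20-tokens-per-parameter rule of thumb can be applied directly to the models in the table. The sketch below uses only the parameter and token counts listed there; the 20× ratio is the approximation quoted in these notes, not an exact law.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla ratio

def compute_optimal_tokens(n_params):
    """Training tokens suggested by the ~20x rule of thumb."""
    return TOKENS_PER_PARAM * n_params

def training_ratio(n_params, actual_tokens):
    """Fraction of the compute-optimal token budget a model was trained on."""
    return actual_tokens / compute_optimal_tokens(n_params)

B, T = 10**9, 10**12
models = {
    "Chinchilla": (70 * B, 14 * T // 10),
    "LLaMA-65B": (65 * B, 14 * T // 10),
    "GPT-3": (175 * B, 300 * B),
    "OPT-175B": (175 * B, 180 * B),
    "BLOOM": (176 * B, 350 * B),
}
ratios = {name: training_ratio(p, t) for name, (p, t) in models.items()}
```

A ratio near 1 indicates a roughly compute-optimal model; by this measure GPT-3 saw under a tenth of the tokens the rule would suggest.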
11 Pre-training for Domain Adaptation
Existing LLMs may not suit specialised domains such as law (rare vocabulary, and everyday words used with a different meaning than in general language) or medicine (abundant abbreviations and uncommon terms).
BloombergGPT is a large decoder-only model pre-trained for finance. Trained on 51% financial data (news, reports, market data) and 49% public data. Used Chinchilla scaling laws as guidance. Target: 50B params and 1.4T tokens. Reality: only about 700B tokens could be assembled because financial data is scarce, and training was stopped early after 569B tokens. A good illustration of real-world pre-training tradeoffs.
12 Key Papers & Resources
- Attention is All You Need — the 2017 paper that introduced the Transformer architecture
- Language Models are Few-Shot Learners — the GPT-3 paper on few-shot learning
- Training Compute-Optimal Large Language Models — the Chinchilla paper by DeepMind
- BloombergGPT: A Large Language Model for Finance
- Scaling Laws for Neural Language Models — OpenAI empirical study
- HuggingFace Tasks and Model Hub — practical resources for all ML tasks
- LLaMA: Open and Efficient Foundation Language Models — Meta AI, 13B outperforms GPT-3 175B on most benchmarks