[Figure: self-attention map for the sentence "The teacher taught the student"]

Course Notes · Generative AI with LLMs · Coursera

Transformers, Pre-training & the GenAI Lifecycle

Week 1 deep-dive: what LLMs are, how the Transformer architecture works from tokenisation through softmax, three model variants, prompting strategies, generative configuration, the project lifecycle, quantization, distributed training, and Chinchilla scaling laws.

01 Generative AI & Large Language Models

Generative AI — machines capable of creating content that mimics human ability — is a subset of traditional machine learning. The models learn by finding statistical patterns in massive datasets of human-created content.

Large Language Models (LLMs) are trained on billions of words from the internet. Base models include GPT, FLAN-T5, BERT, LLaMA, BLOOM, and PaLM. Models are differentiated by parameter count — larger models have more memory and can tackle more complex tasks. Smaller models can be fine-tuned for narrow, focused tasks.

A model's apparent understanding of language tends to improve as its parameter count grows. It is this learned understanding that processes the prompt and solves the task the user has asked for.

Key Terminology

Prompt
Human input to the model to perform a task
Context Window
Memory available to the prompt — typically a few thousand words; differs per model
Completion
The output the model generates in response to a prompt
Inference
The act of using a trained model to generate text

LLM Use Cases & Tasks

Predicting the next word is the basis of all LLM output. From this single objective, a vast range of tasks emerges:

  • Essay and long-form writing
  • Summarise long text into short output
  • Machine translation between languages
  • Generate code from natural language
  • Information retrieval — named entity extraction
  • Augmenting LLMs by interacting with external APIs and databases — providing information not available at pre-training time

02 Transformer Architecture

Before Transformers — RNNs

Recurrent Neural Networks (RNNs) were powerful for their time but limited by computing and memory requirements. Even after scaling, they struggled to look beyond immediately previous words to understand language in the context of an entire sentence, paragraph, or document.

The breakthrough came in 2017 with the paper "Attention Is All You Need", published by researchers at Google and the University of Toronto. Transformers allowed efficient scaling, parallel processing of input data, and the ability to learn which words to pay attention to.

The power of the Transformer lies in its ability to learn the mapping between every word in a sentence and every other word — simultaneously, not sequentially. This is Self-Attention.

[Figure: Transformer architecture. Encoder, bottom to top: inputs, embedding layer, positional encoding, multi-head attention, feed-forward network. Decoder, bottom to top: inputs plus <SOS> encoding, masked self-attention, cross-attention (encoder to decoder), feed-forward network, softmax output.]

Step-by-step: How data flows

  1. Tokeniser: converts text to integer token IDs. The same tokeniser must be used for training and inference.
  2. Embedding layer: maps each token ID to a high-dimensional vector. Vectors encode meaning and context. The original Transformer uses 512-dimensional vectors.
  3. Positional encoding: added to the embeddings to preserve word order, since the input is processed in parallel (not sequentially).
  4. Self-attention: the model analyses relationships between all tokens simultaneously. Attention weights identify how much each word depends on every other word.
  5. Multi-head attention: multiple sets of attention weights are learned in parallel (12–100 heads are common). Each head learns a different linguistic aspect, for example one head for named entities and another for activities. Weights are randomly initialised and emerge from training.
  6. Feed-forward network: processes the attended representations into a vector of logits, one per token in the vocabulary.
  7. Softmax layer: converts the logits into probabilities over the full vocabulary, one probability per possible next token.
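The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real Transformer: the vocabulary, the 4-dimensional embedding table, and the function names are all hypothetical, and the attention step uses the input vectors directly as queries, keys, and values (a real model learns separate projection matrices for each head).

```python
import math

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: sine on even dims, cosine on odd dims."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def self_attention(vectors):
    """Scaled dot-product self-attention.

    Simplification: queries, keys, and values are the input vectors
    themselves (no learned projections, single head).
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)  # how much this token attends to every token
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

# Hypothetical toy vocabulary and embedding table (real ones are learned).
vocab = {"the": 0, "teacher": 1, "taught": 2, "student": 3}
embedding_table = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.1, 0.3, 0.0],
    [0.2, 0.8, 0.1, 0.4],
    [0.8, 0.2, 0.4, 0.1],
]
d_model = 4

def encode(text):
    """Step 1: tokenise text into integer token IDs."""
    return [vocab[word] for word in text.lower().split()]

token_ids = encode("The teacher taught the student")

# Steps 2 and 3: look up embeddings and add positional encodings.
inputs = [
    [e + p for e, p in zip(embedding_table[tok], positional_encoding(pos, d_model))]
    for pos, tok in enumerate(token_ids)
]

# Step 4: every token attends to every other token at once.
attended, attention_weights = self_attention(inputs)
```

Each row of `attention_weights` is one token's attention distribution over the whole sentence, which is the quantity an attention map visualises.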

03 Transformer Model Variants

The encoder and decoder components can be used independently or together, giving rise to three distinct model families:

Encoder-Only
Autoencoding · MLM
  • Masked language modeling: predict masked tokens (denoising)
  • Builds bidirectional context representations
  • Sentiment analysis, NER, word classification
  • Examples: BERT, RoBERTa

Encoder-Decoder
Seq2Seq · Span Corruption
  • Input and output sequences can have different lengths
  • T5 uses span corruption: mask spans, reconstruct via decoder
  • Translation, summarisation, Q&A
  • Examples: T5, BART

Decoder-Only
Autoregressive · CLM
  • Causal language modeling: predict next token from previous tokens
  • Unidirectional context only
  • Text generation, zero-shot capabilities
  • Examples: GPT, BLOOM, LLaMA, Jurassic

04 Prompting & In-context Learning

Developing and improving a prompt is Prompt Engineering. Providing examples inside the prompt is In-context Learning (ICL) — it helps the model learn the task at hand without updating any weights.

Zero-shot
Only the task description is provided. Large models can often predict correctly; smaller models may fail.
One-shot
One example is included in the prompt alongside the task. Helps smaller models understand the required format.
Few-shot
Multiple examples provided. Even smaller models that fail at one-shot can succeed with more examples — up to the context window limit.
Larger models generally understand tasks better from the description alone. Smaller models benefit significantly from few-shot examples, as long as the examples fit within the context window.
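As a rough sketch of how zero-, one-, and few-shot prompts differ, the helper below assembles a prompt from a task description and a list of worked examples. The function name `build_prompt`, the "Review:/Sentiment:" layout, and the sample reviews are illustrative assumptions, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt; len(examples) decides zero-, one-, or few-shot."""
    parts = []
    for text, label in examples:
        # Each worked example shows the model the expected input/output format.
        parts.append(f"{task}\nReview: {text}\nSentiment: {label}")
    # The final block is left open for the model to complete.
    parts.append(f"{task}\nReview: {query}\nSentiment:")
    return "\n\n".join(parts)

task = "Classify this review."

zero_shot = build_prompt(task, [], "I loved this movie!")

few_shot = build_prompt(
    task,
    [
        ("I loved this movie!", "Positive"),
        ("The chair is uncomfortable.", "Negative"),
    ],
    "Who would ever watch this?",
)
```

The whole assembled string, examples included, has to fit inside the model's context window, which is why the number of shots is bounded.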

05 Generative Configuration

Configuration parameters invoked at inference time to influence next-token selection:

Max New Tokens

Limits the number of tokens the model generates. Does not guarantee the model will reach that limit — it may emit an <EOS> token earlier.

Greedy vs Random Sampling

  • Greedy (default): always selects the token with the highest probability. Susceptible to repetitive sequences.
  • Random sampling: selects a token at random, weighted by the probability distribution. Sounds more natural, but the model can wander and lose coherence.
  • Top-K sampling: randomly samples from only the K highest-probability tokens. Keeps responses relevant without repetition.
  • Top-p sampling: samples from the smallest set of highest-probability tokens whose cumulative probability reaches p. Adapts the candidate pool dynamically.
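The decoding strategies above can be compared on a toy next-token distribution. This is a minimal sketch: the probabilities and function names are made up for illustration, and top-p is implemented under the common convention of adding tokens in descending order of probability until the running total reaches p.

```python
import random

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Random sampling weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical softmax output for the next token.
next_token_probs = {"cake": 0.5, "donut": 0.3, "banana": 0.15, "apple": 0.05}

rng = random.Random(0)  # seeded so the sketch is repeatable
greedy_choice = greedy(next_token_probs)
top_k_pool = top_k_filter(next_token_probs, k=3)
top_p_pool = top_p_filter(next_token_probs, p=0.75)
sampled = sample(top_p_pool, rng)
```

With these numbers the top-K pool (K = 3) keeps three tokens regardless of how confident the model is, while the top-p pool (p = 0.75) keeps only "cake" and "donut" because those two already cover 0.8 of the probability mass.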

Temperature

Temperature is a scaling factor applied inside the softmax layer (the logits are divided by it before normalisation) that shapes the probability distribution:

  • Temperature < 1 (cold): strongly peaked distribution, more deterministic and conservative. The top token (e.g. "cake") dominates.
  • Temperature = 1: the default softmax distribution.
  • Temperature > 1 (warm): broader, flatter distribution, more creative variety.
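A small sketch of the effect, assuming the usual formulation in which the logits are divided by the temperature before the softmax. The logit values and the function name are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens.
logits = [2.0, 1.0, 0.5, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)     # peaked
default = softmax_with_temperature(logits, temperature=1.0)  # unchanged softmax
warm = softmax_with_temperature(logits, temperature=2.0)     # flatter
```

The top token's probability is highest in `cold` and lowest in `warm`, so sampling from `warm` picks less likely tokens more often.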

06 Generative AI Project Lifecycle

  1. Scope: define the function (single task or many?) and the compute budget.
  2. Select: use an existing foundation model or train your own?
  3. Adapt & Align: in-context learning → fine-tuning → RLHF.
  4. Integrate: optimise for deployment and build the LLM-powered app.

  • Scope: decide the function — single task models may only require small models
  • Select: in general, starting with an existing foundation model is better than training from scratch
  • Adapt & Align: start with in-context learning; add fine-tuning and RLHF if performance isn't satisfactory
  • Application integration: optimise for inference, build additional infrastructure as required

07 LLM Pre-training

LLMs encode deep statistical representations of language during pre-training. Model weights are updated to minimise the loss function. LLMs require massive amounts of unstructured text — web scrapes require extensive cleaning, resulting in only 1–3% of original tokens actually being used for training.

Model Cards — read them. They document how the model was trained, what it's good for, its known limitations, and its architecture and pre-training objectives.

Three Pre-training Architectures

| Architecture | Training objective | Context | Best for | Examples |
|---|---|---|---|---|
| Encoder-only | Masked Language Modeling (MLM): predict masked tokens | Bidirectional | Classification, NER, sentiment | BERT, RoBERTa |
| Decoder-only | Causal Language Modeling (CLM): predict next token | Unidirectional | Text generation, zero-shot | GPT, BLOOM, LLaMA |
| Encoder-Decoder | Span corruption: mask spans, reconstruct | Both | Translation, summarisation, Q&A | T5, BART |
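To make the three objectives concrete, the sketch below builds one training example of each kind from the same sentence. The helper names, the `<MASK>` token, and the `<X>` sentinel are placeholders chosen for illustration; real tokenisers define their own special tokens and pick masked positions at random.

```python
def mlm_example(tokens, masked_positions):
    """Encoder-only: hide some tokens, predict them from both sides."""
    inputs = ["<MASK>" if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

def clm_examples(tokens):
    """Decoder-only: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def span_corruption_example(tokens, start, end):
    """Encoder-decoder: replace a span with a sentinel, decode the span."""
    sentinel = "<X>"  # hypothetical sentinel token
    inputs = tokens[:start] + [sentinel] + tokens[end:]
    targets = [sentinel] + tokens[start:end]
    return inputs, targets

tokens = "The teacher teaches the student".split()

mlm_inputs, mlm_targets = mlm_example(tokens, masked_positions={2})
clm_pairs = clm_examples(tokens)
span_inputs, span_targets = span_corruption_example(tokens, start=1, end=3)
```

Here `mlm_inputs` is the sentence with "teaches" hidden, `clm_pairs` contains four prefix-to-next-token pairs, and `span_inputs` replaces "teacher teaches" with a single sentinel that the decoder must expand.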

08 Computation Challenges & Quantization

Training a 1B-parameter model at FP32 requires roughly 24 GB of GPU memory at minimum. The weights alone take 4 bytes per parameter (~4 GB). Training adds about 20 extra bytes per parameter: 8 bytes for Adam optimizer states, 4 bytes for gradients, and 8 bytes for activations, for a total of ~24 bytes per parameter.
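The arithmetic can be written out as a back-of-the-envelope estimate. The per-parameter byte counts are the approximate figures quoted above; the function names are illustrative, and real usage varies with batch size, sequence length, and framework overhead.

```python
# Approximate FP32 bytes per parameter during training (figures from the text).
BYTES_PER_PARAM_FP32 = {
    "weights": 4,
    "adam_optimizer_states": 8,
    "gradients": 4,
    "activations": 8,
}

def weights_only_gb(num_params):
    """Memory needed just to hold the weights (e.g. for loading the model)."""
    return num_params * BYTES_PER_PARAM_FP32["weights"] / 1e9

def training_memory_gb(num_params):
    """Rough lower bound on memory needed to train at full precision."""
    return num_params * sum(BYTES_PER_PARAM_FP32.values()) / 1e9

one_billion = 1_000_000_000
storage_gb = weights_only_gb(one_billion)      # 4.0
training_gb = training_memory_gb(one_billion)  # 24.0
```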

Quantization reduces memory by lowering numerical precision:

| Format | Bits | Exponent | Fraction | Memory / value |
|---|---|---|---|---|
| FP32 | 32 | 8 bits | 23 bits | 4 bytes (default precision) |
| FP16 | 16 | 5 bits | 10 bits | 2 bytes (½ memory) |
| BFloat16 | 16 | 8 bits | 7 bits | 2 bytes (stable training; recommended) |
| INT8 | 8 | none | 7 bits | 1 byte (¼ memory) |

BFloat16 (Google Brain's format) is a hybrid of FP32 and FP16 — it keeps the full 8-bit exponent of FP32 (preserving range) but truncates the fraction from 23 to 7 bits. This leads to training stability while halving memory. Most modern models are trained with BFloat16.
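The relationship between FP32 and BFloat16 can be illustrated with the standard library alone: a BFloat16 value is the top 16 bits of the FP32 bit pattern (1 sign bit, 8 exponent bits, 7 fraction bits). The sketch below simply truncates the lower 16 bits, as described above; real hardware and libraries typically round to nearest instead, and the function names here are illustrative.

```python
import struct

def float32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_bfloat16_by_truncation(x):
    """Zero the low 16 bits of the FP32 pattern, keeping sign, exponent, and 7 fraction bits."""
    truncated = float32_bits(x) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]

pi_bf16 = to_bfloat16_by_truncation(3.141592653589793)  # 3.140625
large_bf16 = to_bfloat16_by_truncation(1e38)            # still finite
```

Precision drops to two or three significant decimal digits (pi becomes 3.140625), but a value as large as 1e38 remains representable because the 8-bit exponent is untouched. FP16, with its 5-bit exponent, tops out at 65504.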

Impact: FP16/BF16 reduces GPU memory from 24GB → 12GB for a 1B model. INT8 reduces it to 6GB. But 100B+ parameter models still require 100s of GPUs — enter distributed training.

09 Multi-GPU Training Strategies

Distributed Data Parallel (DDP)

PyTorch's DDP copies the full model onto each GPU and sends different data batches in parallel. Each GPU processes its batch independently, then a synchronisation step combines gradients and updates all GPU copies identically.

Requires: weights + optimizer states + gradients all fit on a single GPU. Best for: parallel training when model fits on one GPU but you want a speed boost.
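The synchronisation step can be simulated without any GPUs. The sketch below is not PyTorch code: it uses a one-parameter model (y = w·x), two simulated replicas, and an `all_reduce_mean` helper standing in for the gradient all-reduce that DDP performs. All names and data are illustrative.

```python
def local_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on one replica's batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for the all-reduce that averages gradients across GPUs."""
    return sum(values) / len(values)

def ddp_step(replica_weights, shards, lr):
    """One training step: local backward passes, then a synchronised update."""
    grads = [local_gradient(w, shard) for w, shard in zip(replica_weights, shards)]
    avg_grad = all_reduce_mean(grads)
    # Every replica applies the same averaged gradient, so copies stay identical.
    return [w - lr * avg_grad for w in replica_weights]

# Toy dataset generated by y = 2x, split across two simulated GPUs.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

weights = [0.0, 0.0]  # each replica starts from the same full copy of the model
for _ in range(50):
    weights = ddp_step(weights, shards, lr=0.05)
```

After training, both replicas hold the same weight, close to 2.0. Each replica only ever saw half of the data, yet the averaged gradient equals the gradient over the full batch because the shards are the same size.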

Fully Sharded Data Parallel (FSDP) & ZeRO

When the model doesn't fit on a single GPU, use model sharding. PyTorch's FSDP is motivated by Microsoft's ZeRO (Zero Redundancy Optimizer) paper (2019), which distributes model states across GPUs with zero data overlap.

In DDP, weights, optimizer states, and gradients are all stored redundantly on every GPU. ZeRO eliminates this redundancy in three stages:

  • Baseline (DDP): parameters, gradients, and optimizer states are all stored redundantly on every GPU
  • Stage 1: shard optimizer states only (2× savings); parameters and gradients remain replicated
  • Stage 2: shard gradients and optimizer states; parameters remain replicated
  • Stage 3: shard everything (parameters, gradients, and optimizer states), giving savings that scale linearly with GPU count

More GPUs ≠ better indefinitely. Communication overhead between GPUs causes a ~7% performance decrease as GPU count increases. Hybrid sharding finds the best tradeoff.
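A rough per-GPU memory model for the three stages, assuming the FP32 byte counts used earlier in these notes (4 bytes for weights, 4 for gradients, 8 for Adam optimizer states) and ignoring activations and communication buffers. The function name is illustrative and the figures are estimates, not measurements.

```python
# Bytes per parameter for each model state (FP32 figures from these notes).
PARAM_BYTES, GRAD_BYTES, OPTIMIZER_BYTES = 4, 4, 8

def bytes_per_param_per_gpu(stage, num_gpus):
    """Model-state memory per parameter on each GPU for a given ZeRO stage.

    stage 0 is the DDP baseline (everything replicated).
    """
    optimizer = OPTIMIZER_BYTES / num_gpus if stage >= 1 else OPTIMIZER_BYTES
    grads = GRAD_BYTES / num_gpus if stage >= 2 else GRAD_BYTES
    params = PARAM_BYTES / num_gpus if stage >= 3 else PARAM_BYTES
    return params + grads + optimizer

num_gpus = 8
per_stage = {stage: bytes_per_param_per_gpu(stage, num_gpus) for stage in range(4)}
# {0: 16.0, 1: 9.0, 2: 5.5, 3: 2.0}
```

Under these assumptions stage 1 approaches 8 of the original 16 bytes as the GPU count grows (the 2× saving noted above), while stage 3 divides the full 16 bytes by the number of GPUs.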

10 Scaling Laws & Chinchilla

Performance (minimising loss) can be improved by increasing dataset size (tokens) or model size (parameters), subject to a compute budget (GPUs × time × cost). Both follow power-law relationships with compute.

[Figure: the goal is to minimise loss (maximise model performance). Three levers: compute budget, dataset size, and model size.]

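1 petaFLOP/s-day is the number of floating-point operations performed at a rate of 1 petaFLOP per second for one full day (about 8.64 × 10^19 operations). This is roughly what 8 NVIDIA V100 GPUs, or 2 NVIDIA A100 GPUs, deliver when running at full efficiency for a day.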

The Chinchilla paper (DeepMind, 2022) found that most large models are over-parameterised and under-trained. The compute-optimal training dataset size is approximately 20× the number of parameters. GPT-3 (175B params) would need ~3.5T tokens — it was only trained on 300B.

| Model | # Parameters | Compute-optimal tokens (~20×) | Actual tokens |
|---|---|---|---|
| Chinchilla | 70B | ~1.4T | 1.4T ✓ |
| LLaMA-65B | 65B | ~1.3T | 1.4T ≈ ✓ |
| GPT-3 | 175B | ~3.5T | 300B (under-trained) |
| OPT-175B | 175B | ~3.5T | 180B (under-trained) |
| BLOOM | 176B | ~3.5T | 350B (under-trained) |
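The ~20× rule of thumb in the table is simple enough to check directly. In this sketch the parameter and token counts are copied from the table; the function names and the 0.5 cut-off used to flag a model as under-trained are arbitrary choices for illustration.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla rule of thumb

def compute_optimal_tokens(num_params):
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * num_params

def training_ratio(num_params, actual_tokens):
    """Fraction of the compute-optimal token count actually used."""
    return actual_tokens / compute_optimal_tokens(num_params)

B, T = 10**9, 10**12

# (parameters, actual training tokens), taken from the table above.
models = {
    "Chinchilla": (70 * B, 1_400 * B),
    "LLaMA-65B": (65 * B, 1_400 * B),
    "GPT-3": (175 * B, 300 * B),
    "OPT-175B": (175 * B, 180 * B),
    "BLOOM": (176 * B, 350 * B),
}

ratios = {name: training_ratio(p, t) for name, (p, t) in models.items()}

# Illustrative threshold: flag models trained on under half the suggested tokens.
under_trained = [name for name, r in ratios.items() if r < 0.5]
```

Chinchilla lands at a ratio of exactly 1.0, while GPT-3 used under a tenth of the tokens the rule suggests.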

Compute-optimal Chinchilla models outperform larger but under-trained models on a wide range of downstream evaluation tasks. Smaller, well-trained models can outperform larger ones.

11 Pre-training for Domain Adaptation

Existing LLMs may not suit specialised domains such as law (rare vocabulary, and common words used in senses that differ from everyday usage) or medicine (abundant abbreviations and uncommon terms).

BloombergGPT is a large decoder-only model pre-trained for finance. It was trained on 51% financial data (news, reports, market data) and 49% public data, using the Chinchilla scaling laws as guidance. Target: 50B parameters, for which the ~20× rule above suggests roughly 1T tokens. Reality: about 700B tokens were acquired and training stopped at 569B because of data scarcity, a good illustration of real-world pre-training tradeoffs.

BloombergGPT demonstrates that domain-specific pre-training is powerful but constrained in practice. The team couldn't acquire a compute-optimal quantity of finance tokens, so the model was trained on fewer tokens than the scaling laws recommend, trading compute-optimality for domain coverage.

12 Key Papers & Resources