GenAI Use Cases, Lifecycle & Pre-training

01 Generative AI & Large Language Models

Generative AI — machines capable of creating content that mimics human ability — is a subset of traditional machine learning. The models learn by finding statistical patterns in massive datasets of human-created content.

Large Language Models (LLMs) are trained on billions of words from the internet. Base models include GPT, FLAN-T5, BERT, LLaMA, BLOOM, and PaLM. Models are differentiated by parameter count: larger models have more capacity (often described as the model's "memory") and can tackle more complex tasks, while smaller models can be fine-tuned for narrow, focused tasks.

As the number of parameters grows, so does the model's apparent understanding of language. That learned representation is what processes and solves the task the user has prompted.

Key Terminology

Prompt

Human input to the model to perform a task

Context Window

The total amount of text the model can take in for a prompt, typically a few thousand words; the limit differs per model

Completion

The output the model generates in response to a prompt

Inference

The act of using a trained model to generate text

LLM Use Cases & Tasks

Predicting the next word is the basis of all LLM output. From this single objective, a vast range of tasks emerges:

Essay and long-form writing
Summarise long text into short output
Machine translation between languages
Generate code from natural language
Information retrieval — named entity extraction
Augmenting LLMs by interacting with external APIs and databases — providing information not available at pre-training time

02 Transformer Architecture

Before Transformers — RNNs

Recurrent Neural Networks (RNNs) were powerful for their time but limited by computing and memory requirements. Even after scaling, they struggled to look beyond immediately previous words to understand language in the context of an entire sentence, paragraph, or document.

The breakthrough came in 2017 with "Attention is All You Need", published by Google and the University of Toronto. Transformers allowed efficient scaling, parallel processing of input data, and the ability to learn which words to pay attention to.

The power of the Transformer lies in its ability to learn the mapping between every word in a sentence and every other word — simultaneously, not sequentially. This is Self-Attention.

Step-by-step: How data flows

1. Tokeniser: converts text to integer token IDs. The same tokeniser must be used for training and inference.
2. Embedding layer: maps each token ID to a high-dimensional vector. Vectors encode meaning and context. The original Transformer uses 512-dimensional vectors.
3. Positional encoding: added to embeddings to preserve word order, since input is processed in parallel (not sequentially).
4. Self-Attention: the model analyses relationships between all tokens simultaneously. Attention weights identify how much each word depends on every other word.
5. Multi-Head Attention: multiple sets of attention weights learned in parallel (12–100 heads are common). Each head learns a different linguistic aspect, e.g. one head for named entities, another for activities. Weights are randomly initialised and emerge from training.
6. Feed-forward network: processes the attended representations into a vector of logits, one per token in the vocabulary.
7. Softmax layer: converts logits into probabilities over the full vocabulary, one probability per possible next token.
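The flow above can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the vocabulary size, the token IDs, and every weight matrix are random placeholders chosen for the example, there is a single attention head, and the feed-forward step is reduced to one linear projection (an assumption made to keep the sketch short).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors (d_model is assumed to be even)."""
    positions = np.arange(seq_len)[:, None]       # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all tokens at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # row i: how much token i attends to each token
    return weights @ v, weights

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8             # toy sizes; the original paper uses d_model = 512
token_ids = np.array([3, 1, 4, 1])      # step 1: pretend tokeniser output

embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                         # step 2: embedding lookup
x = x + positional_encoding(len(token_ids), d_model)   # step 3: add position information

w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
attended, attn_weights = self_attention(x, w_q, w_k, w_v)  # step 4

w_out = rng.normal(size=(d_model, vocab_size))
logits = attended[-1] @ w_out           # step 6, simplified to one linear layer
next_token_probs = softmax(logits)      # step 7: one probability per vocabulary entry
```

Each row of `attn_weights` sums to 1, and `next_token_probs` is a distribution over the toy vocabulary. Multi-head attention would repeat `self_attention` with separate weight matrices per head and concatenate the results.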

03 Transformer Model Variants

The encoder and decoder components can be used independently or together, giving rise to three distinct model families:

Encoder-Only

Autoencoding · MLM

Masked language modeling — predict masked tokens (denoising)
Builds bidirectional context representations
Sentiment analysis, NER, word classification

BERT, RoBERTa

Encoder-Decoder

Seq2Seq · Span Corruption

Input and output sequences can have different lengths
T5 uses span corruption: mask spans, reconstruct via decoder
Translation, summarisation, Q&A

T5, BART

Decoder-Only

Autoregressive · CLM

Causal language modeling — predict next token from previous tokens
Unidirectional context only
Text generation, zero-shot capabilities

GPT, BLOOM, LLaMA, Jurassic

04 Prompting & In-context Learning

Developing and improving a prompt is Prompt Engineering. Providing examples inside the prompt is In-context Learning (ICL) — it helps the model learn the task at hand without updating any weights.

Zero-shot

Only the task description is provided. Large models can often predict correctly; smaller models may fail.

One-shot

One example is included in the prompt alongside the task. Helps smaller models understand the required format.

Few-shot

Multiple examples provided. Even smaller models that fail at one-shot can succeed with more examples — up to the context window limit.

Larger models generally understand tasks better from the description alone. Smaller models benefit significantly from few-shot examples, as long as the examples fit within the context window.
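The difference between zero-, one-, and few-shot prompting is only the number of worked examples placed before the query. A minimal sketch follows; the `build_prompt` helper, the "Review / Sentiment" layout, and the example reviews are all hypothetical and chosen just for illustration.

```python
def build_prompt(task, examples, query):
    """Assemble an in-context learning prompt; len(examples) sets the 'shot' count."""
    blocks = [task]
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The final block is left open so the model completes the label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

task = "Classify the sentiment of each review as Positive or Negative."
examples = [
    ("I loved this film.", "Positive"),
    ("The plot made no sense.", "Negative"),
]
query = "The acting was superb."

zero_shot = build_prompt(task, [], query)
one_shot = build_prompt(task, examples[:1], query)
few_shot = build_prompt(task, examples, query)
print(few_shot)
```

Every added example consumes part of the context window, so the number of shots is bounded by the model's limit.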

05 Generative Configuration

Configuration parameters invoked at inference time to influence next-token selection:

Max New Tokens

Limits the number of tokens the model generates. Does not guarantee the model will reach that limit — it may emit an <EOS> token earlier.

Greedy vs Random Sampling

Greedy (default): always selects the token with the highest probability. Susceptible to repetitive sequences.
Random sampling: selects a token at random, weighted by the probability distribution. Sounds more natural, but the model can wander and lose coherence.
Top-K sampling: randomly samples from the K highest-probability tokens. Keeps responses relevant without repetition.
Top-p (nucleus) sampling: randomly samples from the highest-probability tokens whose combined probability reaches p, so the candidate pool adapts dynamically to the shape of the distribution.
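The selection strategies can be compared on a toy distribution. The sketch below is an illustration under stated assumptions: the five-word vocabulary and its probabilities are made up, and top-p is read as "keep the most probable tokens until their cumulative probability reaches p".

```python
import numpy as np

def greedy(probs):
    """Index of the single most probable token."""
    return int(np.argmax(probs))

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalise."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # number of tokens kept
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

vocab = ["cake", "donut", "banana", "apple", "bread"]   # hypothetical vocabulary
probs = np.array([0.20, 0.10, 0.02, 0.01, 0.67])

rng = np.random.default_rng(0)
greedy_token = vocab[greedy(probs)]
top_k_probs = top_k_filter(probs, k=3)
top_p_probs = top_p_filter(probs, p=0.8)
sampled_token = vocab[rng.choice(len(vocab), p=top_p_probs)]
```

With these numbers, top-K with K = 3 keeps "bread", "cake", and "donut", while top-p with p = 0.8 keeps only "bread" and "cake" because their combined probability (0.87) already reaches the threshold.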

Temperature

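A scaling factor applied to the logits within the final softmax layer, which shapes the next-token probability distribution:

Temperature = 1: the default softmax distribution.
Temperature > 1: a broader, flatter distribution; more random and creative output.
Temperature < 1: a strongly peaked distribution; more deterministic and conservative output.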
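A small numerical sketch makes the effect visible. It assumes the usual formulation in which the logits are divided by the temperature before the softmax; the three logit values are arbitrary.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]            # arbitrary example logits
cool = softmax_with_temperature(logits, 0.5)      # peaked
default = softmax_with_temperature(logits, 1.0)   # unchanged softmax
warm = softmax_with_temperature(logits, 2.0)      # flatter

for name, dist in [("T=0.5", cool), ("T=1.0", default), ("T=2.0", warm)]:
    print(name, np.round(dist, 3))
```

The probability of the top token shrinks as the temperature rises, which is why higher temperatures produce more varied completions.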

06 Generative AI Project Lifecycle

1. Scope

Define function — single task or many? Compute budget?

2. Select

Existing foundation model or train your own?

3. Adapt & Align

In-context learning → fine-tuning → RLHF

4. Integrate

Optimise for deployment, build LLM-powered app

Scope: decide the function — single task models may only require small models
Select: in general, starting with an existing foundation model is better than training from scratch
Adapt & Align: start with in-context learning; add fine-tuning and RLHF if performance isn't satisfactory
Application integration: optimise for inference, build additional infrastructure as required

07 LLM Pre-training

LLMs encode deep statistical representations of language during pre-training, when model weights are updated to minimise the loss function. This requires massive amounts of unstructured text. Web scrapes need extensive cleaning, so only about 1–3% of the original tokens typically end up being used for training.

Model Cards — read them. They document how the model was trained, what it's good for, its known limitations, and its architecture and pre-training objectives.

Three Pre-training Architectures

Architecture	Training Objective	Context	Best For	Examples
Encoder-only	Masked Language Modeling (MLM) — predict masked tokens	Bidirectional	Classification, NER, sentiment	BERT, RoBERTa
Decoder-only	Causal Language Modeling (CLM) — predict next token	Unidirectional	Text generation, zero-shot	GPT, BLOOM, LLaMA
Encoder-Decoder	Span corruption — mask spans, reconstruct	Both	Translation, summarisation, Q&A	T5, BART
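The three objectives in the table differ mainly in how a training example is built from raw text. The sketch below shows one way to construct inputs and targets for each. The function names, the `<MASK>` and `<X>` placeholder tokens, and the fixed mask positions are assumptions made for illustration; real pipelines pick positions at random and use each model's own special tokens.

```python
def mlm_example(tokens, mask_positions, mask_token="<MASK>"):
    """Encoder-only (MLM): hide chosen tokens; targets are the hidden originals."""
    inputs = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return inputs, targets

def clm_examples(tokens):
    """Decoder-only (CLM): every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def span_corruption_example(tokens, start, length, sentinel="<X>"):
    """Encoder-decoder: replace one span with a sentinel; the decoder rebuilds it."""
    encoder_input = tokens[:start] + [sentinel] + tokens[start + length:]
    decoder_target = [sentinel] + tokens[start:start + length]
    return encoder_input, decoder_target

tokens = "The teacher teaches the student".split()

mlm_inputs, mlm_targets = mlm_example(tokens, {2})
clm_pairs = clm_examples(tokens)
enc_input, dec_target = span_corruption_example(tokens, start=1, length=2)

print(mlm_inputs, mlm_targets)
print(clm_pairs[0], clm_pairs[-1])
print(enc_input, dec_target)
```

The MLM input still exposes tokens on both sides of the mask (bidirectional context), whereas each CLM pair only ever sees the tokens to the left of its target.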

08 Computation Challenges & Quantization

Training a 1B-parameter model at full 32-bit precision requires at minimum ~24GB of GPU memory. The weights alone take 4 bytes per parameter (~4GB). Training adds roughly 20 extra bytes per parameter: 8 bytes for Adam optimizer states, 4 bytes for gradients, and 8 bytes for activations. That comes to about 24 bytes per parameter in total.
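The arithmetic can be written out directly. This is a back-of-the-envelope sketch using the per-parameter byte counts quoted above and treating 1GB as 10^9 bytes; real usage also depends on batch size, sequence length, and framework overhead.

```python
# Approximate bytes per parameter during full-precision (FP32) training.
BYTES_PER_PARAM = {
    "weights": 4,
    "adam_optimizer_states": 8,
    "gradients": 4,
    "activations": 8,
}

def memory_gb(num_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for a given per-parameter byte count."""
    return num_params * bytes_per_param / 1e9

num_params = 1_000_000_000
weights_only_gb = memory_gb(num_params, BYTES_PER_PARAM["weights"])
training_gb = memory_gb(num_params, sum(BYTES_PER_PARAM.values()))

print(f"weights only: {weights_only_gb:.0f} GB, training: {training_gb:.0f} GB")
```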

Quantization reduces memory by lowering numerical precision:

Format	Exponent bits	Fraction (value) bits	Memory per value	Note
FP32	8	23	4 bytes	Default precision
FP16	5	10	2 bytes	½ memory
BFloat16	8	7	2 bytes	Stable training; recommended
INT8	—	7	1 byte	¼ memory

BFloat16 (Google Brain's format) is a hybrid of FP32 and FP16 — it keeps the full 8-bit exponent of FP32 (preserving range) but truncates the fraction from 23 to 7 bits. This leads to training stability while halving memory. Most modern models are trained with BFloat16.
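The "truncated FP32" description can be demonstrated with bit operations. The sketch below emulates BFloat16 by keeping only the top 16 bits of each FP32 value. It is an illustration, not how hardware does it: it truncates rather than rounds to nearest, and the helper names are made up for this example.

```python
import numpy as np

def to_bfloat16_bits(values):
    """Keep the sign, 8 exponent bits, and top 7 fraction bits of each FP32 value."""
    fp32 = np.asarray(values, dtype=np.float32)
    return (fp32.view(np.uint32) >> 16).astype(np.uint16)

def from_bfloat16_bits(bits):
    """Expand 16-bit patterns back to FP32 by zero-filling the dropped bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

values = np.array([1.0, 3.141592, 1e38], dtype=np.float32)
bf16 = from_bfloat16_bits(to_bfloat16_bits(values))

# FP16 has only 5 exponent bits, so very large values overflow to infinity.
with np.errstate(over="ignore"):
    fp16 = values.astype(np.float16)

print("bfloat16 (emulated):", bf16)
print("float16:            ", fp16)
```

The emulated BFloat16 values lose precision (pi keeps only two or three significant digits) but 1e38 stays finite, whereas FP16 cannot represent it at all. That preserved range is the reason given above for BFloat16's training stability.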

Impact: FP16/BF16 reduces GPU memory from 24GB → 12GB for a 1B model. INT8 reduces it to 6GB. But 100B+ parameter models still require 100s of GPUs — enter distributed training.

09 Multi-GPU Training Strategies

Distributed Data Parallel (DDP)

PyTorch's DDP copies the full model onto each GPU and sends different data batches in parallel. Each GPU processes its batch independently, then a synchronisation step combines gradients and updates all GPU copies identically.

Requires: weights + optimizer states + gradients all fit on a single GPU. Best for: parallel training when model fits on one GPU but you want a speed boost.
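The synchronisation step is easy to simulate without any GPUs. The NumPy sketch below is not PyTorch's DDP API; it only mimics the idea for a toy linear model: each "GPU" computes a gradient on its own slice of the batch, the gradients are averaged (what an all-reduce would compute), and every replica applies the same update. Equal-sized slices are assumed.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of mean squared error for a linear model y ≈ x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
num_gpus, per_gpu_batch, features = 4, 8, 3
x = rng.normal(size=(num_gpus * per_gpu_batch, features))
y = rng.normal(size=num_gpus * per_gpu_batch)
w = np.zeros(features)   # every replica starts from identical weights

# Each simulated GPU sees a different slice of the global batch.
local_grads = [
    mse_gradient(w, x_shard, y_shard)
    for x_shard, y_shard in zip(np.split(x, num_gpus), np.split(y, num_gpus))
]

# Synchronisation: average the per-GPU gradients.
synced_grad = np.mean(local_grads, axis=0)

# Every replica applies the same update, so the copies stay identical.
learning_rate = 0.1
replicas = [w - learning_rate * synced_grad for _ in range(num_gpus)]

full_batch_grad = mse_gradient(w, x, y)
print(np.allclose(synced_grad, full_batch_grad))
```

The averaged gradient matches the gradient of the whole batch computed on one device, which is why DDP speeds up training without changing the result. In PyTorch itself the wrapper is `torch.nn.parallel.DistributedDataParallel`.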

Fully Sharded Data Parallel (FSDP) & ZeRO

When the model doesn't fit on a single GPU, use model sharding. PyTorch's FSDP is motivated by Microsoft's ZeRO (Zero Redundancy Optimizer) paper (2019), which distributes model states across GPUs with zero data overlap.

In DDP, weights, optimizer states, and gradients are all stored redundantly on every GPU. ZeRO eliminates this redundancy in three stages:

Stage	Params	Grads	Optimizer states	Effect
Baseline (DDP)	Replicated	Replicated	Replicated	All redundant on every GPU
Stage 1	Replicated	Replicated	Sharded	Shard optimizer states only (2× savings)
Stage 2	Replicated	Sharded	Sharded	Shard grads + optimizer states
Stage 3	Sharded	Sharded	Sharded	Shard everything; linear savings with GPU count
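The stages can be compared with a rough per-GPU memory estimate. The sketch below assumes the full-precision byte counts used earlier in these notes (4 bytes for weights, 4 for gradients, 8 for Adam optimizer states) and ignores activations and communication buffers; the function name is hypothetical.

```python
def zero_bytes_per_param(stage, num_gpus, params=4, grads=4, optimizer=8):
    """Approximate per-GPU bytes per parameter for stage 0 (DDP baseline) to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if stage >= 1:
        optimizer = optimizer / num_gpus   # stage 1: shard optimizer states
    if stage >= 2:
        grads = grads / num_gpus           # stage 2: also shard gradients
    if stage >= 3:
        params = params / num_gpus         # stage 3: also shard parameters
    return params + grads + optimizer

num_gpus = 8
per_stage = {stage: zero_bytes_per_param(stage, num_gpus) for stage in range(4)}
for stage, size in per_stage.items():
    print(f"stage {stage}: {size} bytes per parameter per GPU")
```

Under these assumptions, stage 1 approaches a 2× saving as the GPU count grows (16 bytes down towards 8), while stage 3 divides the full 16 bytes by the number of GPUs.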

More GPUs ≠ better indefinitely. Communication overhead between GPUs causes ~7% performance decrease as GPU count increases. Hybrid sharding finds the best tradeoff.

10 Scaling Laws & Chinchilla

Performance (minimising loss) can be improved by increasing dataset size (tokens) or model size (parameters), subject to a compute budget (GPUs × time × cost). Both follow power-law relationships with compute.

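1 petaFLOP/s-day is the number of floating-point operations performed at a sustained 1 petaFLOP per second for one full day, about 8.64 × 10^19 operations. That is roughly what 8 NVIDIA V100 GPUs, or 2 NVIDIA A100 GPUs, deliver when running at full efficiency for a day.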

The Chinchilla paper (DeepMind, 2022) found that most large models are over-parameterised and under-trained. The compute-optimal training dataset size is approximately 20× the number of parameters. GPT-3 (175B params) would need ~3.5T tokens — it was only trained on 300B.

Model	# Parameters	Compute-optimal tokens (~20×)	Actual tokens
Chinchilla	70B	~1.4T	1.4T ✓
LLaMA-65B	65B	~1.3T	1.4T ≈ ✓
GPT-3	175B	~3.5T	300B — under-trained
OPT-175B	175B	~3.5T	180B — under-trained
BLOOM	176B	~3.5T	350B — under-trained
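The table follows from the 20× rule of thumb. A short sketch that recomputes it, using only the parameter and token counts listed above (the ratio column is derived, not taken from the paper):

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above

# name: (parameters, tokens actually used for training)
models = {
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-65B": (65e9, 1.4e12),
    "GPT-3": (175e9, 300e9),
    "OPT-175B": (175e9, 180e9),
    "BLOOM": (176e9, 350e9),
}

def compute_optimal_tokens(num_params):
    """Approximate compute-optimal training tokens for a model size."""
    return TOKENS_PER_PARAM * num_params

def training_ratio(num_params, actual_tokens):
    """Fraction of the compute-optimal token count actually used."""
    return actual_tokens / compute_optimal_tokens(num_params)

for name, (params, tokens) in models.items():
    optimal = compute_optimal_tokens(params)
    ratio = training_ratio(params, tokens)
    print(f"{name:<11} optimal ~{optimal / 1e12:.2f}T tokens, used {ratio:.0%} of that")
```

A ratio near or above 100% marks a model trained close to the compute-optimal point; the three ~175B models all sit at 10% or below.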

Compute-optimal Chinchilla models outperform larger but under-trained models on a wide range of downstream evaluation tasks. Smaller, well-trained models can outperform larger ones.

11 Pre-training for Domain Adaptation

Existing LLMs may not suit specialised domains like legal (rare vocabulary used in different context than general understanding) or medicine (abundant abbreviations and uncommon terms).

BloombergGPT — a large decoder-only model pre-trained for finance. Trained on 51% financial data (news, reports, market data) and 49% public data. Used Chinchilla scaling laws as guidance. Target: 50B params × 1.4T tokens. Reality: 700B tokens acquired, training stopped at 569B due to data scarcity — a good illustration of real-world pre-training tradeoffs.

BloombergGPT demonstrates that domain-specific pre-training is powerful but constrained in practice. The team couldn't acquire the compute-optimal 1.4T finance tokens, so early stopping was necessary — trading compute-optimality for domain coverage.

12 Key Papers & Resources

Attention is All You Need — the 2017 paper that introduced the Transformer architecture
Language Models are Few-Shot Learners — the GPT-3 paper on few-shot learning
Training Compute-Optimal Large Language Models — the Chinchilla paper by DeepMind
BloombergGPT: A Large Language Model for Finance
Scaling Laws for Neural Language Models — OpenAI empirical study
HuggingFace Tasks and Model Hub — practical resources for all ML tasks
LLaMA: Open and Efficient Foundation Language Models — Meta AI, 13B outperforms GPT-3 175B on most benchmarks