Exploration

The GenAI Lab

A living workspace for agents, skills, LLM concepts, tools, research digests, and model comparisons. Updated as I learn.

Agent Skills


What are Skills?

Reusable instructions for AI agents

A skill is a markdown file — SKILL.md — that gives Claude (or any LLM agent) a precise, reusable set of instructions for a specific task. Think of it as a structured prompt template that lives in your repo and travels with your code.

Instead of re-explaining context every time, you drop a skill into your workflow and the agent knows exactly what to do — which tools to call, how to format output, what constraints apply.

Portable by design: Each skill is a plain markdown file that is version-controlled, shareable, and model-agnostic.
Composable with MCP & sub-agents: Skills plug into Claude's tool-use layer, so you can combine multiple skills in one agentic pipeline.
Provider-agnostic structure: The library is organised by provider (Anthropic, OpenAI, and custom), so you can mix and match.
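
To make the idea concrete, here is a minimal Python sketch of how a skill file might be read and folded into an agent's system prompt. It is an illustration under assumptions, not the loader any particular agent uses: the function names `load_skill` and `build_system_prompt`, the `---` front-matter header with `name` and `description` fields, and the example skill text are all hypothetical.

```python
from pathlib import Path
import tempfile


def load_skill(path: Path) -> dict:
    """Read a SKILL.md file and split an optional front-matter header from the body."""
    text = path.read_text(encoding="utf-8")
    meta = {}
    body = text
    # Assumed layout: a header delimited by '---' lines holding 'key: value' pairs.
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    meta[key.strip()] = value.strip()
            body = rest
    return {"meta": meta, "instructions": body.strip()}


def build_system_prompt(base: str, skills: list) -> str:
    """Append each skill's instructions to a base prompt under its own heading."""
    parts = [base]
    for skill in skills:
        name = skill["meta"].get("name", "unnamed-skill")
        parts.append(f"## Skill: {name}\n{skill['instructions']}")
    return "\n\n".join(parts)


# Hypothetical skill content, written to a temporary directory for the demo.
EXAMPLE_SKILL = (
    "---\n"
    "name: changelog\n"
    "description: Write changelog entries\n"
    "---\n"
    "# Steps\n"
    "1. Read the diff.\n"
    "2. Summarise user-facing changes.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    skill_path = Path(tmp) / "SKILL.md"
    skill_path.write_text(EXAMPLE_SKILL, encoding="utf-8")
    skill = load_skill(skill_path)

prompt = build_system_prompt("You are a helpful agent.", [skill])
print(prompt)
```

Because the skill is just a file, the same loader works for one skill or several: pass a longer list to `build_system_prompt` to compose them.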

Skill Library

LLM Concepts

Click any concept to read a dedicated page

Foundation & Training

Prompt engineering

Zero-shot, few-shot, chain-of-thought, system prompts

Fine-tuning

LoRA, RLHF, instruction tuning, dataset curation

Tokenization & embeddings

BPE, token limits, vector spaces, semantic similarity

Transformer architecture

Attention, MLP layers, positional encoding, KV cache

Pretraining vs post-training

Pretraining, SFT, RLHF, DPO — how each shapes behavior

Memory & Retrieval

RAG

Chunking, vector DBs, hybrid search, re-ranking

Context window management

Token limits, lost-in-the-middle, chunking strategies

Memory & state for agents

Short-term, long-term, episodic, semantic memory patterns

Inference & Output

Inference & sampling

Temperature, top-p, top-k, greedy vs beam search

Structured outputs

JSON mode, tool use, function calling, Pydantic validation

Hallucination & grounding

Causes, detection, mitigation, citation strategies

Multimodal inputs

Vision, audio, PDF — how LLMs process non-text inputs

Evaluation & Production

Evals & benchmarking

MMLU, HumanEval, LLM-as-judge, custom evals

Cost & latency optimization

Caching, quantization, batching, prompt compression

Safety, alignment & guardrails

Constitutional AI, RLHF, prompt injection, jailbreaks

Agentic Systems

Agent architectures

ReAct, tool use, planning loops, multi-agent coordination

MCP

Tool registration, server-client model, skill integration

Agent skills

SKILL.md pattern, composability, sub-agents, reliability

Tool Stack

Research Digest

Papers I read, summarised as blogs or decks
arXiv · 2605.25188
Blog · May 2025

DarkForest: Reducing Debate in Multi-Agent LLMs

How belief aggregation without inter-agent debate yields 30.7% accuracy gains and 6.5× token savings across six reasoning tasks.

arXiv · DeepSeek-R1
Blog · Feb 2025

GRPO: Group Relative Policy Optimisation for LLMs

A breakdown of DeepSeek-R1's training recipe — how reward shaping and GRPO replace RLHF for reasoning model training.
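
The "group relative" part of GRPO can be sketched in a few lines: sample several completions for the same prompt, score each one, and normalise every reward against its own group's mean and standard deviation, so no separate value model is needed to estimate a baseline. The snippet below is a minimal illustration rather than DeepSeek's training code; the function name, the `eps` term, the use of the population standard deviation, and the sample rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps keeps the division defined when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, which is the signal the policy update then uses.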


More digests coming soon

I'm reading and summarising papers on attention mechanisms, MoE architectures, and agentic planning. Check back regularly.

LLM Benchmarks

A plain-English guide to how AI models are tested
Reference Guide · 2025–2026
LLM Benchmarks Decoded
Covers what each benchmark measures, how it's set up, what the scores mean, and where they can be gamed.
22 Benchmarks · 9 Themes · Updated June 2026

Data Visualization

Capability rings — every model across every category

Model Landscape

Live LiveBench leaderboard

A live model leaderboard built from LiveBench benchmark CSVs. Pick a category to drill into individual benchmarks, or select multiple dates to see how models move over time.

Latest models by thinking effort
Each family's strongest current model, compared by overall score at the same reasoning-effort setting.
Family leaderboards
Latest model versions in each family, ranked by overall score. Inner ring = weaker, outer ring = stronger; darker shades mark the more capable models.
Date | Model | Overall | Reasoning | Coding | Agentic | Math | Data | Language | Instruction

What the benchmark says right now

The simple takeaway
A generated summary of the current leaders and what each is good at.
Important caveat
LiveBench scores are useful but not a universal answer to "which model is best?" A model that wins on math may not fit cost-sensitive summarization, low-latency chat, safety review, or tool-calling workflows.


Live data from the LiveBench public CSV releases. Averages are simple arithmetic means computed in-browser for exploration and may differ from LiveBench's official aggregation.
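
As a rough illustration of the averaging described above, the sketch below computes a simple arithmetic mean across category columns for each model and ranks the results. It is written in Python for readability, whereas the page itself computes this in the browser. The column names, the sample rows, and the choice to skip empty cells are all assumptions, and the real LiveBench CSV layout may differ.

```python
import csv
import io
from statistics import mean

# Made-up sample data; not real LiveBench scores or column names.
SAMPLE_CSV = """model,reasoning,coding,math
model-a,70.0,60.0,80.0
model-b,50.0,80.0,
"""


def overall_scores(csv_text):
    """Return {model: arithmetic mean of its non-empty category scores}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    results = {}
    for row in reader:
        # Skip the identifier column and any missing cells (an assumption).
        scores = [
            float(value)
            for column, value in row.items()
            if column != "model" and value not in (None, "")
        ]
        if scores:
            results[row["model"]] = mean(scores)
    return results


scores = overall_scores(SAMPLE_CSV)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

An unweighted mean treats every category equally and ignores how many tasks sit behind each column, which is one reason a figure computed this way can differ from an official aggregate.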