Exploration

The GenAI Lab

A living workspace for agents, skills, LLM concepts, tools, research digests, and model comparisons. Updated as I learn.

Agent Skills


What are Skills?

Reusable instructions for AI agents

A skill is a markdown file — SKILL.md — that gives Claude (or any LLM agent) a precise, reusable set of instructions for a specific task. Think of it as a structured prompt template that lives in your repo and travels with your code.

Instead of re-explaining context every time, you drop a skill into your workflow and the agent knows exactly what to do — which tools to call, how to format output, what constraints apply.

Portable by design

Each skill is a plain markdown file: version-controlled, shareable, and model-agnostic.

Composable with MCP & sub-agents

Skills plug into Claude's tool-use layer, so you can combine multiple skills in one agentic pipeline.

Provider-agnostic structure

The library is organised by provider (Anthropic, OpenAI, and custom) so you can mix and match.
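To make the idea concrete, here is a minimal sketch of how a skill file could be loaded and handed to an agent. It assumes a SKILL.md-style document with a simple `key: value` front matter block (`name`, `description`) followed by the instructions. The `Skill` class, `parse_skill`, `build_system_prompt`, and the example skill itself are hypothetical names made up for illustration, not part of any provider's SDK.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    instructions: str


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md-style document into front matter and body.

    Assumes a simple 'key: value' front matter block delimited by '---' lines.
    This is an illustrative parser, not a full YAML implementation.
    """
    lines = text.strip().splitlines()
    meta = {}
    body_start = 0
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                body_start = i + 1  # body begins after the closing delimiter
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[body_start:]).strip()
    return Skill(meta.get("name", "unnamed"), meta.get("description", ""), body)


def build_system_prompt(base: str, skills: list[Skill]) -> str:
    """Append each skill's instructions to a base system prompt."""
    sections = [base]
    for skill in skills:
        sections.append(f"## Skill: {skill.name}\n{skill.description}\n\n{skill.instructions}")
    return "\n\n".join(sections)


# Hypothetical skill content, invented for this example.
EXAMPLE = """---
name: changelog-writer
description: Draft a changelog entry from a list of merged pull requests.
---
# Changelog writer

1. Group changes under Added, Changed, and Fixed.
2. Write one line per change, in the imperative mood.
"""

skill = parse_skill(EXAMPLE)
prompt = build_system_prompt("You are a helpful coding agent.", [skill])
```

Because the skill is plain text, the same file can be checked into a repo and passed to whichever model client you use; only the last step (sending `prompt` to a model) is provider-specific.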

Read the course notes →

Agent Skills with Anthropic

Open blog post

Skill Library

LLM Concepts

Click any concept to read a dedicated page

Foundation & Training

Prompt engineering

Zero-shot, few-shot, chain-of-thought, system prompts

Fine-tuning

LoRA, RLHF, instruction tuning, dataset curation

Tokenization & embeddings

BPE, token limits, vector spaces, semantic similarity

Transformer architecture

Attention, MLP layers, positional encoding, KV cache

Pretraining vs post-training

Pretraining, SFT, RLHF, DPO — how each shapes behavior

Memory & Retrieval

RAG

Chunking, vector DBs, hybrid search, re-ranking

Context window management

Token limits, lost-in-the-middle, chunking strategies

Memory & state for agents

Short-term, long-term, episodic, semantic memory patterns

Inference & Output

Inference & sampling

Temperature, top-p, top-k, greedy vs beam search

Structured outputs

JSON mode, tool use, function calling, Pydantic validation

Hallucination & grounding

Causes, detection, mitigation, citation strategies

Multimodal inputs

Vision, audio, PDF — how LLMs process non-text inputs

Evaluation & Production

Evals & benchmarking

MMLU, HumanEval, LLM-as-judge, custom evals

Cost & latency optimization

Caching, quantization, batching, prompt compression

Safety, alignment & guardrails

Constitutional AI, RLHF, prompt injection, jailbreaks

Agentic Systems

Agent architectures

ReAct, tool use, planning loops, multi-agent coordination

MCP

Tool registration, server-client model, skill integration

Agent skills

SKILL.md pattern, composability, sub-agents, reliability

Tool Stack

AI Assistants

Claude

Anthropic's flagship model — excellent for coding, analysis, long-context reasoning, and agentic workflows. Claude Sonnet 4 is my primary work environment.

ChatGPT

OpenAI's consumer AI — GPT-4o and o3 models with voice, image generation, browsing, code interpreter, and custom GPTs.

Gemini

Google's multimodal AI — Gemini 2.5 Pro with 1M token context, Deep Research, and tight integration with Google Workspace.

DeepSeek

Chinese open-weight frontier model — DeepSeek-R2 with exceptional reasoning and coding at a fraction of the cost of competitors.

Grok

xAI's model with real-time X (Twitter) data access, Grok 3 with deep reasoning mode, and image generation via Aurora.

Le Chat (Mistral)

Mistral AI's chat interface — fast, European, and multilingual. Mistral Large 2 and the open-source Mistral 7B family.

Perplexity AI

AI-native search engine with real-time web retrieval. Great for research, citations, and grounded answers with source links.

HuggingChat

Open-source chat interface by Hugging Face — run Llama 3.3, Mistral, Qwen, and Command R+ for free with web search.

Frameworks

LangChain

The most popular LLM application framework — chains, retrievers, memory, and 100+ integrations. Foundation for most production RAG pipelines.

LlamaIndex

Data framework for connecting private data to LLMs — indexing, querying, and 100+ data connectors. Best for RAG-heavy applications.

Hugging Face

Open-source model hub, Transformers library, PEFT, datasets, and Spaces — the central ecosystem for open-weight AI development.

PydanticAI

Type-safe agent framework from the Pydantic team — structured outputs, dependency injection, and model-agnostic agent layer with validation built in.

Vercel AI SDK

TypeScript-first SDK for building AI-powered web apps — streaming, tool use, React hooks, and support for all major LLM providers.

Agentic Frameworks

LangGraph

Graph-based stateful agent framework — nodes, edges, and checkpointing for complex multi-step reasoning, human-in-the-loop, and production workflows.

CrewAI

Role-based multi-agent collaboration — define crews of agents with tasks and goals. Adopted by 60%+ of Fortune 500 for enterprise agent deployments.

AutoGen (Microsoft)

Microsoft Research's multi-agent conversation framework — agents debate, collaborate, and review each other's outputs. Best for research-grade workflows.

OpenAI Agents SDK

Production-grade replacement for Swarm — explicit agent handoffs, built-in tracing, and first-party tool integrations for OpenAI-native deployments.

Google ADK

Google's Agent Development Kit — hierarchical agent trees, native Vertex AI and Gemini integration, and A2A protocol for cross-framework agent communication.

Smolagents (HF)

Hugging Face's lightweight agent library — minimal code, code-writing agents, and model-agnostic tool use. Great for fast prototyping.

Local Hosting

Ollama

The developer standard for local LLMs — one command to pull and run Llama 3.3, Qwen3, Mistral, DeepSeek, and more with an OpenAI-compatible API.

LM Studio

Best GUI for local LLMs — visual model browser, download manager, and a local server that mimics OpenAI's API. Excellent on Apple Silicon.

Jan

100% offline desktop AI — clean ChatGPT-style UI, zero telemetry, and agentic Project workspaces with Browser MCP for fully private AI workflows.

Open WebUI

Self-hosted ChatGPT-style interface wrapping Ollama or any OpenAI-compatible backend — multi-user, RAG, plugins, and conversation history.

vLLM

High-throughput inference engine for production — PagedAttention for GPU memory efficiency, best-in-class tool calling, and OpenAI-compatible serving at scale.

Blogs

Lil'Log — Lilian Weng

Deep-dive survey posts by ex-OpenAI VP of Research — transformers, diffusion models, RL, and agent memory. Essential for conceptual depth.

Simon Willison's Blog

Hands-on LLM experiments, tool use, and prompt injection research — one of the best builder-focused technical voices in the GenAI space.

Ahead of AI — Raschka

Sebastian Raschka's newsletter — LLM architecture comparisons, training deep dives, and rigorous code-first explanations of cutting-edge research.

Chip Huyen's Blog

Real-world MLOps and production ML systems — AI engineering patterns, inference optimization, and hard-won lessons from building LLM products.

Eugene Yan's Blog

Applied ML, recsys, and LLM engineering from Amazon Principal Scientist — bridges academia and product with practical patterns for shipping AI systems.

Videos

Andrej Karpathy

Zero-to-hero neural network series, GPT from scratch, and state-of-LLM lectures — the single best deep learning resource on YouTube.

Lex Fridman

Long-form interviews with Sam Altman, Ilya Sutskever, Yann LeCun, and other AI leaders — philosophy, research, and the future of intelligence.

3Blue1Brown

Visual deep learning — animated explanations of neural networks, attention, and transformers that make abstract math intuitive and memorable.

Yannic Kilcher

In-depth ML paper walkthroughs — transformers, reasoning models, and RL explained with whiteboard-level rigour for serious practitioners.

Two Minute Papers

Fast, engaging breakdowns of the latest AI research papers — Károly Zsolnai-Fehér distills key ideas from complex work into digestible 5-minute videos.

Courses

Prompt Engineering for Devs

Andrew Ng & Isa Fulford's free short course — iterative prompt development, summarisation, inference, transformation, and chatbot building with the OpenAI API.

Retrieval Augmented Generation

End-to-end RAG course — chunking, vector databases, hybrid search, reranking, evaluation, and production deployment. ~5hrs/week for one month.

Agentic RAG with LlamaIndex

Build routers, tool-calling agents, and multi-document research assistants with LlamaIndex. Taught by Jerry Liu, LlamaIndex co-founder and CEO.

LangChain: Chat with Your Data

Learn document loading, splitting, embeddings, vector stores, retrieval, and Q&A chains using LangChain — a hands-on intro to RAG applications.

Deep Learning Specialization

Andrew Ng's 5-course foundational program — neural networks, CNNs, sequence models, and structuring ML projects. The canonical entry point to deep learning.

Generative AI for Everyone

Andrew Ng's non-technical GenAI course — how LLMs work, prompt engineering, real-world applications, and responsible AI use. Rated 4.8★ with 3300+ reviews.

Web Search APIs

Tavily

Purpose-built search API for AI agents — relevance filtering, answer extraction, and structured results ready for LLM consumption. 1M+ downloads. Acquired by Nebius (2026).

Exa AI

Neural semantic search API for agents — find pages by meaning rather than keywords, with full content extraction. Strong for technical doc retrieval.

Brave Search API

Independent search index — not a Google/Bing wrapper. Top scorer in 2026 agentic benchmarks (14.89). LLM Context API optimised for agent consumption.

Perplexity Sonar API

Perplexity's API for grounded LLM answers — Sonar and Sonar Pro return LLM-synthesised responses with cited sources. Best for factual Q&A agents.

Serper

Google Search API wrapper — fast, cheap at scale ($0.30–$1/1k queries), JSON results, and a 2500-query free tier. Great for cost-sensitive high-volume agents.

Firecrawl

Web scraping and crawling API for LLMs — turns any URL into clean markdown, handles JS-rendered pages, and pairs with search APIs for full-content agent pipelines.

Research Digest

Papers I read, summarised as blogs or decks

Blog May 2025

DarkForest: Reducing Debate in Multi-Agent LLMs

How belief aggregation without inter-agent debate yields 30.7% accuracy gains and 6.5× token savings across six reasoning tasks.

Blog Feb 2025

GRPO: Group Relative Policy Optimisation for LLMs

A breakdown of DeepSeek-R1's training recipe — how reward shaping and GRPO replace RLHF for reasoning model training.


More digests coming soon

I'm reading and summarising papers on attention mechanisms, MoE architectures, and agentic planning. Check back regularly.

LLM Benchmarks

A plain-English guide to how AI models are tested

Reference Guide · 2025–2026

LLM Benchmarks Decoded

A plain-English guide to how AI models are tested: what each benchmark measures, how it's set up, what the scores mean, and where they can be gamed.

22 Benchmarks · 9 Themes · Updated June 2026

Read the guide

Data Visualization

Capability rings — every model across every category

Model Landscape

Live LiveBench leaderboard

A live model leaderboard built from LiveBench benchmark CSVs. Pick a category to drill into individual benchmarks, or select multiple dates to see how models move over time.

Top Overall Model

Top Coding Model

Top Data Analysis Model

Top Agentic Coding Model

Top Models by Category


Parallel coordinates by benchmark metric

Individual benchmark axes for the selected category. The vertical scale auto-fits the visible data rather than starting at zero.
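A rough sketch of what "auto-fit" means for an axis, assuming the bounds are taken from the minimum and maximum of the visible values plus a small margin. The function name `fit_axis` and the 5% padding are assumptions for illustration; the chart on this page may compute its bounds differently.

```python
def fit_axis(values, padding_ratio=0.05):
    """Return (low, high) axis bounds that hug the data instead of starting at zero."""
    if not values:
        raise ValueError("need at least one value")
    low, high = min(values), max(values)
    span = high - low
    # If every value is identical, fall back to a fixed margin so the axis has height.
    pad = span * padding_ratio if span > 0 else 1.0
    return low - pad, high + pad


# Made-up scores, for illustration only.
low, high = fit_axis([60.0, 80.0])
```

The trade-off is worth keeping in mind when reading the chart: a fitted axis makes close scores easier to tell apart, but it also makes small gaps look larger than they would on a zero-based scale.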


Latest models by thinking effort

Each family's strongest current model, compared by overall score at the same reasoning-effort setting.

Family leaderboards

Latest model versions in each family, ranked by overall score. Inner ring = weaker, outer ring = stronger; darker shades mark the more capable models.

Date	Model	Overall	Reasoning	Coding	Agentic	Math	Data	Language	Instruction

What the benchmark says right now


The simple takeaway

After loading, this summarizes the current leaders and what they are good at.

Important caveat

LiveBench scores are useful but not a universal answer to "which model is best?" A model that wins on math may not fit cost-sensitive summarization, low-latency chat, safety review, or tool-calling workflows.


Live data from the LiveBench public CSV releases. Averages are simple arithmetic means computed in-browser for exploration and may differ from LiveBench's official aggregation.
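As an illustration of the "simple arithmetic mean" mentioned above, the sketch below averages per-category scores for each model from CSV text. The column names mirror the table header on this page, but the CSV layout, the model names, and every number are made up for the example. Skipping blank cells is also an assumption, and is one way a simple average can drift from an official aggregate.

```python
import csv
import io
from statistics import mean

CATEGORIES = ["Reasoning", "Coding", "Agentic", "Math", "Data", "Language", "Instruction"]

# Hypothetical rows; real LiveBench releases may be laid out differently.
SAMPLE_CSV = """Date,Model,Reasoning,Coding,Agentic,Math,Data,Language,Instruction
2026-01-01,model-a,70,80,60,90,75,65,85
2026-01-01,model-b,50,,40,60,55,45,50
"""


def overall_scores(csv_text):
    """Map each model to the plain mean of its non-empty category scores."""
    scores = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Blank cells are skipped rather than counted as zero (an assumption).
        values = [float(row[c]) for c in CATEGORIES if row.get(c)]
        if values:
            scores[row["Model"]] = mean(values)
    return scores


scores = overall_scores(SAMPLE_CSV)
```

Here `model-b` is averaged over six categories instead of seven because its Coding cell is empty, which shows why the handling of missing values matters when comparing against a published overall score.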