Exploration

The GenAI Lab

A living workspace for agents, skills, LLM concepts, tools, research digests, and model comparisons. Updated as I learn.

Agent Skills


What are Skills?

Reusable instructions for AI agents

A skill is a markdown file — SKILL.md — that gives Claude (or any LLM agent) a precise, reusable set of instructions for a specific task. Think of it as a structured prompt template that lives in your repo and travels with your code.

Instead of re-explaining context every time, you drop a skill into your workflow and the agent knows exactly what to do — which tools to call, how to format output, what constraints apply.

Portable by design: Each skill is a plain markdown file that is version-controlled, shareable, and model-agnostic.
Composable with MCP & sub-agents: Skills plug into Claude's tool-use layer, so you can combine multiple skills in one agentic pipeline.
Provider-agnostic structure: The library is organised by provider (Anthropic, OpenAI, and custom), so you can mix and match.
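
As a rough illustration of the idea above, here is a minimal sketch of how a skill file might be folded into an agent's system prompt. It assumes a skill is plain markdown whose first `# ` heading is its name and whose remaining text is the instructions. The names `Skill`, `parse_skill`, and `build_system_prompt` and the sample skill text are hypothetical, invented for this example; they are not part of any provider's API, and real SKILL.md files may carry extra metadata that this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Skill:
    name: str
    instructions: str


def parse_skill(markdown_text: str) -> Skill:
    """Split skill markdown into a name (first '# ' heading) and a body.

    Assumption: the first level-one heading is the skill name. Everything
    else is treated as free-form instructions.
    """
    name: Optional[str] = None
    body_lines: List[str] = []
    for line in markdown_text.strip().splitlines():
        if name is None and line.startswith("# "):
            name = line[2:].strip()
        else:
            body_lines.append(line)
    return Skill(name=name or "unnamed-skill",
                 instructions="\n".join(body_lines).strip())


def build_system_prompt(base_prompt: str, skills: List[Skill]) -> str:
    """Append each skill as its own labelled section after the base prompt."""
    sections = [base_prompt.strip()]
    for skill in skills:
        sections.append(f"## Skill: {skill.name}\n{skill.instructions}")
    return "\n\n".join(sections)


# Hypothetical skill text, standing in for the contents of a SKILL.md file.
SAMPLE_SKILL = """# Changelog writer

Summarise commits as bullet points.
Keep each bullet under 20 words.
"""

skill = parse_skill(SAMPLE_SKILL)
print(build_system_prompt("You are a helpful agent.", [skill]))
```

Because the skill is just text, the same file can be reused with any model that accepts a system prompt, which is the portability point made above.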

Skill Library

LLM Concepts

Click any concept to read a dedicated page

Foundation & Training

Prompt engineering

Zero-shot, few-shot, chain-of-thought, system prompts

Fine-tuning

LoRA, RLHF, instruction tuning, dataset curation

Tokenization & embeddings

BPE, token limits, vector spaces, semantic similarity

Transformer architecture

Attention, MLP layers, positional encoding, KV cache

Pretraining vs post-training

Pretraining, SFT, RLHF, DPO — how each shapes behavior

Memory & Retrieval

RAG

Chunking, vector DBs, hybrid search, re-ranking

Context window management

Token limits, lost-in-the-middle, chunking strategies

Memory & state for agents

Short-term, long-term, episodic, semantic memory patterns

Inference & Output

Inference & sampling

Temperature, top-p, top-k, greedy vs beam search

Structured outputs

JSON mode, tool use, function calling, Pydantic validation

Hallucination & grounding

Causes, detection, mitigation, citation strategies

Multimodal inputs

Vision, audio, PDF — how LLMs process non-text inputs

Evaluation & Production

Evals & benchmarking

MMLU, HumanEval, LLM-as-judge, custom evals

Cost & latency optimization

Caching, quantization, batching, prompt compression

Safety, alignment & guardrails

Constitutional AI, RLHF, prompt injection, jailbreaks

Agentic Systems

Agent architectures

ReAct, tool use, planning loops, multi-agent coordination

MCP

Tool registration, server-client model, skill integration

Agent skills

SKILL.md pattern, composability, sub-agents, reliability

Tool Stack

Resources, platforms & tools I use and explore
🧠
Documentation Anthropic Docs

Official docs for Claude — API reference, prompt engineering guides, model cards, and skill patterns.

Documentation OpenAI Platform

GPT-4o, Assistants API, embeddings, fine-tuning, and function calling — OpenAI's developer hub.

🎓
Learning DeepLearning.AI

Short courses on GenAI, agents, RAG, LangChain, and more from Andrew Ng and leading AI labs.

🤗
Model Hub Hugging Face

Open-source models, datasets, Spaces demos, and the Transformers library for hands-on experimentation.

📄
Research arXiv — AI/ML

Pre-print server for AI research papers. Essential for staying current on LLM advances, agent frameworks, and benchmarks.

🏟️
Benchmarking LM Arena

Chatbot Arena by LMSYS: head-to-head model comparisons with Elo ratings from human preference votes.

🔗
Framework LangChain

Framework for building LLM applications — chains, agents, retrievers, memory, and integrations with 100+ tools.

🦙
Framework LlamaIndex

Data framework for LLM applications — indexing, querying, and connecting private data to large language models.

✍️
Blog / Reading Simon Willison's Blog

Hands-on LLM experiments, tool use, prompt injection research, and GenAI product analysis — one of the best technical voices.

📺
Video / Learning Andrej Karpathy

Zero-to-hero neural network series, GPT from scratch, and state-of-LLM lectures — essential deep learning content.

💻
Research Papers With Code

ML papers with code implementations, state-of-the-art benchmarks, and datasets — bridging research and practice.

AI Assistant Claude (claude.ai)

My primary AI work environment — coding, analysis, writing, and building skills and agentic workflows with Claude Sonnet.

Research Digest

Papers I read, summarised as blogs or decks
arXiv · 2605.25188
Blog May 2025

DarkForest: Reducing Debate in Multi-Agent LLMs

How belief aggregation without inter-agent debate yields 30.7% accuracy gains and 6.5× token savings across six reasoning tasks.

arXiv · DeepSeek-R1
Blog Feb 2025

GRPO: Group Relative Policy Optimisation for LLMs

A breakdown of DeepSeek-R1's training recipe: how rule-based reward shaping and GRPO take over much of the role that PPO-style RLHF usually plays in reasoning model training.

📑

More digests coming soon

I'm reading and summarising papers on attention mechanisms, MoE architectures, and agentic planning. Check back regularly.

Model Landscape

Live LiveBench leaderboard, fetched in your browser

A live model leaderboard built from LiveBench benchmark CSVs. The script fetches every public release, adds a _date column, computes simple category averages, and renders them as parallel-coordinate and trend charts. It loads automatically — pick a category to drill into individual benchmarks, or select multiple dates to see how models move over time.
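
The pipeline described above can be sketched roughly as follows. This is not the page's actual script (which runs in the browser); it is a small Python approximation that works on in-memory CSV text so it runs without network access. The column names (`reasoning_a`, `coding_a`, and so on), the category mapping, the release dates, and the scores are all made up for illustration, and the real LiveBench CSV layout may differ.

```python
import csv
import io
from statistics import mean
from typing import Dict, List

# Hypothetical mapping from broad category to benchmark columns.
CATEGORIES: Dict[str, List[str]] = {
    "reasoning": ["reasoning_a", "reasoning_b"],
    "coding": ["coding_a"],
}


def load_release(csv_text: str, release_date: str) -> List[dict]:
    """Parse one release CSV and tag every row with a _date column."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {"model": raw["model"], "_date": release_date}
        for key, value in raw.items():
            if key == "model":
                continue
            try:
                row[key] = float(value)
            except (TypeError, ValueError):
                continue  # skip blank or non-numeric cells
        rows.append(row)
    return rows


def add_category_averages(row: dict, categories: Dict[str, List[str]]) -> dict:
    """Add one simple arithmetic mean per category, over the columns present."""
    out = dict(row)
    for category, columns in categories.items():
        scores = [row[c] for c in columns if c in row]
        if scores:
            out[category] = mean(scores)
    return out


def combine_releases(releases: Dict[str, str],
                     categories: Dict[str, List[str]]) -> List[dict]:
    """Combine every release into one list of model-date records."""
    combined = []
    for release_date, csv_text in sorted(releases.items()):
        for row in load_release(csv_text, release_date):
            combined.append(add_category_averages(row, categories))
    return combined


# Made-up sample data standing in for two fetched releases.
SAMPLE_RELEASES = {
    "2025-01-01": "model,reasoning_a,reasoning_b,coding_a\nalpha,40,60,50\nbeta,30,,70\n",
    "2025-02-01": "model,reasoning_a,reasoning_b,coding_a\nalpha,50,70,55\n",
}

for record in combine_releases(SAMPLE_RELEASES, CATEGORIES):
    print(record["_date"], record["model"], record.get("reasoning"), record.get("coding"))
```

Missing cells are skipped rather than counted as zero, so a category mean is taken only over the benchmarks a model actually has scores for. That is one reasonable choice among several and is an assumption of this sketch.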

Rows loaded: Model-date records after combining every release.
Models in view: Unique models matching the active filters.
Top model: Based on selected category average.
Benchmark axes: Metric columns used in the parallel chart.
Parallel profile by broad category
Compresses many benchmark columns into simple buckets: reasoning, coding, language, data transformation, and spatial/logical tasks. Model names sit at each end of every line; hover a line to focus it. Pick a category above to drill into its individual axes.
Trend over release dates
For models that appear across multiple releases, this shows whether their selected-category score improved, held steady, or declined over time.
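
One simple way to label a model's trend is sketched below, under stated assumptions: it compares only the earliest and latest selected release, and it treats any change within a small `tolerance` as "held steady". The function name `classify_trend`, the tolerance value, and the sample scores are hypothetical; the page's own chart may use a different rule.

```python
from typing import Dict


def classify_trend(scores_by_date: Dict[str, float], tolerance: float = 0.5) -> str:
    """Label a score series as improved, held steady, or declined.

    Keys are ISO dates (YYYY-MM-DD), which sort chronologically as strings.
    Only the first and last dates are compared in this sketch.
    """
    if len(scores_by_date) < 2:
        return "insufficient data"  # a trend needs at least two releases
    dates = sorted(scores_by_date)
    delta = scores_by_date[dates[-1]] - scores_by_date[dates[0]]
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "declined"
    return "held steady"


# Made-up category scores for one model across three releases.
print(classify_trend({"2025-01-01": 50.0, "2025-02-01": 54.0, "2025-03-01": 60.0}))
```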
Tick at least two release dates above to enable this chart.
Table columns: Date, Model, Category, Overall, Reasoning, Coding, Language, Data

What the benchmark says right now

The simple takeaway
After loading, this summarizes the current leaders and what they are good at.
Important caveat
LiveBench scores are useful but not a universal answer to "which model is best?" A model that wins on math may not fit cost-sensitive summarization, low-latency chat, safety review, or tool-calling workflows.


Live data from the LiveBench public CSV releases. Averages are simple arithmetic means computed in-browser for exploration and may differ from LiveBench's official aggregation.