Exploration
The GenAI Lab
A living workspace for agents, skills, LLM concepts, tools, research digests, and model comparisons. Updated as I learn.
Agent Skills
What are Skills?
Reusable instructions for AI agents
A skill is a markdown file — SKILL.md — that gives Claude (or any LLM agent) a precise, reusable set of instructions for a specific task. Think of it as a structured prompt template that lives in your repo and travels with your code.
Instead of re-explaining context every time, you drop a skill into your workflow and the agent knows exactly what to do — which tools to call, how to format output, what constraints apply.
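As a rough illustration of how a skill file could be dropped into a workflow, the sketch below reads SKILL.md text and folds it into a system prompt. The front-matter fields (`name`, `description`), the helper names `parse_skill` and `build_system_prompt`, and the prompt layout are all assumptions made for this example; the text above only says that a skill is a markdown file of instructions. The parser handles simple `key: value` lines and is not a full YAML parser.

```python
def parse_skill(text):
    """Split SKILL.md text into (metadata, body).

    Assumes an optional front-matter block delimited by '---' lines
    containing plain 'key: value' pairs (hypothetical layout).
    """
    metadata, body = {}, text
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:  # only treat it as front matter if the closing '---' exists
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    metadata[key.strip()] = value.strip()
            body = rest
    return metadata, body.strip()


def build_system_prompt(skill_text, base_prompt="You are a helpful agent."):
    """Append the skill's instructions to a base system prompt."""
    metadata, body = parse_skill(skill_text)
    title = metadata.get("name", "unnamed skill")
    return f"{base_prompt}\n\n# Skill: {title}\n{body}"


# Hypothetical skill content, inlined here instead of being read from disk.
example_skill = (
    "---\n"
    "name: changelog-writer\n"
    "description: Drafts release notes\n"
    "---\n"
    "List changes as bullets.\n"
)
print(build_system_prompt(example_skill))
```

In a real repo the string would come from reading the SKILL.md file, and the resulting prompt would be passed to whichever agent framework is in use.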
Skill Library
LLM Concepts
Click any concept to read a dedicated page
Foundation & Training
Zero-shot, few-shot, chain-of-thought, system prompts
LoRA, RLHF, instruction tuning, dataset curation
BPE, token limits, vector spaces, semantic similarity
Attention, MLP layers, positional encoding, KV cache
Pretraining, SFT, RLHF, DPO — how each shapes behavior
Memory & Retrieval
Chunking, vector DBs, hybrid search, re-ranking
Token limits, lost-in-the-middle, chunking strategies
Short-term, long-term, episodic, semantic memory patterns
Inference & Output
Temperature, top-p, top-k, greedy vs beam search
JSON mode, tool use, function calling, Pydantic validation
Causes, detection, mitigation, citation strategies
Vision, audio, PDF — how LLMs process non-text inputs
Evaluation & Production
MMLU, HumanEval, LLM-as-judge, custom evals
Caching, quantization, batching, prompt compression
Constitutional AI, RLHF, prompt injection, jailbreaks
Agentic Systems
ReAct, tool use, planning loops, multi-agent coordination
Tool registration, server-client model, skill integration
SKILL.md pattern, composability, sub-agents, reliability
Tool Stack
Resources, platforms & tools I use and explore
Official docs for Claude: API reference, prompt engineering guides, model cards, and skill patterns.
GPT-4o, Assistants API, embeddings, fine-tuning, and function calling — OpenAI's developer hub.
Short courses on GenAI, agents, RAG, LangChain, and more from Andrew Ng and leading AI labs.
Open-source models, datasets, Spaces demos, and the Transformers library for hands-on experimentation.
Pre-print server for AI research papers. Essential for staying current on LLM advances, agent frameworks, and benchmarks.
Chatbot Arena by LMSYS: head-to-head model comparisons with Elo ratings from human preference votes.
Framework for building LLM applications — chains, agents, retrievers, memory, and integrations with 100+ tools.
Data framework for LLM applications — indexing, querying, and connecting private data to large language models.
Hands-on LLM experiments, tool use, prompt injection research, and GenAI product analysis — one of the best technical voices.
Zero-to-hero neural network series, GPT from scratch, and state-of-LLM lectures — essential deep learning content.
ML papers with code implementations, state-of-the-art benchmarks, and datasets — bridging research and practice.
My primary AI work environment — coding, analysis, writing, and building skills and agentic workflows with Claude Sonnet.
Research Digest
Papers I read, summarised as blogs or decks
DarkForest: Reducing Debate in Multi-Agent LLMs
How belief aggregation without inter-agent debate yields 30.7% accuracy gains and 6.5× token savings across six reasoning tasks.
GRPO: Group Relative Policy Optimisation for LLMs
A breakdown of DeepSeek-R1's training recipe — how reward shaping and GRPO replace RLHF for reasoning model training.
More digests coming soon
I'm reading and summarising papers on attention mechanisms, MoE architectures, and agentic planning. Check back regularly.
Model Landscape
Live LiveBench leaderboard, fetched in your browser
A live model leaderboard built from LiveBench benchmark CSVs. The script fetches every public release, adds a _date column, computes simple category averages, and renders them as parallel-coordinate and trend charts. It loads automatically — pick a category to drill into individual benchmarks, or select multiple dates to see how models move over time.
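The tagging and averaging steps can be sketched in a few lines. This is a standard-library Python illustration, not the page's in-browser script. The column names (`task_r1`, `task_c1`, and so on), the `CATEGORIES` mapping, the sample date, and the choice to compute "Overall" as the mean of the category means are all assumptions for the example; the actual LiveBench CSV layout and aggregation may differ.

```python
import csv
import io
from statistics import mean

# Hypothetical mapping from category name to benchmark columns.
CATEGORIES = {
    "Reasoning": ["task_r1", "task_r2"],
    "Coding": ["task_c1"],
}


def load_release(csv_text, release_date):
    """Parse one release CSV and tag every row with a _date column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["_date"] = release_date
    return rows


def category_averages(row, categories=CATEGORIES):
    """Simple arithmetic mean of the available scores in each category."""
    averages = {}
    for name, columns in categories.items():
        # Skip columns that are missing or empty for this model.
        scores = [float(row[c]) for c in columns if row.get(c) not in (None, "")]
        if scores:
            averages[name] = mean(scores)
    if averages:
        # Assumption: "Overall" is the mean of the category means.
        averages["Overall"] = mean(list(averages.values()))
    return averages


# Made-up sample release with one model.
sample_csv = "model,task_r1,task_r2,task_c1\nm1,50,70,80\n"
for row in load_release(sample_csv, "2024-01-01"):
    print(row["_date"], row["model"], category_averages(row))
```

Skipping empty cells keeps a model that was not run on every benchmark from being scored as zero, at the cost of averaging over fewer tasks for that model.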
| Date | Model | Category | Overall | Reasoning | Coding | Language | Data |
|---|---|---|---|---|---|---|---|
What the benchmark says right now
Live data from the LiveBench public CSV releases. Averages are simple arithmetic means computed in-browser for exploration and may differ from LiveBench's official aggregation.