Exploration
The GenAI Lab
A living workspace for agents, skills, LLM concepts, tools, research digests, and model comparisons. Updated as I learn.
Agent Skills
What are Skills?
Reusable instructions for AI agents
A skill is a markdown file — SKILL.md — that gives Claude (or any LLM agent) a precise, reusable set of instructions for a specific task. Think of it as a structured prompt template that lives in your repo and travels with your code.
Instead of re-explaining context every time, you drop a skill into your workflow and the agent knows exactly what to do — which tools to call, how to format output, what constraints apply.
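As a rough illustration of the idea above, the sketch below parses a skill file and turns it into a system prompt. It is only a sketch under stated assumptions: the front-matter fields (`name`, `description`), the `Skill` dataclass, the helper names, and the example skill text are all hypothetical and not taken from any particular agent runtime.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    instructions: str


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md string into front-matter fields and the instruction body.

    Assumption: front matter is a block of `key: value` lines between two
    `---` fences at the top of the file. Files without it are treated as
    instructions only.
    """
    meta = {}
    body = text
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:  # only treat it as front matter if the closing fence exists
            body = rest
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    meta[key.strip()] = value.strip()
    return Skill(
        name=meta.get("name", "unnamed-skill"),
        description=meta.get("description", ""),
        instructions=body.strip(),
    )


def build_system_prompt(skill: Skill) -> str:
    """Place the skill's instructions in the prompt so the agent sees them before the task."""
    return f"# Skill: {skill.name}\n{skill.description}\n\n{skill.instructions}"


# Hypothetical skill file content, inlined so the example is self-contained.
EXAMPLE_SKILL_MD = """---
name: changelog-writer
description: Summarise merged pull requests into a changelog entry.
---
1. Read the list of merged PR titles.
2. Group them under Added, Changed, and Fixed.
3. Output markdown only, with no commentary.
"""

skill = parse_skill(EXAMPLE_SKILL_MD)
system_prompt = build_system_prompt(skill)
```

In practice the text would be read from a `SKILL.md` in the repo, and `system_prompt` would be passed to whichever LLM client is in use. Keeping parsing separate from prompt assembly makes it easy to combine several skills in one prompt.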
Skill Library
LLM Concepts
Click any concept to read a dedicated page
Foundation & Training
Zero-shot, few-shot, chain-of-thought, system prompts
LoRA, RLHF, instruction tuning, dataset curation
BPE, token limits, vector spaces, semantic similarity
Attention, MLP layers, positional encoding, KV cache
Pretraining, SFT, RLHF, DPO — how each shapes behavior
Memory & Retrieval
Chunking, vector DBs, hybrid search, re-ranking
Token limits, lost-in-the-middle, chunking strategies
Short-term, long-term, episodic, semantic memory patterns
Inference & Output
Temperature, top-p, top-k, greedy vs beam search
JSON mode, tool use, function calling, Pydantic validation
Causes, detection, mitigation, citation strategies
Vision, audio, PDF — how LLMs process non-text inputs
Evaluation & Production
MMLU, HumanEval, LLM-as-judge, custom evals
Caching, quantization, batching, prompt compression
Constitutional AI, RLHF, prompt injection, jailbreaks
Agentic Systems
ReAct, tool use, planning loops, multi-agent coordination
Tool registration, server-client model, skill integration
SKILL.md pattern, composability, sub-agents, reliability
Tool Stack
AI Assistants
Claude
Anthropic's flagship model — excellent for coding, analysis, long-context reasoning, and agentic workflows. Claude Sonnet 4 is my primary work environment.
ChatGPT
OpenAI's consumer AI — GPT-4o and o3 models with voice, image generation, browsing, code interpreter, and custom GPTs.
Gemini
Google's multimodal AI — Gemini 2.5 Pro with 1M token context, Deep Research, and tight integration with Google Workspace.
DeepSeek
Chinese open-weight frontier model — DeepSeek-R2 with exceptional reasoning and coding at a fraction of the cost of competitors.
Grok
xAI's model with real-time X (Twitter) data access, Grok 3 with deep reasoning mode, and image generation via Aurora.
Le Chat (Mistral)
Mistral AI's chat interface — fast, European, and multilingual. Mistral Large 2 and the open-source Mistral 7B family.
Perplexity AI
AI-native search engine with real-time web retrieval. Great for research, citations, and grounded answers with source links.
HuggingChat
Open-source chat interface by Hugging Face — run Llama 3.3, Mistral, Qwen, and Command R+ for free with web search.
Frameworks
The most popular LLM application framework — chains, retrievers, memory, and 100+ integrations. Foundation for most production RAG pipelines.
LlamaIndex
Data framework for connecting private data to LLMs — indexing, querying, and 100+ data connectors. Best for RAG-heavy applications.
Hugging Face
Open-source model hub, Transformers library, PEFT, datasets, and Spaces — the central ecosystem for open-weight AI development.
PydanticAI
Type-safe agent framework from the Pydantic team — structured outputs, dependency injection, and model-agnostic agent layer with validation built in.
Vercel AI SDK
TypeScript-first SDK for building AI-powered web apps — streaming, tool use, React hooks, and support for all major LLM providers.
Agentic Frameworks
Graph-based stateful agent framework — nodes, edges, and checkpointing for complex multi-step reasoning, human-in-the-loop, and production workflows.
CrewAI
Role-based multi-agent collaboration — define crews of agents with tasks and goals. Adopted by 60%+ of Fortune 500 for enterprise agent deployments.
AutoGen (Microsoft)
Microsoft Research's multi-agent conversation framework — agents debate, collaborate, and review each other's outputs. Best for research-grade workflows.
OpenAI Agents SDK
Production-grade replacement for Swarm — explicit agent handoffs, built-in tracing, and first-party tool integrations for OpenAI-native deployments.
Google ADK
Google's Agent Development Kit — hierarchical agent trees, native Vertex AI and Gemini integration, and A2A protocol for cross-framework agent communication.
Smolagents (HF)
Hugging Face's lightweight agent library — minimal code, code-writing agents, and model-agnostic tool use. Great for fast prototyping.
Local Hosting
The developer standard for local LLMs — one command to pull and run Llama 3.3, Qwen3, Mistral, DeepSeek, and more with an OpenAI-compatible API.
LM Studio
Best GUI for local LLMs — visual model browser, download manager, and a local server that mimics OpenAI's API. Excellent on Apple Silicon.
Jan
100% offline desktop AI — clean ChatGPT-style UI, zero telemetry, and agentic Project workspaces with Browser MCP for fully private AI workflows.
Open WebUI
Self-hosted ChatGPT-style interface wrapping Ollama or any OpenAI-compatible backend — multi-user, RAG, plugins, and conversation history.
vLLM
High-throughput inference engine for production — PagedAttention for GPU memory efficiency, best-in-class tool calling, and OpenAI-compatible serving at scale.
Blogs
Deep-dive survey posts by ex-OpenAI VP of Research — transformers, diffusion models, RL, and agent memory. Essential for conceptual depth.
Simon Willison's Blog
Hands-on LLM experiments, tool use, and prompt injection research — one of the best builder-focused technical voices in the GenAI space.
Ahead of AI — Raschka
Sebastian Raschka's newsletter — LLM architecture comparisons, training deep dives, and rigorous code-first explanations of cutting-edge research.
Chip Huyen's Blog
Real-world MLOps and production ML systems — AI engineering patterns, inference optimization, and hard-won lessons from building LLM products.
Eugene Yan's Blog
Applied ML, recsys, and LLM engineering from Amazon Principal Scientist — bridges academia and product with practical patterns for shipping AI systems.
Videos
Zero-to-hero neural network series, GPT from scratch, and state-of-LLM lectures — the single best deep learning resource on YouTube.
Lex Fridman
Long-form interviews with Sam Altman, Ilya Sutskever, Yann LeCun, and other AI leaders — philosophy, research, and the future of intelligence.
3Blue1Brown
Visual deep learning — animated explanations of neural networks, attention, and transformers that make abstract math intuitive and memorable.
Yannic Kilcher
In-depth ML paper walkthroughs — transformers, reasoning models, and RL explained with whiteboard-level rigour for serious practitioners.
Two Minute Papers
Fast, engaging breakdowns of the latest AI research papers — Károly Zsolnai-Fehér distills key ideas from complex work into digestible 5-minute videos.
Courses
Andrew Ng & Isa Fulford's free short course — iterative prompt development, summarisation, inference, transformation, and chatbot building with the OpenAI API.
Retrieval Augmented Generation
End-to-end RAG course — chunking, vector databases, hybrid search, reranking, evaluation, and production deployment. ~5hrs/week for one month.
Agentic RAG with LlamaIndex
Build routers, tool-calling agents, and multi-document research assistants with LlamaIndex. Taught by Jerry Liu, LlamaIndex co-founder and CEO.
LangChain: Chat with Your Data
Learn document loading, splitting, embeddings, vector stores, retrieval, and Q&A chains using LangChain — a hands-on intro to RAG applications.
Deep Learning Specialization
Andrew Ng's 5-course foundational program — neural networks, CNNs, sequence models, and structuring ML projects. The canonical entry point to deep learning.
Generative AI for Everyone
Andrew Ng's non-technical GenAI course — how LLMs work, prompt engineering, real-world applications, and responsible AI use. Rated 4.8★ with 3300+ reviews.
Web Search APIs
Purpose-built search API for AI agents — relevance filtering, answer extraction, and structured results ready for LLM consumption. 1M+ downloads. Acquired by Nebius (2026).
Exa AI
Neural semantic search API for agents — find pages by meaning rather than keywords, with full content extraction. Strong for technical doc retrieval.
Brave Search API
Independent search index — not a Google/Bing wrapper. Top scorer in 2026 agentic benchmarks (14.89). LLM Context API optimised for agent consumption.
Perplexity Sonar API
Perplexity's API for grounded LLM answers — Sonar and Sonar Pro return LLM-synthesised responses with cited sources. Best for factual Q&A agents.
Serper
Google Search API wrapper — fast, cheap at scale ($0.30–$1/1k queries), JSON results, and a 2500-query free tier. Great for cost-sensitive high-volume agents.
Firecrawl
Web scraping and crawling API for LLMs — turns any URL into clean markdown, handles JS-rendered pages, and pairs with search APIs for full-content agent pipelines.
Research Digest
Papers I read, summarised as blogs or decks
DarkForest: Reducing Debate in Multi-Agent LLMs
How belief aggregation without inter-agent debate yields 30.7% accuracy gains and 6.5× token savings across six reasoning tasks.
GRPO: Group Relative Policy Optimisation for LLMs
A breakdown of DeepSeek-R1's training recipe — how reward shaping and GRPO replace RLHF for reasoning model training.
More digests coming soon
I'm reading and summarising papers on attention mechanisms, MoE architectures, and agentic planning. Check back regularly.
LLM Benchmarks
A plain-English guide to how AI models are tested
Data Visualization
Capability rings — every model across every category
Model Landscape
Live LiveBench leaderboard
A live model leaderboard built from LiveBench benchmark CSVs. Pick a category to drill into individual benchmarks, or select multiple dates to see how models move over time.
| Date | Model | Overall | Reasoning | Coding | Agentic | Math | Data | Language | Instruction |
|---|---|---|---|---|---|---|---|---|---|
What the benchmark says right now
Live data from the LiveBench public CSV releases. Averages are simple arithmetic means computed in-browser for exploration and may differ from LiveBench's official aggregation.
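The "simple arithmetic mean" mentioned above can be sketched as follows. This is an illustration of the arithmetic only, written in Python rather than the page's in-browser code. The category names come from the table header, while the function name, the sample scores, and the choice to skip missing categories are assumptions.

```python
from statistics import fmean
from typing import Dict, Optional

# Category columns as they appear in the leaderboard table header.
CATEGORIES = [
    "Reasoning", "Coding", "Agentic", "Math", "Data", "Language", "Instruction",
]


def overall_score(row: Dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the category scores present in one leaderboard row.

    Assumption: categories with no score are skipped rather than counted as 0.
    Returns None when the row has no category scores at all.
    """
    scores = [row[c] for c in CATEGORIES if row.get(c) is not None]
    return round(fmean(scores), 2) if scores else None


# Hypothetical scores for illustration only, not real LiveBench results.
example_row = {
    "Reasoning": 70.0, "Coding": 60.0, "Agentic": 50.0, "Math": 80.0,
    "Data": 65.0, "Language": 75.0, "Instruction": 90.0,
}
example_overall = overall_score(example_row)  # 490 / 7 = 70.0
```

Because every category gets equal weight here, a model with one very weak category is pulled down just as much as it is lifted by one very strong category. Any weighting or task-level aggregation in the official leaderboard would give different numbers.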