DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

The Problem

When More Communication Breaks the Answer

Multi-agent LLM systems have a dirty secret: the more agents talk to each other, the more likely they are to agree on the wrong answer — with rising confidence.

The dominant approach in multi-agent AI today is debate and dialogue — agents share their full reasoning, critique each other, and revise their views over multiple rounds. Frameworks like AutoGen, CAMEL, and debate-based systems all operate on this intuition: more conversation equals better coordination.

But DarkForest's authors identified a fatal flaw. When one agent makes an early mistake and broadcasts it, downstream agents don't independently verify — they get subtly pulled toward the same wrong answer. By the end, you have a unanimous, confident, wrong consensus. Agreement stops being signal and starts being noise.

The central question is not how to make agents communicate more, but how to control what information crosses agent boundaries.

— DarkForest paper, Section 1

There's also a blunt economic cost: multi-round communication inflates token consumption, latency, and inference cost at scale. In a production system running thousands of queries per day, this isn't academic; it translates directly into dollars and seconds.

The Key Insight

The Right Answer Is Already in the Room

Here is the most striking finding of the paper. The researchers checked: in how many cases does at least one agent, working independently, produce the correct answer?

The answer was: most of the time. The correct answer exists in the initial, independent outputs. The problem is that coordination methods then discard it — overwriting good independent evidence with bad consensus.

[Figure: The coordination gap, where correct answers get lost. On the MATH benchmark: percentage of examples where at least one agent had the correct answer (candidate availability, an upper bound) vs. final system accuracy, by method.]

This gap between candidate availability and final accuracy is the fundamental problem DarkForest is designed to solve. The intelligence is already there — it's the coordination layer that's losing it.
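The gap can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the toy data are assumptions rather than anything taken from the paper, and answers are compared by exact string match for simplicity.

```python
from typing import Sequence


def candidate_availability(candidates: Sequence[Sequence[str]],
                           gold: Sequence[str]) -> float:
    """Fraction of examples where at least one agent's answer equals the gold answer."""
    hits = sum(1 for answers, g in zip(candidates, gold) if g in answers)
    return hits / len(gold)


def final_accuracy(finals: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of examples where the system's final answer equals the gold answer."""
    hits = sum(1 for f, g in zip(finals, gold) if f == g)
    return hits / len(gold)


def coordination_gap(candidates, finals, gold) -> float:
    """Accuracy lost between the independent candidates and the final output."""
    return candidate_availability(candidates, gold) - final_accuracy(finals, gold)


# Toy data (made up): four questions, three independent agents each.
candidates = [["4", "5", "4"], ["7", "8", "9"], ["2", "3", "3"], ["1", "1", "6"]]
finals = ["4", "8", "3", "1"]   # what a hypothetical coordination method returned
gold = ["4", "7", "2", "0"]

print(candidate_availability(candidates, gold))  # 0.75
print(final_accuracy(finals, gold))              # 0.25
print(coordination_gap(candidates, finals, gold))  # 0.5
```

In this toy case the correct answer was available for three of four questions, but the final output kept only one of them. That difference is the quantity the paper's coordination layer is trying to shrink.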

The Solution

How DarkForest Works

DarkForest takes its name from the concept of strategic information concealment — inspired by incomplete-information game theory. Agents don't reveal their hands to each other. Instead, a neutral coordinator sees only a structured, calibrated summary of what they concluded.

The system runs in five tightly integrated stages:

1. Independent candidate generation (stage label: independent_generation)

Each agent answers the query in complete isolation — no peeking at others' responses, reasoning traces, or confidence scores. This is the foundation: preserving genuine independence means that when agents later agree, it actually means something.

2. Parsing and canonicalization (stage label: parsing + canonicalization)

Raw LLM output is messy — full of reasoning chains, formatting quirks, and inconsistencies. A task-specific parser converts each response into a compact structured record: a canonical answer, a confidence score, a parse-validity flag, and quality metadata. Long free-form text is compressed into a comparable unit.
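As a rough illustration of what such a structured record might look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `CandidateRecord` fields, the "Final answer: ... / Confidence: ..." response format, and the numeric canonicalization rules. The paper's parsers are task-specific and may work quite differently.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CandidateRecord:
    agent_id: str
    answer: Optional[str]      # canonical answer, or None if parsing failed
    confidence: float          # clamped to [0, 1]
    parse_valid: bool
    metadata: dict = field(default_factory=dict)


# Hypothetical response format: "Final answer: <x>" and "Confidence: <0..1>".
_ANSWER_RE = re.compile(r"final answer\s*[:=]\s*(.+)", re.IGNORECASE)
_CONF_RE = re.compile(r"confidence\s*[:=]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)


def canonicalize(text: str) -> str:
    """Normalize numbers ("42.0", "1,000.") and whitespace/case for other text."""
    text = text.strip().rstrip(".").replace(",", "").replace("$", "").strip()
    try:
        value = float(text)
    except ValueError:
        return " ".join(text.lower().split())
    return str(int(value)) if value.is_integer() else repr(value)


def parse_response(agent_id: str, raw: str,
                   default_confidence: float = 0.5) -> CandidateRecord:
    """Compress a free-form response into a compact, comparable record."""
    answer_match = _ANSWER_RE.search(raw)
    conf_match = _CONF_RE.search(raw)
    if answer_match is None:
        # No recognizable answer: keep the record but flag it as invalid.
        return CandidateRecord(agent_id, None, 0.0, False, {"raw_chars": len(raw)})
    confidence = float(conf_match.group(1)) if conf_match else default_confidence
    confidence = min(max(confidence, 0.0), 1.0)
    return CandidateRecord(
        agent_id,
        canonicalize(answer_match.group(1)),
        confidence,
        True,
        {"raw_chars": len(raw), "confidence_stated": conf_match is not None},
    )


raw = "Let me think. 6 * 7 = 42.\nFinal answer: 42.0\nConfidence: 0.9"
record = parse_response("agent_a", raw)
print(record.answer, record.confidence, record.parse_valid)
```

The point of the record is that "42", "42.0", and "42." all collapse to the same canonical string, so later stages can compare answers without reading the reasoning that produced them.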

3. Candidate clustering (stage label: candidate_clustering)

Semantically equivalent answers — even if worded differently — get grouped into clusters. Each cluster records not just how many agents supported it, but which agents. That identity matters: two independent, complementary agents agreeing is much stronger evidence than two near-identical models agreeing.
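A minimal sketch of the bookkeeping involved is shown below. It assumes answers have already been canonicalized and treats exact equality of the canonical string as a stand-in for semantic equivalence, which is a simplification. The `Cluster` type and `cluster_candidates` function are hypothetical names, not the paper's.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Cluster:
    answer: str
    supporters: List[str] = field(default_factory=list)  # which agents, not just how many


def cluster_candidates(
    records: Iterable[Tuple[str, Optional[str]]]
) -> List[Cluster]:
    """Group (agent_id, canonical_answer) pairs by answer, keeping agent identities."""
    clusters = {}
    for agent_id, answer in records:
        if answer is None:
            continue  # unparseable responses support no cluster
        clusters.setdefault(answer, Cluster(answer)).supporters.append(agent_id)
    # Largest support first; ties broken by answer text for a stable order.
    return sorted(clusters.values(), key=lambda c: (-len(c.supporters), c.answer))


records = [("agent_a", "42"), ("agent_b", "42"), ("agent_c", "41"), ("agent_d", None)]
for cluster in cluster_candidates(records):
    print(cluster.answer, cluster.supporters)
```

Keeping the supporter list, rather than a bare count, is what lets a later stage ask whether the agents that agreed were independent of one another.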

4. Calibrated belief construction (stage label: calibrated_belief)

This is the mathematical core. Each candidate cluster gets a weighted evidence score that accounts for agent historical reliability, parse quality, inter-agent independence, confidence, and the reliability of the specific coalition that agreed. Simple majority voting is replaced by a nuanced probability distribution.

5. Controlled disclosure and guardrail (stage label: controlled_disclosure + guardrail)

The coordinator LLM receives only a policy-filtered summary — parsed candidates, support patterns, posterior probabilities — not raw reasoning. A deterministic guardrail then overrides the coordinator's output only when the belief state very strongly favors a different candidate. Precision over verbosity.
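The two pieces of this stage can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the summary fields, the function names, and especially the `override_threshold` value of 0.9 are hypothetical, since the article only says the guardrail fires when the belief state "very strongly" favors another candidate.

```python
from typing import Dict, List, Tuple


def build_disclosure(clusters: Dict[str, List[str]],
                     posterior: Dict[str, float]) -> List[dict]:
    """Policy-filtered summary for the coordinator: answers, supporters, posteriors.

    Raw reasoning traces are deliberately absent from the output.
    """
    ordered = sorted(clusters.items(), key=lambda kv: -posterior.get(kv[0], 0.0))
    return [
        {"answer": answer,
         "supporters": sorted(agents),
         "posterior": round(posterior.get(answer, 0.0), 3)}
        for answer, agents in ordered
    ]


def apply_guardrail(coordinator_answer: str,
                    posterior: Dict[str, float],
                    override_threshold: float = 0.9) -> Tuple[str, bool]:
    """Deterministic override: returns (final_answer, was_overridden)."""
    if not posterior:
        return coordinator_answer, False
    top_answer = max(posterior, key=posterior.get)
    if top_answer != coordinator_answer and posterior[top_answer] >= override_threshold:
        return top_answer, True
    return coordinator_answer, False


clusters = {"41": ["agent_c"], "42": ["agent_a", "agent_b"]}
posterior = {"42": 0.95, "41": 0.05}
print(build_disclosure(clusters, posterior))
print(apply_guardrail("41", posterior))  # ('42', True)
```

With a weaker belief state, say 0.6 vs. 0.4, the same call leaves the coordinator's answer untouched. The guardrail is meant to catch clear contradictions, not to second-guess every decision.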

Under the Hood

The Calibrated Belief Formula

The scoring function at the heart of DarkForest replaces naive vote-counting with a multi-factor weighted score:

s(z) = R_π × Σ [ α_i · ρ_i · δ_i · φ(c_i) ]

R_π = coalition reliability
α_i = agent reliability
ρ_i = parse quality penalty
δ_i = independence correction
φ(c_i) = confidence multiplier

The independence correction δ is particularly clever. If two agents share the same base model or training distribution, their agreement carries less statistical weight. DarkForest discounts correlated agents so they can't double-count evidence they effectively share. Two diverse experts agreeing means more than two identical agents agreeing.
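To see how the score and the independence correction interact, here is a small sketch of the formula. Several parts are assumptions made for illustration and are not specified in the article: δ is taken to be 1/k for an agent whose base-model family appears k times in the cluster, φ(c) is taken to be 0.5 + c, and scores are turned into probabilities by simple normalization. The `Vote` type and all function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Vote:
    agent_id: str
    family: str        # base-model family, used only for the independence correction
    alpha: float       # agent reliability
    rho: float         # parse quality penalty in [0, 1]
    confidence: float  # self-reported confidence c_i


def independence_weights(votes: List[Vote]) -> Dict[str, float]:
    """Assumed form of delta: agents sharing a family split one unit of weight."""
    family_counts = Counter(v.family for v in votes)
    return {v.agent_id: 1.0 / family_counts[v.family] for v in votes}


def cluster_score(votes: List[Vote], coalition_reliability: float,
                  phi: Callable[[float], float] = lambda c: 0.5 + c) -> float:
    """s(z) = R_pi * sum_i alpha_i * rho_i * delta_i * phi(c_i) over a cluster's votes."""
    delta = independence_weights(votes)
    total = sum(v.alpha * v.rho * delta[v.agent_id] * phi(v.confidence) for v in votes)
    return coalition_reliability * total


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn non-negative scores into a probability distribution (assumed scheme)."""
    z = sum(scores.values())
    if z <= 0:
        return {k: 0.0 for k in scores}
    return {k: s / z for k, s in scores.items()}


# Two near-identical agents agree on "42"; two diverse agents agree on "41".
same_family = [Vote("a1", "model_x", 0.8, 1.0, 0.5), Vote("a2", "model_x", 0.8, 1.0, 0.5)]
diverse = [Vote("b1", "model_y", 0.8, 1.0, 0.5), Vote("b2", "model_z", 0.8, 1.0, 0.5)]

scores = {"42": cluster_score(same_family, 1.0), "41": cluster_score(diverse, 1.0)}
print(scores)
print(normalize(scores))
```

Both clusters have two supporters with identical reliability and confidence, so a plain majority vote would call it a tie. Under the assumed correction, the diverse pair ends up with twice the score of the correlated pair.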

Experimental Results

The Numbers That Make the Case

The system was tested across six reasoning domains — math, code generation, general knowledge, scientific QA, finance, and law — against six coordination baselines including debate, round-table, and self-consistency methods.

30.7%: maximum improvement over the strongest baseline

6.5×: reduction in token consumption vs. communication-heavy methods

6: reasoning benchmarks evaluated, spanning math, code, general knowledge, science, finance, and law

[Figure: Accuracy across six reasoning benchmarks. DarkForest vs. the best-performing baseline per domain (representative values based on paper results).]

[Figure: Token consumption per sample. DarkForest achieves leading accuracy while consuming far fewer tokens than communication-heavy baselines (MATH dataset). Series: DarkForest, communication-heavy baselines, moderate baselines.]

[Figure: Communication overhead vs. final accuracy, the efficiency frontier. DarkForest is the only method that achieves both high accuracy and low token cost simultaneously. Groups: DarkForest (ideal zone), moderate methods, high-communication baselines.]

At a Glance

How DarkForest Compares

Method	Communication style	Error propagation risk	Token cost	Accuracy
Single agent	None	—	Very low	Baseline
Self-consistency / majority vote	None (parallel)	Low	Moderate	Moderate
CAMEL / AutoGen (multi-round)	Full reasoning traces	High	Very high	Moderate
Debate (Du et al.)	Multi-round debate	High	High	Moderate–High
Graph-of-Agents	Structured message passing	Moderate	Moderate–High	High
DarkForest (new)	Controlled belief summary only	Very low	Low	Leading

Implications

Why This Matters for Agentic AI Builders

The practical impact of DarkForest extends well beyond the benchmarks tested. If you're building any system where multiple LLM agents collaborate — think CrewAI pipelines, orchestrator-worker architectures, or multi-model ensembles — the core lesson applies directly.

The conventional instinct is to wire agents together tightly: let them read each other's work, debate, revise. DarkForest shows that this instinct can actively hurt you. Every message passed between agents is a potential vector for error amplification, and the longer the chain, the higher the compounding risk.

The smarter design is to keep individual agents doing what they do best — reasoning independently — and then use a principled aggregation layer that treats their outputs as statistical evidence, not as ground truth to be merged naively.

For enterprise deployments: at scale, a 6.5× token reduction isn't just a performance win — it's a direct cost reduction that compounds across every query, every user, every day. DarkForest's approach makes multi-agent reasoning economically viable for high-volume production workloads.

The paper also opens an important theoretical lens: viewing multi-agent coordination as an incomplete information game, where the central problem is not "how do agents talk?" but "what is the optimal information policy?" This framing is far more rigorous than existing approaches and provides a foundation for future work on structured agent coordination.

The Takeaway

Silent Agents, Smarter Systems

DarkForest makes a counterintuitive argument with convincing evidence: in multi-agent AI systems, communication is not free, and more of it is not always better. The right answer is already in the room — the job of the coordination layer is to find it, not to talk over it.

By treating agent outputs as structured evidence to be calibrated rather than text to be merged, DarkForest achieves the rare combination of state-of-the-art accuracy and dramatically lower inference cost across six diverse reasoning domains.

The forest is dark for a reason. Sometimes silence is the sharpest strategy.

Read the original paper:
Li et al., "DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs," arXiv:2605.25188, May 2026.
Code: github.com/PearLoveTana/DarkForest_Review