AI research · benchmark · deep research

Same brief, five tools, two judges: a deep research case study

Ajin Kabeer · May 5, 2026


This is a case study, not a definitive benchmark. One brief. Five tools. Two independent judges who scored the outputs without seeing each other's results. We're publishing it because the patterns that emerged are worth discussing - and because we'll run it again on different topics to see what holds.

The question we started with: every AI tool does web search now, so the interesting test isn't whether a tool can find information - it's whether what it finds shapes what it looks for next.


The brief

The loneliness infrastructure audit - how America's physical environment has been optimized for commercial efficiency at the expense of human connection. Drive-thru-ification, hostile architecture, the $406B economic cost of social isolation. We picked this topic deliberately: it rewards going beyond the obvious and punishes confident summarization. A brief where the interesting findings aren't in the prompt.

| Provider | Process | Word count | Full trail visible |
| --- | --- | --- | --- |
| Meaningful | Scout + 3 parallel deep research runs + synthesis | ~6,200 words | Yes |
| Claude.ai | Single response | ~3,700 words | No |
| ChatGPT | Single response | ~3,800 words | No |
| Perplexity | Single response | ~2,800 words | No |
| Gemini | Single response | ~3,700 words | No |

We scored on six dimensions: depth of findings beyond the brief, emergent insights, structural rigor (decision frameworks, not just prose), citation quality, evidentiary honesty (naming what the evidence doesn't prove), and actionability.


The scores

One methodological note: Claude Sonnet 4.6 was one of our judges, and our pipeline is built on Claude. So we also ran the same rubric through GPT-5.4 as a second independent judge. Neither saw the other's results. Where they differed, GPT-5.4 scored Meaningful higher, not lower. ChatGPT scored identically across both judges.

Note on citation scores: ChatGPT and Gemini both provide inline citations in their native interfaces. The low citation scores reflect how those citations appear in exported markdown - as opaque bracketed tokens rather than live hyperlinks. It's a format limitation, not an absence of sourcing.

Claude Sonnet 4.6 as judge

| Provider | Depth | Emergent Insights | Structure | Citations | Honesty | Actionability | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Meaningful | 9 | 8 | 9 | 9 | 9 | 9 | 53/60 |
| Claude.ai | 9 | 9 | 4 | 5 | 10 | 4 | 41/60 |
| ChatGPT | 8 | 5 | 9 | 2 | 6 | 7 | 37/60 |
| Perplexity | 8 | 5 | 6 | 4 | 3 | 4 | 30/60 |
| Gemini | 7 | 4 | 8 | 3 | 2 | 4 | 28/60 |

GPT-5.4 as judge

| Provider | Depth | Emergent Insights | Structure | Citations | Honesty | Actionability | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Meaningful | 10 | 9 | 10 | 9 | 9 | 10 | 57/60 |
| Claude.ai | 9 | 9 | 4 | 4 | 10 | 4 | 40/60 |
| ChatGPT | 8 | 5 | 9 | 2 | 6 | 7 | 37/60 |
| Perplexity | 7 | 4 | 5 | 3 | 4 | 3 | 26/60 |
| Gemini | 8 | 4 | 7 | 3 | 2 | 4 | 28/60 |

What this particular brief revealed

We're careful not to over-generalize from one experiment. But a few patterns showed up clearly enough to be worth naming.

The difference isn't whether they searched - it's how

All five tools did live web searches. Claude surfaced that Americans' daily time with friends dropped from 60 to 26 minutes over 20 years. Gemini found that bowling league membership fell from 9.8M to under 1.5M. Real findings, properly cited in their native interfaces.

The difference on this brief was the search strategy itself. A single-pass tool fixes what it's looking for the moment it reads the brief. Our pipeline does a broad reconnaissance first, reads what came back, then asks: given what we now know, what's still missing that would change the recommendation? The second and third deep research runs are shaped by what the first one returned.

The Wharton finding in our output - that each 1% increase in delivery app penetration correlates with a 1.6% increase in restaurant closure rates - came from a targeted run dispatched because the reconnaissance flagged third-place closures as a gap worth quantifying. Whether that pattern holds on different topics is something we're still testing.

Structure and insight pulled apart

On this brief, the structure scores split the field in an interesting way. ChatGPT and Meaningful both scored 9-10 - ChatGPT produced a prioritized intervention table with stakeholders, costs, and impact ratings. Claude.ai scored a 4; its recommendations appeared under a section it labeled "optional."

But ChatGPT scored a 5 on emergent insights, assembling the brief's themes without surfacing many conclusions that required research to discover. Claude.ai scored a 9 on the same dimension - its observation that Starbucks' seat-removal reversal was a "structural admission that friction was the product" is the kind of synthesis that justifies commissioning secondary research.

Structure without insight is a well-organized summary. Insight without structure is analysis that can't drive a decision. On this brief, only our output scored highly on both.

The counterintuitive result: Claude.ai was the most honest

Claude.ai scored a perfect 10 on evidentiary honesty - the only perfect score from either judge across all five providers.

It called out that the $406B figure is an advocacy estimate, not a primary-source number. It dedicated a full section to the counterargument that loneliness rates may not actually be rising on validated survey scales. It named what the data doesn't prove.

Gemini scored a 2 on the same dimension, presenting the "engineering outcome" thesis as settled fact without naming a causal gap.

Our output scored a 9. We dedicated a full research angle to what the evidence does not yet prove - specifically that no controlled study has isolated the effect of removing micro-interactions on loneliness outcomes. That acknowledgment came from actually looking for the evidence and not finding it, which is a different kind of honesty than being careful with language.


What we think it means - with the caveat that it's one experiment

The gap in scores on this brief wasn't about which model is smarter or which prompt was better written. Claude.ai used Claude. We also used Claude. The model isn't the variable.

On this topic, the variable was process. Single-pass tools produce one output from one search pass. Our pipeline produces a trail - scout findings, individual angle research, final synthesis - because each of those things actually happened in sequence, each step shaped by the one before it.

We're running this again on two different briefs to see whether the pattern holds. We'll publish those results regardless of what comes back.

Ajin