Strong on Final Diagnosis, Blind at the Start

April 21, 2026

Can AI Think Like a Clinician? A JAMA Network Open Study Tests 21 Models and Finds a Dangerous Blind Spot

Background: The Promise and the Problem

Large language models are no longer a distant technology. They are actively being marketed to health systems, private practices, and clinical leaders as tools capable of supporting — and in some cases, supplanting — physician diagnostic reasoning. Yet rigorous, clinically meaningful evaluations of their true performance have lagged behind the pace of commercial deployment.

Most existing AI benchmarks rely on multiple-choice examination formats derived from medical licensing examinations. These assessments, while useful for gauging factual recall, bear little resemblance to the iterative, uncertainty-laden process of real patient care. A clinician seeing a new patient does not choose from four pre-selected answers. She generates hypotheses, weighs probabilistic evidence, orders targeted diagnostics, refines a working diagnosis, and formulates a management plan — often simultaneously and under significant time pressure.

Study Design: A More Demanding Standard

The investigators evaluated 21 frontier LLMs (including GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash, Gemini 3.0 Pro, and Grok 4) using 29 standardized clinical vignettes drawn from the January 2025 update of the MSD Manual. Each model worked through every vignette sequentially across five domains of clinical reasoning: differential diagnosis, diagnostic testing, final diagnosis, clinical management, and miscellaneous clinical reasoning questions. Models were scored in triplicate by trained medical student evaluators, yielding a total of 16,254 individual responses.

To move beyond the bluntness of raw accuracy scores, the authors introduced a novel composite metric: the Proportional Index of Medical Evaluation for Large Language Models (PrIME-LLM). Defined as the normalized polygonal area representing balanced accuracy across all five clinical reasoning domains, the PrIME-LLM score was designed to penalize uneven performance — rewarding models that demonstrate consistent clinical competence rather than excelling narrowly in one area while failing in others.

Key Findings: High Marks at the Finish Line, Failure at the Starting Gate

The central finding of this study is both clinically significant and counterintuitive. Across all 21 models tested, performance on final diagnosis and management tasks was relatively strong, with failure rates ranging from 0.09 to 0.39 (9% to 39%). By contrast, performance on differential diagnosis, arguably the most cognitively demanding and clinically consequential stage of the diagnostic process, was remarkably poor. Differential diagnosis failure rates exceeded 0.80 in every model (reported range, 0.90 to 1.00), meaning that even the best-performing model generated an appropriate differential in no more than about one out of ten attempts.

This discrepancy is not a trivial finding. Differential diagnosis is the intellectual cornerstone of clinical medicine. The ability to hold multiple competing hypotheses simultaneously, assign appropriate prior probabilities, and narrow systematically based on emerging evidence is what separates expert clinicians from pattern-matchers. These data suggest that current LLMs may simulate the endpoint of clinical reasoning without having mastered the process.

Model Performance: Who Led, Who Lagged

Among the five model families evaluated — GPT, Claude, DeepSeek, Gemini, and Grok — performance was meaningfully stratified. PrIME-LLM scores ranged from 0.64 (Gemini 1.5 Flash) to 0.78 (Grok 4), with reasoning-optimized models outperforming nonreasoning models and GPT models scoring highest overall.

The advantage conferred by reasoning-optimized architecture was one of the more actionable findings in the study. Models specifically designed to engage in extended chain-of-thought processing consistently outperformed their nonreasoning counterparts across clinical domains. This suggests that the design of an AI system, and not merely the volume of its training data, may be a meaningful determinant of clinical utility.

Multimodal capability also proved relevant. Most models demonstrated improved accuracy when image inputs accompanied text-based clinical scenarios, a finding with practical implications for specialties in which visual data are central to clinical decision-making, such as radiology, dermatology, ophthalmology, and pathology.

The PrIME-LLM Advantage: Why Composite Scoring Matters

One of the most durable contributions of this study may be methodological. The authors argue persuasively that raw accuracy scores, as conventionally reported, can be misleading when evaluating AI for clinical purposes. A model that achieves high accuracy on final diagnosis while failing systematically at differential generation is not a safe clinical tool — it simply appears to be one under traditional metrics.

The PrIME-LLM framework addresses this directly. By computing a normalized polygonal area across all five reasoning domains, the metric captures the shape of a model's performance profile, not just its average. A model with uneven strengths is penalized relative to one with balanced competence. As the study notes, "the PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks."
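To make the "shape, not just average" idea concrete, here is a minimal sketch of one way a normalized polygonal area can be computed. It assumes the five domain scores are plotted on equally spaced radar-chart axes and that the polygon's area is divided by the area obtained when every score is 1. This is an illustration of the general technique only, not the study's published formula: the function name normalized_polygon_area and the two score profiles are hypothetical and are not taken from the paper.

```python
import math


def normalized_polygon_area(scores):
    """Area of the radar-chart polygon for `scores`, divided by the maximum area.

    Assumption (not from the study): axes are equally spaced, each score lies
    in [0, 1], and the maximum area is the polygon with every score equal to 1.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three axes to form a polygon")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    # Each pair of neighbouring axes spans a triangle of area
    # 0.5 * sin(2*pi/n) * r_i * r_(i+1).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # all scores equal to 1
    return area / max_area


# Hypothetical profiles with the same mean score (0.70).
balanced = [0.70, 0.70, 0.70, 0.70, 0.70]
uneven = [0.10, 0.85, 0.85, 0.85, 0.85]  # one weak domain, four strong ones

print(sum(balanced) / 5, normalized_polygon_area(balanced))
print(sum(uneven) / 5, normalized_polygon_area(uneven))
```

Both hypothetical profiles have the same mean, but the uneven one yields a smaller normalized area, which is the penalty for lopsided performance described above. Under this particular construction the result also depends on the order in which domains are placed around the chart, so any real comparison would need a fixed axis order.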

For clinicians and administrators evaluating AI vendors, this framework offers a more honest lens through which to assess clinical-grade AI claims.

Implications for Private Practice and Clinical Leadership

The findings of this study carry direct relevance for physicians and practice administrators who are currently evaluating, piloting, or deploying AI-assisted clinical decision support tools. Several practical implications merit attention.

First, the performance gap in differential diagnosis is not a minor limitation — it represents a failure at the most upstream and consequential stage of clinical reasoning. AI tools marketed as diagnostic support should be held to rigorous, domain-specific standards before being integrated into patient-facing workflows.

Second, the study underscores the inadequacy of benchmark examinations as proxies for real-world clinical performance. Physicians evaluating vendor claims that rest on USMLE Step 1 scores or similar licensing-examination benchmarks should be aware that such metrics may substantially overstate clinical utility.

Third, the study authors are explicit about the bottom line: "despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning."

Conclusions

This rigorous cross-sectional evaluation of 21 frontier LLMs across 16,254 clinical reasoning responses delivers an important and sobering message for medicine's AI moment. While incremental improvements are evident across model generations, and while reasoning-optimized architectures offer measurable advantages, no currently available off-the-shelf model demonstrates the balanced, full-spectrum clinical reasoning required for autonomous or unsupervised patient-facing deployment.

As the authors conclude, "current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making."

For physicians and health system leaders, the message is clear: AI may soon be a powerful partner in clinical reasoning, but that moment has not yet arrived. In the interim, the imperative is to deploy these tools in appropriately supervised, clearly scoped roles — and to demand the kind of multidimensional, domain-specific benchmarking that this study demonstrates is both possible and necessary.

Read the original article in JAMA Network Open.