Can AI Think Like a Clinician? A JAMA Study Tests 21 Models — and Finds a Dangerous Blind Spot
Background: The Promise and the Problem
Large language models are no longer a distant technology. They are actively being marketed to health systems, private practices, and clinical leaders as tools capable of supporting — and in some cases, supplanting — physician diagnostic reasoning. Yet rigorous, clinically meaningful evaluations of their true performance have lagged behind the pace of commercial deployment.
Most existing AI benchmarks rely on multiple-choice examination formats derived from medical licensing examinations. These assessments, while useful for gauging factual recall, bear little resemblance to the iterative, uncertainty-laden process of real patient care. A clinician seeing a new patient does not choose from four pre-selected answers. She generates hypotheses, weighs probabilistic evidence, orders targeted diagnostics, refines a working diagnosis, and formulates a management plan — often simultaneously and under significant time pressure.
Study Design: A More Demanding Standard
The investigators evaluated 21 frontier LLMs, including GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash, Gemini 3.0 Pro, and Grok 4, using 29 standardized clinical vignettes drawn from the January 2025 update of the MSD Manual. Each model worked through every vignette sequentially across five domains of clinical reasoning: differential diagnosis, diagnostic testing, final diagnosis, clinical management, and miscellaneous clinical reasoning questions. Models were scored in triplicate by trained medical student evaluators, yielding a total of 16,254 individual responses.
To move beyond the bluntness of raw accuracy scores, the authors introduced a novel composite metric: the Proportional Index of Medical Evaluation for Large Language Models (PrIME-LLM). Defined as the normalized polygonal area representing balanced accuracy across all five clinical reasoning domains, the PrIME-LLM score was designed to penalize uneven performance — rewarding models that demonstrate consistent clinical competence rather than excelling narrowly in one area while failing in others.
Key Findings: High Marks at the Finish Line, Failure at the Starting Gate
The central finding of this study is both clinically significant and counterintuitive. Across all 21 models tested, performance on final diagnosis and management tasks was relatively strong, with failure rates ranging from 9% to 39%. By contrast, performance on differential diagnosis, arguably the most cognitively demanding and clinically consequential stage of the diagnostic process, was remarkably poor. Failure rates for differential diagnosis exceeded 80% in every model (reported range, 90% to 100%), meaning that even the best-performing model generated an appropriate differential in no more than about one of every ten attempts.
This discrepancy is not a trivial finding. Differential diagnosis is the intellectual cornerstone of clinical medicine. The ability to hold multiple competing hypotheses simultaneously, assign appropriate prior probabilities, and narrow systematically based on emerging evidence is what separates expert clinicians from pattern-matchers. These data suggest that current LLMs may simulate the endpoint of clinical reasoning without having mastered the process.
Model Performance: Who Led, Who Lagged
Among the five model families evaluated (GPT, Claude, DeepSeek, Gemini, and Grok), performance was meaningfully stratified. PrIME-LLM scores for individual models ranged from 0.64 (Gemini 1.5 Flash) to 0.78 (Grok 4). Reasoning-optimized models outperformed nonreasoning models, and GPT models, taken together, scored highest overall.
The advantage conferred by reasoning-optimized design was one of the more actionable findings in the study. Models specifically designed to engage in extended chain-of-thought processing consistently outperformed their nonreasoning counterparts across clinical domains. This suggests that how an AI system is designed to reason, and not merely the volume of its training data, may be a meaningful determinant of clinical utility.
Multimodal capability also proved relevant. Most models demonstrated improved accuracy when image inputs accompanied text-based clinical scenarios, a finding with practical implications for specialties in which visual data are central to clinical decision-making, such as radiology, dermatology, ophthalmology, and pathology.
The PrIME-LLM Advantage: Why Composite Scoring Matters
One of the most durable contributions of this study may be methodological. The authors argue persuasively that raw accuracy scores, as conventionally reported, can be misleading when evaluating AI for clinical purposes. A model that achieves high accuracy on final diagnosis while failing systematically at differential generation is not a safe clinical tool — it simply appears to be one under traditional metrics.
The PrIME-LLM framework addresses this directly. By computing a normalized polygonal area across all five reasoning domains, the metric captures the shape of a model's performance profile, not just its average. A model with uneven strengths is penalized relative to one with balanced competence. As the study notes, "the PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks."
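To make the geometry concrete, here is a minimal Python sketch of a normalized radar-polygon area. It is an illustration under stated assumptions, not the authors' implementation: the function name normalized_polygon_area, the equal-angle axis layout, the ordering of the domains, and both score profiles are hypothetical, and the study's exact normalization is not reproduced here. The two profiles share the same mean accuracy, yet the one with a weak first axis (standing in for differential diagnosis) encloses less area.

```python
import math


def normalized_polygon_area(scores):
    """Area of a radar-chart polygon divided by the area of the full polygon.

    Assumption (not taken from the study): each score lies in [0, 1] and is
    plotted on one of n equally spaced axes, in the order given.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie between 0 and 1")

    # Two neighbouring axes and the origin form a triangle of area
    # 0.5 * r_i * r_j * sin(angle between the axes).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))

    # A model scoring 1.0 on every axis fills n full wedges.
    return area / (n * wedge)


# Hypothetical profiles, ordered as: differential diagnosis, diagnostic
# testing, final diagnosis, management, miscellaneous.
balanced = [0.70, 0.70, 0.70, 0.70, 0.70]
uneven = [0.10, 0.85, 0.85, 0.85, 0.85]

for name, profile in (("balanced", balanced), ("uneven", uneven)):
    mean = sum(profile) / len(profile)
    area = normalized_polygon_area(profile)
    print(f"{name}: mean accuracy {mean:.4f}, normalized area {area:.4f}")
# balanced: mean accuracy 0.7000, normalized area 0.4900
# uneven: mean accuracy 0.7000, normalized area 0.4675
```

A raw average cannot tell these two hypothetical profiles apart, while the area can. Note that a polygonal area also depends on how the axes are ordered around the chart, so under this particular formula the size of the penalty for a weak domain varies with which domains sit next to it.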
For clinicians and administrators evaluating AI vendors, this framework offers a more honest lens through which to assess clinical-grade AI claims.
Implications for Private Practice and Clinical Leadership
The findings of this study carry direct relevance for physicians and practice administrators who are currently evaluating, piloting, or deploying AI-assisted clinical decision support tools. Several practical implications merit attention.
First, the performance gap in differential diagnosis is not a minor limitation — it represents a failure at the most upstream and consequential stage of clinical reasoning. AI tools marketed as diagnostic support should be held to rigorous, domain-specific standards before being integrated into patient-facing workflows.
Second, the study underscores the inadequacy of benchmark examinations as proxies for real-world clinical performance. Physicians weighing vendor claims that rest on USMLE Step 1 scores or similar licensing-examination analogues should be aware that such metrics may substantially overstate clinical utility.
Third, the study authors are explicit about the bottom line: "despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning."
Conclusions
This rigorous cross-sectional evaluation of 21 frontier LLMs across 16,254 clinical reasoning responses delivers an important and sobering message for medicine's AI moment. While incremental improvements are evident across model generations, and while reasoning-optimized architectures offer measurable advantages, no currently available off-the-shelf model demonstrates the balanced, full-spectrum clinical reasoning required for autonomous or unsupervised patient-facing deployment.
As the authors conclude, "current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making."
For physicians and health system leaders, the message is clear: AI may soon be a powerful partner in clinical reasoning, but that moment has not yet arrived. In the interim, the imperative is to deploy these tools in appropriately supervised, clearly scoped roles — and to demand the kind of multidimensional, domain-specific benchmarking that this study demonstrates is both possible and necessary.
