AI Scribes Capture More Symptoms—But Treat Fewer Patients

February 24, 2026 · 6 min

AI Ambient Scribes in Primary Care: A Documentation Paradox With Psychiatric Consequences

A cohort study published in JAMA Psychiatry finds that ambient AI scribes are associated with significantly richer neuropsychiatric documentation—yet paradoxically lower rates of depression-related clinical intervention. The findings raise fundamental questions about the relationship between AI-generated clinical notes and the quality of mental health care in primary care settings.

Background: AI Scribes and the Documentation Promise

Artificial intelligence–driven ambient scribes—tools that use speech recognition and large language models to automatically generate narrative clinical notes from recorded patient encounters—have achieved remarkably rapid adoption across health systems in the United States. Promoted primarily as solutions to clinician burnout and documentation burden, these tools have garnered widespread enthusiasm among physicians exhausted by the demands of the electronic health record.

To date, most investigations of ambient AI scribes have examined their effects on clinician-facing outcomes: time spent in the EHR, self-reported satisfaction, and documentation efficiency. Evidence in these domains has been mixed. One study suggested clinicians spent an average of 5 fewer minutes per visit on the EHR when using ambient scribes, while others yielded inconsistent results regarding productivity gains. Critically absent from the literature has been any systematic examination of whether AI-generated notes change how physicians actually practice medicine—particularly in domains as consequential as psychiatric care.

A new cohort study from Massachusetts General Hospital and Harvard Medical School, published in JAMA Psychiatry, addresses this gap directly. The findings are encouraging and alarming in equal measure.

Study Design: A Matched Four-Group Comparison

Castro, McCoy, Verhaak, Ramachandiran, and Perlis drew upon EHR data from two large academic health systems in eastern Massachusetts—Massachusetts General Hospital and Brigham and Women's Hospital—to examine 20,302 outpatient primary care annual visit notes. Notes were collected between February 2023 and February 2025, spanning the period of ambient AI scribe deployment across these systems.

The investigators used a matched retrospective case-control design, creating four parallel groups of approximately 5,075 visits each, matched on age, sex, self-reported race, and prior depression diagnosis: visits using an ambient AI scribe, visits using a human virtual scribe, contemporaneous unscribed visits occurring during the same period, and prior-year unscribed visits from before AI scribe deployment. Matching additionally on clinician and visit-year cohort helped isolate scribe-related effects from practice drift and temporal confounding.

To quantify psychiatric documentation, the investigators applied a HIPAA-compliant large language model (GPT-4o, hosted via Microsoft Azure) to each clinical narrative, generating estimated scores across all six National Institute of Mental Health Research Domain Criteria (RDoC) dimensions: negative valence, positive valence, cognitive systems, social processes, arousal and regulatory systems, and sensorimotor systems.
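To make the scoring pipeline concrete, here is a minimal sketch of how narrative notes could be scored against the six RDoC domains via an LLM. The prompt text, the `score` scale, and the `parse_scores` helper are illustrative assumptions, not the authors' published implementation; the actual GPT-4o call is stubbed out, since the study's prompt and deployment details are not reproduced here.

```python
# Hypothetical sketch of LLM-based RDoC scoring. The prompt wording, 0-5 scale,
# and helper names are assumptions for illustration, not the study's pipeline.
import json

RDOC_DOMAINS = [
    "negative_valence", "positive_valence", "cognitive_systems",
    "social_processes", "arousal_regulatory", "sensorimotor",
]

PROMPT_TEMPLATE = (
    "Rate the following clinical note on each NIMH RDoC domain from 0 (no "
    "documented symptoms) to 5 (severe). Return JSON with keys: "
    + ", ".join(RDOC_DOMAINS) + ".\n\nNote:\n{note}"
)

def parse_scores(llm_reply: str) -> dict:
    """Validate an LLM JSON reply into a {domain: float} score dict."""
    scores = json.loads(llm_reply)
    missing = [d for d in RDOC_DOMAINS if d not in scores]
    if missing:
        raise ValueError(f"LLM reply missing domains: {missing}")
    return {d: float(scores[d]) for d in RDOC_DOMAINS}

# Stubbed reply standing in for an actual GPT-4o call (no API access here).
stub_reply = json.dumps(
    {d: 1.0 for d in RDOC_DOMAINS} | {"negative_valence": 2.0}
)
print(parse_scores(stub_reply))
```

The validation step matters in practice: LLM outputs can omit keys or return non-numeric values, so a scoring pipeline at this scale (20,000+ notes) needs explicit schema checks before aggregation.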

Key Finding I: AI Scribes Dramatically Increase Documented Psychiatric Symptom Burden

Across all six RDoC domains, AI-scribed notes showed significantly higher symptom scores compared with every comparator group (P < .001 for all contrasts). In the negative valence domain—most directly relevant to depression—mean scores were 2.05 in AI-scribed notes versus 1.79 for human-scribed notes and 1.57 for contemporaneous unscribed notes. Arousal domain scores were 2.84 with AI scribes compared with 2.05 without a scribe. Sensorimotor scores were 2.33 versus 1.54 in the unscribed contemporaneous group.

AI-scribed notes were also substantially longer: a mean of 13,629 characters compared with 7,932 characters for contemporaneous unscribed notes and 7,489 characters for prior-year notes. Human-scribed notes were the longest at 16,252 characters.

The authors frame this finding as a potential opportunity for improving care:

"Our results are reassuring, suggesting that AI scribes in primary care have the potential to increase documentation of neuropsychiatric symptoms."

Key Finding II: More Documentation, Less Action

Despite greater documented psychiatric symptom burden, AI-scribed visits were significantly less likely to result in a psychiatric intervention. The composite outcome—defined as the presence of any depression-related ICD-10 code, new antidepressant prescription, or behavioral health referral—occurred in only 708 visits (14%) in the AI scribe group, compared with 843 (17%) in human-scribed visits, 855 (17%) in contemporaneous unscribed visits, and 805 (16%) in prior-year unscribed visits. All contrasts with the AI scribe group were statistically significant (Bonferroni-corrected P < .001).

In the multivariable logistic regression model adjusted for age, sex, race, ethnicity, insurance, education, and prior depression diagnosis, the adjusted odds ratio for any psychiatric intervention at AI-scribed versus contemporaneous unscribed visits was 0.83 (95% CI, 0.72–0.95). By contrast, no significant difference was observed between human-scribed and unscribed visits (aOR, 0.97; 95% CI, 0.85–1.11).
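For readers less familiar with odds-ratio arithmetic, the sketch below computes a crude (unadjusted) odds ratio with a Woolf 95% CI from the published intervention counts. This is illustrative arithmetic only: the study's reported aOR of 0.83 comes from a multivariable model adjusting for demographics, insurance, education, and prior depression diagnosis, so the unadjusted figure differs slightly.

```python
# Crude odds ratio with Woolf 95% CI from a 2x2 table.
# Unadjusted sketch only; the study's aOR of 0.83 is model-adjusted.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and Woolf CI for a 2x2 table: exposed (a events / b non-events)
    vs unexposed (c events / d non-events)."""
    or_est = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_est) - z * se_log)
    hi = math.exp(math.log(or_est) + z * se_log)
    return or_est, lo, hi

# AI-scribed: 708 interventions of 5,075 visits; contemporaneous unscribed:
# 855 of 5,075 (counts as reported in the study).
or_est, lo, hi = odds_ratio_ci(708, 5075 - 708, 855, 5075 - 855)
print(f"crude OR = {or_est:.2f} (95% CI, {lo:.2f}-{hi:.2f})")
```

With these counts the crude OR works out to roughly 0.80 (95% CI, about 0.72–0.89), in the same direction and magnitude as the adjusted estimate of 0.83.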

Depression-related ICD-10 codes were assigned at only 9% of AI-scribed visits, compared with 12% of human-scribed and unscribed contemporaneous visits. New antidepressant prescriptions were initiated at 1% of AI-scribed visits versus 2% of prior-year unscribed visits.

The authors articulate the central tension in these findings plainly:

"In this study examining clinical documentation from more than 20,000 outpatient annual visits, including roughly 5,000 incorporating AI scribes, we found that use of these scribes was associated with greater documented levels of neuropsychiatric symptoms compared with the use of human scribes or no scribe but lesser likelihood of a depression intervention."

Mechanistic Hypotheses: The Autopilot Analogy

The authors propose a compelling—and unsettling—mechanistic hypothesis for this dissociation between documentation richness and clinical responsiveness. Drawing an analogy to aviation, they suggest that automating documentation may paradoxically reduce the cognitive engagement of the clinician:

"One explanation for this association could be that automating documentation leads clinicians to be less active in general, analogous to reduced proficiency observed in pilots after the emergence of autopilot."

This hypothesis implies that the act of documenting—when performed manually—may itself reinforce clinical attention and prompt therapeutic decision-making. When that cognitive labor is offloaded to an AI, the loop between observation and action may be disrupted, even as the note itself becomes more thorough. The effect was specific to AI scribing: human scribes showed no analogous reduction in psychiatric intervention rates, suggesting that the mechanism may relate to the nature of automated versus active documentation.

Implications for Practice and Health System Leaders

The study's findings demand careful consideration by any physician or health system administrator who has deployed or is considering deploying ambient AI scribes. The authors are deliberate in their call for further investigation:

"The rapid dissemination of AI scribes in medicine poses both an opportunity and a risk... many interventions in medicine have been adopted without clear evidence of benefit—particularly those, like scribes, that do not require formal regulatory review to establish effectiveness."

For primary care physicians, the practical implications are immediate. If AI-mediated documentation is associated with reduced attentiveness to mental health symptoms—even as those symptoms are more thoroughly recorded—then the note may increasingly diverge from the clinical encounter. A richer record does not guarantee a more responsive physician.

For health system leaders and quality improvement teams, the findings suggest a need for deliberate countermeasures: EHR-embedded decision support tools that prompt psychiatric intervention when symptom documentation exceeds a threshold, structured check-ins or peer review targeting mental health care gaps in AI-scribed practices, and prospective surveillance of quality metrics across scribed and unscribed clinics.

Limitations

The study carries important limitations. All data originate from affiliated academic health systems in eastern Massachusetts—predominantly White, English-speaking, commercially insured populations—limiting generalizability. The observational design cannot establish causation, and residual confounding by clinician-level variables (e.g., personality, burnout level, AI acceptance) could not be fully controlled. The RDoC scoring methodology, while validated in prior work, may be sensitive to documentation style rather than true symptom severity, a concern reinforced by the fact that PHQ-9 scores were nearly identical across all four groups. Future research incorporating clinician-rated measures, structured patient-reported outcomes, and longitudinal clinical outcome data will be essential.

Conclusion

This study does not indict AI ambient scribes—it complicates them. The technology appears capable of producing richer, more symptom-comprehensive clinical narratives. But richness on paper does not translate automatically into action at the bedside. For the millions of primary care patients who present annually with unaddressed depression and anxiety, the gap between documentation and intervention is not an abstraction—it is a missed diagnosis, an untreated episode, a referral that never happened. As ambient AI scribes become the default documentation modality across American primary care, the imperative is clear: physicians and health systems must monitor not just what the note says, but what it prompts them to do.
