AI Ambient Scribes in Primary Care: A Documentation Paradox With Psychiatric Consequences
A cohort study published in JAMA Psychiatry finds that ambient AI scribes are associated with significantly richer neuropsychiatric documentation—yet paradoxically lower rates of depression-related clinical intervention. The findings raise fundamental questions about the relationship between AI-generated clinical notes and the quality of mental health care in primary care settings.
Background: AI Scribes and the Documentation Promise
Artificial intelligence–driven ambient scribes—tools that use speech recognition and large language models to automatically generate narrative clinical notes from recorded patient encounters—have achieved remarkably rapid adoption across health systems in the United States. Promoted primarily as solutions to clinician burnout and documentation burden, these tools have garnered widespread enthusiasm among physicians exhausted by the demands of the electronic health record.
To date, most investigations of ambient AI scribes have examined their effects on clinician-facing outcomes: time spent in the EHR, self-reported satisfaction, and documentation efficiency. Evidence in these domains has been mixed. One study suggested clinicians spent an average of 5 fewer minutes per visit on the EHR when using ambient scribes, while others yielded inconsistent results regarding productivity gains. Critically absent from the literature has been any systematic examination of whether AI-generated notes change how physicians actually practice medicine—particularly in domains as consequential as psychiatric care.
A new cohort study from Massachusetts General Hospital and Harvard Medical School, published in JAMA Psychiatry, addresses this gap directly. The findings are encouraging and alarming in equal measure.
Study Design: A Matched Four-Group Comparison
Castro, McCoy, Verhaak, Ramachandiran, and Perlis drew upon EHR data from two large academic health systems in eastern Massachusetts—Massachusetts General Hospital and Brigham and Women's Hospital—to examine 20,302 outpatient primary care annual visit notes. Notes were collected between February 2023 and February 2025, spanning the period of ambient AI scribe deployment across these systems.
The investigators used a matched retrospective case-control design, creating four parallel groups of approximately 5,075 visits each, matched on age, sex, self-reported race, and prior depression diagnosis: visits using an ambient AI scribe, visits using a human virtual scribe, contemporaneous unscribed visits occurring during the same period, and prior-year unscribed visits from before AI scribe deployment. Matching on clinician and visit-year cohort provided a robust framework for isolating scribe-related effects from practice drift or temporal confounding.
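The exact matching across the four arms can be sketched as coarsening each visit's covariates into a key and retaining only keys present in every arm. This is a minimal illustration, not the authors' code; the variable names and the age-banding scheme are assumptions.

```python
from collections import defaultdict

def match_key(visit):
    """Coarsen matching covariates into an exact-match key.
    Decade-wide age bands are an illustrative assumption."""
    return (visit["age"] // 10, visit["sex"], visit["race"],
            visit["prior_depression"])

def build_matched_groups(visits_by_arm):
    """Keep one visit per arm for every key present in all arms.
    visits_by_arm maps arm name -> list of visit dicts."""
    keyed = {arm: defaultdict(list) for arm in visits_by_arm}
    for arm, visits in visits_by_arm.items():
        for v in visits:
            keyed[arm][match_key(v)].append(v)
    # Only keys represented in every arm yield a matched set
    common = set.intersection(*(set(k) for k in keyed.values()))
    return {arm: [keyed[arm][k][0] for k in sorted(common)]
            for arm in visits_by_arm}
```

In practice, matched designs like this often use propensity scores or caliper matching rather than exact keys; the sketch shows only the grouping logic.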
To quantify psychiatric documentation, the investigators applied a HIPAA-compliant large language model (GPT-4o, hosted via Microsoft Azure) to each clinical narrative, generating estimated scores across all six National Institute of Mental Health Research Domain Criteria (RDoC) dimensions: negative valence, positive valence, cognitive systems, social processes, arousal and regulatory systems, and sensorimotor systems.
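The scoring pipeline can be imagined as prompting the model for a structured rating per RDoC domain and validating its reply. The study's actual prompt, rating scale, and output format are not described in this article, so everything below is an illustrative assumption.

```python
import json

RDOC_DOMAINS = [
    "negative_valence", "positive_valence", "cognitive_systems",
    "social_processes", "arousal_regulatory", "sensorimotor",
]

def build_prompt(note_text):
    """Illustrative prompt asking the model to rate each RDoC domain.
    The scale (0 = absent, higher = greater burden) is a hypothetical choice."""
    return (
        "Rate the symptom burden documented in this clinical note on each "
        "NIMH RDoC domain (0 = absent, higher = greater burden). "
        f"Reply with JSON keyed by: {', '.join(RDOC_DOMAINS)}.\n\n" + note_text
    )

def parse_scores(model_reply):
    """Validate the model's JSON reply into a complete per-domain score dict."""
    scores = json.loads(model_reply)
    missing = [d for d in RDOC_DOMAINS if d not in scores]
    if missing:
        raise ValueError(f"model reply missing domains: {missing}")
    return {d: float(scores[d]) for d in RDOC_DOMAINS}
```

Strict validation of the reply matters here: a silently missing domain would bias downstream group comparisons.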
Key Finding I: AI Scribes Dramatically Increase Documented Psychiatric Symptom Burden
Across all six RDoC domains, AI-scribed notes showed significantly higher symptom scores compared with every comparator group (P < .001 for all contrasts). In the negative valence domain—most directly relevant to depression—mean scores were 2.05 in AI-scribed notes versus 1.79 for human-scribed notes and 1.57 for contemporaneous unscribed notes. Arousal domain scores were 2.84 with AI scribes compared with 2.05 without a scribe. Sensorimotor scores were 2.33 versus 1.54 in the unscribed contemporaneous group.
AI-scribed notes were also substantially longer: a mean of 13,629 characters compared with 7,932 characters for contemporaneous unscribed notes and 7,489 characters for prior-year notes. Human-scribed notes were the longest at 16,252 characters.
The authors frame this finding as a potential opportunity for improving care:
"Our results are reassuring, suggesting that AI scribes in primary care have the potential to increase documentation of neuropsychiatric symptoms."
Key Finding II: More Documentation, Less Action
Despite greater documented psychiatric symptom burden, AI-scribed visits were significantly less likely to result in a psychiatric intervention. The composite outcome—defined as the presence of any depression-related ICD-10 code, new antidepressant prescription, or behavioral health referral—occurred in only 708 visits (14%) in the AI scribe group, compared with 843 (17%) in human-scribed visits, 855 (17%) in contemporaneous unscribed visits, and 805 (16%) in prior-year unscribed visits. All contrasts with the AI scribe group remained statistically significant after Bonferroni correction (P < .001).
In the multivariable logistic regression model adjusted for age, sex, race, ethnicity, insurance, education, and prior depression diagnosis, the adjusted odds ratio for any psychiatric intervention at AI-scribed versus contemporaneous unscribed visits was 0.83 (95% CI, 0.72–0.95). By contrast, no significant difference was observed between human-scribed and unscribed visits (aOR, 0.97; 95% CI, 0.85–1.11).
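As a sanity check on the reported effect size, the unadjusted odds ratio can be recomputed from the raw counts above; it differs from the adjusted value because the regression conditions on demographic and clinical covariates.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Unadjusted odds ratio of group A relative to group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b

# Counts reported in the study: AI-scribed (708/5,075) vs
# contemporaneous unscribed (855/5,075) visits
or_unadjusted = odds_ratio(708, 5075, 855, 5075)
print(round(or_unadjusted, 2))  # prints 0.8, consistent with the adjusted OR of 0.83
```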
Depression-related ICD-10 codes were assigned at only 9% of AI-scribed visits, compared with 12% of human-scribed and unscribed contemporaneous visits. New antidepressant prescriptions were initiated at 1% of AI-scribed visits versus 2% of prior-year unscribed visits.
The authors articulate the central tension in these findings plainly:
"In this study examining clinical documentation from more than 20,000 outpatient annual visits, including roughly 5,000 incorporating AI scribes, we found that use of these scribes was associated with greater documented levels of neuropsychiatric symptoms compared with the use of human scribes or no scribe but lesser likelihood of a depression intervention."
Mechanistic Hypotheses: The Autopilot Analogy
The authors propose a compelling—and unsettling—mechanistic hypothesis for this dissociation between documentation richness and clinical responsiveness. Drawing an analogy to aviation, they suggest that automating documentation may paradoxically reduce the cognitive engagement of the clinician:
"One explanation for this association could be that automating documentation leads clinicians to be less active in general, analogous to reduced proficiency observed in pilots after the emergence of autopilot."
This hypothesis implies that the act of documenting—when performed manually—may itself reinforce clinical attention and prompt therapeutic decision-making. When that cognitive labor is offloaded to an AI, the loop between observation and action may be disrupted, even as the note itself becomes more thorough. The effect was specific to AI scribing: human scribes showed no analogous reduction in psychiatric intervention rates, suggesting that the mechanism may relate to the nature of automated versus active documentation.
Implications for Practice and Health System Leaders
The study's findings demand careful consideration by any physician or health system administrator who has deployed or is considering deploying ambient AI scribes. The authors are deliberate in their call for further investigation:
"The rapid dissemination of AI scribes in medicine poses both an opportunity and a risk... many interventions in medicine have been adopted without clear evidence of benefit—particularly those, like scribes, that do not require formal regulatory review to establish effectiveness."
For primary care physicians, the practical implications are immediate. If AI-mediated documentation is associated with reduced attentiveness to mental health symptoms—even as those symptoms are more thoroughly recorded—then the note may increasingly diverge from the clinical encounter. A richer record does not guarantee a more responsive physician.
For health system leaders and quality improvement teams, the findings suggest a need for deliberate countermeasures: EHR-embedded decision support tools that prompt psychiatric intervention when symptom documentation exceeds a threshold, structured check-ins or peer review targeting mental health care gaps in AI-scribed practices, and prospective surveillance of quality metrics across scribed and unscribed clinics.
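A threshold-based prompt of the kind described above could be sketched as a simple rule over per-visit RDoC scores and recorded actions. The threshold value and record field names here are hypothetical illustrations, not part of the study.

```python
def flag_documentation_action_gap(rdoc_scores, interventions, threshold=2.0):
    """Flag a visit where documented negative-valence burden meets a
    threshold but no depression-related action was recorded.
    Threshold and field names are hypothetical."""
    high_burden = rdoc_scores.get("negative_valence", 0.0) >= threshold
    acted = any(interventions.get(k) for k in
                ("depression_icd10", "new_antidepressant", "bh_referral"))
    return high_burden and not acted
```

Any deployed version would need a clinically validated threshold and evaluation against chart review, since the RDoC scores may track documentation style as much as symptom severity.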
Limitations
The study carries important limitations. All data originate from affiliated academic health systems in eastern Massachusetts—predominantly White, English-speaking, commercially insured populations—limiting generalizability. The observational design cannot establish causation, and residual confounding by clinician-level variables (e.g., personality, burnout level, AI acceptance) could not be fully controlled. The RDoC scoring methodology, while validated in prior work, may be sensitive to documentation style rather than true symptom severity, a concern reinforced by the fact that PHQ-9 scores were nearly identical across all four groups. Future research incorporating clinician-rated measures, structured patient-reported outcomes, and longitudinal clinical outcome data will be essential.
Conclusion
This study does not indict AI ambient scribes—it complicates them. The technology appears capable of producing richer, more symptom-comprehensive clinical narratives. But richness on paper does not translate automatically into action at the bedside. For the millions of primary care patients who present annually with unaddressed depression and anxiety, the gap between documentation and intervention is not an abstraction—it is a missed diagnosis, an untreated episode, a referral that never happened. As ambient AI scribes become the default documentation modality across American primary care, the imperative is clear: physicians and health systems must monitor not just what the note says, but what it prompts them to do.
