AI-Generated Discharge Summaries Reduce Physician Burnout Without Compromising Safety: Evidence From a Prospective Stanford Pilot
Background and Clinical Context
The hospital discharge summary occupies a peculiar and burdensome position in the modern clinical workflow. It is simultaneously essential to safe care transitions — associated with reduced medication errors and lower readmission rates — and among the most time-consuming documentation tasks a hospitalist physician performs. It is a known driver of "pajama time," work completed outside scheduled hours, and a significant contributor to clinician burnout. Despite broad awareness of this problem, scalable solutions have remained elusive.
Since 2023, large language models (LLMs) have attracted considerable interest as a mechanism to automate the synthesis of clinical text. Retrospective analyses have shown promise, but until now, prospective clinical evidence evaluating safety, real-world adoption, and physician well-being has been conspicuously absent. A team of researchers from Stanford University sought to fill that gap.
Study Design and Intervention
This single-arm prospective pilot quality improvement study was conducted at a Stanford Health Care inpatient medicine unit in Redwood City, California, from August 1 to October 11, 2025, with a pre-pilot baseline period extending from April 9 to July 31, 2025. All 11 attending hospitalist physicians staffing the unit during the intervention period were enrolled.
The intervention, MedAgentBrief, was a custom agentic workflow powered by Gemini 2.5 Pro that generated draft hospital course summaries nightly from history and physical (H&P) documentation and daily progress notes. Each morning, summaries were delivered via secure email to treating physicians in an interactive HTML format with inline source citations, enabling direct copy-and-paste into Epic. Crucially, use was entirely voluntary.
The system's architecture distinguished it from simpler approaches. Rather than submitting all clinical notes to the model in a single query, the pipeline employed a three-stage iterative process: draft generation, refinement, and explicit hallucination-reduction steps. As the authors note, this architecture differs from single-pass approaches by decomposing the summarization task into discrete stages, enabling processing of arbitrarily long hospitalizations while maintaining grounding in source documents at each step.
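To make the staged design concrete, the following is a minimal sketch of a draft, refine, and verify loop. It is not the authors' implementation: the function names, prompt wording, and the `LLM` callable type are all hypothetical, and the model call is left as a plain function so that any client could be plugged in. The point it illustrates is that each call sees only the running summary plus one source note, so prompt size stays bounded however long the hospitalization is.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def draft(llm: LLM, notes: List[str]) -> str:
    """Stage 1: fold notes into a running draft, one note per call."""
    summary = ""
    for i, note in enumerate(notes, start=1):
        summary = llm(
            "TASK: draft\n"
            f"CURRENT SUMMARY:\n{summary}\n"
            f"NEW NOTE [{i}]:\n{note}\n"
            "Update the summary and cite the note number for each statement."
        )
    return summary


def refine(llm: LLM, summary: str) -> str:
    """Stage 2: tighten the draft without touching its citations."""
    return llm(
        "TASK: refine\n"
        f"SUMMARY:\n{summary}\n"
        "Improve clarity and concision; keep every citation."
    )


def verify(llm: LLM, summary: str, notes: List[str]) -> str:
    """Stage 3: re-check the summary against each source note in turn."""
    for i, note in enumerate(notes, start=1):
        summary = llm(
            "TASK: verify\n"
            f"SUMMARY:\n{summary}\n"
            f"SOURCE NOTE [{i}]:\n{note}\n"
            f"Remove or correct any statement citing note {i} "
            "that the note does not support."
        )
    return summary


def summarize(llm: LLM, notes: List[str]) -> str:
    """Run the three stages in order and return the final summary."""
    return verify(llm, refine(llm, draft(llm, notes)), notes)
```

In this sketch a two-note stay produces five model calls (two draft, one refine, two verify). A single-pass design would instead make one call whose prompt grows with the length of the stay.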
Key Findings: Safety and Error Profile
Over the 10-week pilot, the system generated 1,274 summaries across 384 discharges for 331 unique patients. Physicians used AI content in 219 cases (57.0%). Among formal safety evaluations of 100 summaries, the results were reassuring: physicians rated 88 unedited summaries (88.0%) as having no harm potential and 1 (1.0%) as likely to cause moderate harm; no severe harm was reported.
The error profile was informative. Omissions were the most frequent issue, noted in 25% of reviewed summaries, followed by inaccuracies in 20%. Hallucinations — fabricated content not present in the patient's EHR — occurred in just 2% of summaries, a rate the authors compare favorably against rates exceeding 40% reported in studies of single-pass LLM clinical text generation. No incorrect citations were reported.
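Because these rates come from a small review sample, it can help to see how much uncertainty surrounds them. The sketch below computes a Wilson score interval for a proportion. It assumes, as the text suggests but does not state outright, that the error percentages share the 100 formally evaluated summaries as their denominator, so that 2% corresponds to 2 of 100. The function name is illustrative and this is not an analysis the authors report.

```python
import math
from typing import Tuple


def wilson_interval(events: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = events / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed counts: percentages from the text applied to n = 100 reviews.
    for label, events in [("hallucinations", 2), ("inaccuracies", 20), ("omissions", 25)]:
        low, high = wilson_interval(events, 100)
        print(f"{label}: {events}/100, 95% CI {low:.1%} to {high:.1%}")
```

Under that assumption, the interval for hallucinations runs from roughly 0.6% to 7.0%. That is still well below the 40% figure cited for single-pass generation, but wide enough that the point estimate should be read as approximate.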
The authors offer an important conceptual distinction between these error types. Hallucinations and inaccuracies represent verification failures — generated text that contradicts source data — and are increasingly addressable through retrieval-augmented generation and external verification tools. Omissions, by contrast, reflect a more persistent challenge: a value alignment problem in which the model fails to recognize clinically important information even when it is present in the source text. Unlike hallucinations, alignment-driven omissions cannot be solved by simply checking against the source text; they require the AI to reliably predict what human experts value, which involves implicit preferences that are difficult to formalize.
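The distinction can be illustrated with a deliberately naive grounding check. The sketch below flags summary sentences whose words are mostly absent from the source. This is a toy lexical heuristic, not the verification method used in the study, and the note text is invented purely for illustration. It catches a fabricated procedure but has nothing to say about a summary that leaves out an important event, because every sentence in that summary is still supported by the source.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(summary: str, source: str, min_overlap: float = 0.8) -> List[str]:
    """Return summary sentences with too little word overlap with the source."""
    source_words = tokens(source)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = tokens(sentence)
        if words and len(words & source_words) / len(words) < min_overlap:
            flagged.append(sentence)
    return flagged


# Invented example text, not drawn from the study.
SOURCE = (
    "Patient admitted with pneumonia. Started ceftriaxone. "
    "Developed acute kidney injury on day 2. Creatinine improved by discharge."
)
HALLUCINATED = "Patient admitted with pneumonia. Underwent cardiac catheterization."
OMITTING = "Patient admitted with pneumonia. Started ceftriaxone."

if __name__ == "__main__":
    print(unsupported_sentences(HALLUCINATED, SOURCE))  # flags the fabricated procedure
    print(unsupported_sentences(OMITTING, SOURCE))      # flags nothing, yet the kidney injury is missing
```

Detecting the second failure would require a separate judgment about which source facts must appear in the summary, which is the preference-modeling problem the authors describe.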
Burnout and Cognitive Burden: The Headline Finding
The most clinically significant finding of this study was the measurable reduction in physician burnout. Using the Stanford Professional Fulfillment Index (PFI) Work Exhaustion Scale — scored from 0 to 4, with higher scores indicating greater burnout — mean physician burnout scores decreased significantly from before to after the intervention (1.75; 95% CI, 1.16–2.34 vs 1.20; 95% CI, 0.71–1.69; P = .03). This shift is not merely statistically significant — it is clinically meaningful. The group mean moved from above to below the established PFI burnout threshold of 1.33, with improvements across all four Work Exhaustion subscales: sense of dread (−0.80), lack of enthusiasm (−0.60), physical exhaustion (−0.50), and emotional exhaustion (−0.30).
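The reported figures can be checked for internal consistency with a few lines of arithmetic. The sketch below assumes the Work Exhaustion score is the unweighted mean of its four items, which is how the numbers in the text behave but is stated here as an assumption. The constant and function names are illustrative.

```python
import math
from typing import Dict

BURNOUT_THRESHOLD = 1.33          # PFI burnout threshold cited in the text
PRE_MEAN, POST_MEAN = 1.75, 1.20  # group means before and after the pilot

# Per-item changes reported in the text.
ITEM_CHANGES: Dict[str, float] = {
    "sense of dread": -0.80,
    "lack of enthusiasm": -0.60,
    "physical exhaustion": -0.50,
    "emotional exhaustion": -0.30,
}


def overall_change(pre: float, post: float) -> float:
    """Change in the group mean score (negative means improvement)."""
    return post - pre


def mean_item_change(changes: Dict[str, float]) -> float:
    """Unweighted mean of the per-item changes (assumed scoring rule)."""
    return sum(changes.values()) / len(changes)


def crossed_below(pre: float, post: float, threshold: float) -> bool:
    """True if the mean moved from above the threshold to below it."""
    return pre > threshold > post


if __name__ == "__main__":
    print(f"overall change:   {overall_change(PRE_MEAN, POST_MEAN):+.2f}")
    print(f"mean item change: {mean_item_change(ITEM_CHANGES):+.2f}")
    print(f"crossed threshold: {crossed_below(PRE_MEAN, POST_MEAN, BURNOUT_THRESHOLD)}")
```

The four item changes average to the same 0.55-point drop as the difference between the two group means, so the subscale figures and the headline result agree.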
Cognitive burden, measured by the NASA Task Load Index (score range 0–100), trended downward from a mean of 57.5 to 52.3 but did not reach statistical significance (P = .30). Individual responses were heterogeneous; two of ten physicians reported increases in burnout scores, and three reported increases in cognitive load — a reminder that population-level benefits may not be uniformly distributed.
The Efficiency Paradox: Cognitive Offloading Over Clock Time
A compelling secondary finding concerns the divergence between subjective and objective efficiency. Among the seven physicians with matched baseline data, five (71.4%) demonstrated reductions in median documentation time, with changes of up to 2.9 minutes — a difference that did not reach statistical significance. Yet in subjective feedback surveys, 67% of responses indicated perceived time savings, with nearly one-third estimating savings exceeding 15 minutes per summary.
The authors interpret this through the lens of cognitive offloading: the AI serves as a scaffolding tool, providing a structured starting point that physicians review and refine rather than generate de novo. This shifts the value proposition from efficiency to sustainability, explaining why burnout improved when clock time did not.
This interpretation aligns with emerging evidence from ambient AI scribes and LLM-generated draft message replies, which have similarly demonstrated well-being improvements without proportionate time savings. The 57% voluntary use rate observed in this pilot compares favorably to adoption rates of 20% for AI draft replies and 30–38% for ambient scribes in comparable pilot settings, suggesting that AI tools delivering usable, workflow-integrated content may achieve meaningful organic uptake.
Implications for Clinical Practice and Health Systems
This study provides what prior retrospective evaluations could not: prospective, in-workflow evidence that an agentic LLM system can be deployed in active clinical care with a manageable safety profile and a measurable benefit to physician well-being. For hospital medicine leaders and clinical informaticists, several practical implications follow.
First, the agentic architecture — iterative, document-grounded, citation-linked — appears to substantially mitigate hallucination risk relative to simpler LLM approaches. Health systems evaluating AI documentation vendors should ask hard questions about how generation pipelines are structured and how outputs are verified against source data.
Second, the omission problem demands ongoing attention and cannot be resolved through technical means alone. Future iterations will require mechanisms to capture physician preference data at scale, enabling models to learn what constitutes clinically important information across varying patient populations and care contexts. The authors conclude that addressing these shortcomings will require incorporating structured EHR data, adjusting generation timing, and developing scalable methods to align model outputs with physician judgment.
Third, health system leaders should be attentive to the risk of negating well-being gains through productivity recalibration. If AI-derived time savings are repurposed into demands for higher patient throughput, the burnout benefits observed here could rapidly erode.
Limitations and Generalizability
The study was conducted at a single academic inpatient unit staffed exclusively by attending hospitalists, limiting generalizability to settings where residents draft discharge summaries. The sample of 11 physicians, while sufficient for this pilot, constrains statistical power for subgroup analyses. The 40.2% feedback response rate, while high for a prospective clinical pilot, introduces potential selection bias. A contemporaneous control group was not employed, and systematic error rates in human-authored discharge summaries were not assessed for comparison.
Conclusion
This Stanford pilot represents a meaningful step forward in the evidence base for clinical AI documentation tools. In a field too often characterized by retrospective validation and vendor-driven implementation, the prospective, safety-monitored, burnout-sensitive design of this study offers a model for responsible deployment. The finding that physician burnout crossed below a validated clinical threshold — in just 10 weeks, with voluntary adoption, and without compromising patient safety — warrants serious attention from every physician leader managing a hospitalist program.
