AI Hallucination in Clinical Documentation: What Professionals Need to Know

AI tools are fabricating clinical content in real-world documentation. Learn what hallucination is, why it happens, what incidents have been reported, and how to evaluate AI tools so they do not put your license at risk.

The Problem No One Is Talking About Enough

There is a scenario playing out in clinical practices across the country that rarely makes it into professional development trainings, licensing board newsletters, or vendor marketing materials. A therapist finishes a session, opens their AI documentation tool, and reviews the generated note. The note is polished, well-structured, and reads exactly like their other notes. They sign it.

What they may not have caught: a sentence describing a trauma history the client never disclosed. A severity rating for symptoms the client did not report. A clinical judgment call they never made.

This is AI hallucination, and it is happening in clinical documentation right now.

The goal of this article is not to scare professionals away from using AI tools. AI-assisted documentation is a legitimate efficiency gain that can meaningfully reduce the 30-60 minutes per day many clinicians spend writing notes. But using these tools without understanding their failure modes is a real professional liability. Your signature on a note is your endorsement of its accuracy, regardless of how it was generated.

What Is AI Hallucination?

Hallucination is the term the AI research community uses to describe situations where a generative AI model produces content that is factually incorrect, invented, or unsupported by its inputs — stated with the same confidence as accurate content.

The term sounds dramatic, but the mechanism is mundane. Large language models work by predicting the most statistically likely next token (roughly, a word or word fragment) given what came before. They do not "know" facts the way a database does. They generate plausible-sounding sequences. Most of the time, this produces accurate, coherent output. Sometimes it produces confident fiction.
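To make that concrete, here is a deliberately toy sketch of next-token prediction, a few lines of Python that bear no resemblance to a production model: it counts which word tends to follow which in a tiny invented corpus and always emits the most frequent continuation.

```python
from collections import Counter, defaultdict

# A toy "language model": count which word follows each word in a tiny corpus,
# then always emit the statistically most likely continuation.
corpus = (
    "client reported difficulty sleeping . "
    "client reported work stress . "
    "client reported difficulty concentrating . "
    "client denied suicidal ideation ."
).split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 4) -> str:
    word, out = start, [start]
    for _ in range(length):
        if word not in follows:
            break
        # Pick the most frequent continuation: plausible, not necessarily true.
        word = follows[word].most_common(1)[0][0]
        out.append(word)
    return " ".join(out)

# The "model" completes a note fragment from statistics alone; nothing in the
# mechanism checks whether the client actually said any of this today.
print(generate("client"))  # e.g. "client reported difficulty sleeping ."
```

The point is not the code but the behavior: the output reads fluently, and nothing in the mechanism distinguishes what was said from what usually gets said.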

For most applications, this is an annoyance. If an AI writes a marketing email with a made-up statistic, you catch it in review. The cost is minor.

In clinical documentation, the cost is not minor.

Documented Incidents in Clinical Settings

The following are reported cases and patterns where AI tools generated clinical content that did not accurately reflect what occurred in sessions.

Fabricated Abuse History

One of the most widely reported incidents involved Upheal, a therapy documentation platform, producing notes that contained references to a client's history of abuse that the client had never disclosed. This was reported by The New York Times in the context of broader concerns about AI accuracy in mental health settings. The clinician in the incident did not catch the fabrication before signing, meaning the client's official clinical record contained false information about their trauma history.

Consider the downstream consequences: that record could follow the client to a new provider, influence a diagnosis, affect their treatment plan, show up in an insurance review, or surface in a custody or legal proceeding. A single hallucinated sentence in a progress note is not a minor error.

Invented Symptoms

Several clinicians have reported instances where AI-generated notes described symptoms the client did not report during the session. A common pattern involves the AI inferring likely symptom presentation based on diagnosis and context, then documenting those inferred symptoms as observed or reported by the client.

For example: a client with a diagnosis of generalized anxiety disorder comes in for a session and discusses work stress. The AI, drawing on its training data about GAD, generates a note that includes "client reported difficulty sleeping and physical tension" — even if the client said nothing about sleep or physical symptoms in that session.

This is not a documentation error. It is a fabrication. And because the language sounds exactly like legitimate clinical language, it is easy to miss in a cursory review.

Exaggerated Severity

A related pattern involves AI tools that over-represent clinical severity in documentation. Notes describe distress as more acute than the clinician observed, risk as higher than assessed, or impairment as more significant than the client reported. This appears to occur because high-severity language is statistically common in training data (clinicians write more detailed notes when things are serious), so the model associates clinical documentation with severity markers.

The problem: inflated severity in notes can trigger unnecessary clinical interventions, affect insurance authorizations, and create a documented record that does not match the actual clinical picture.

Misattributed Statements

Several clinicians have reported notes where direct quotes were either invented outright or assigned to the wrong party. A statement the therapist made gets attributed to the client. Something the client said gets paraphrased into a clinical conclusion the clinician never drew. This is especially common in longer or more complex sessions where the AI must process a large amount of input and maintain attribution across many turns of dialogue.

Why Generative AI Hallucinates

Understanding why hallucination happens helps clinicians evaluate tools more critically.

The Model Is Predicting, Not Recalling

Language models do not have access to "what happened." They have input (your session notes, a transcript, a few bullet points) and training data (vast amounts of text about clinical practice, therapy sessions, medical notes, and human behavior). When the input is ambiguous or thin, the model fills gaps using what it has seen before.

If you give an AI tool three sparse bullet points from a session, it will produce a full note. Most of what is in that note beyond your three points came from the model's statistical sense of what a note about a session like this usually contains. Sometimes that inference is correct. Sometimes it is wrong. The model has no way to know the difference.

Clinical Notes Are High-Pattern Content

Therapy notes, SOAP notes, and progress notes follow predictable patterns. They use consistent terminology, repeat structural elements, and contain standard phrases. This high-pattern nature makes them a good target for generative AI in one sense (the structure is learnable), but it also means the model has strong priors about what a note "should" say. Those priors can override the actual content of a sparse or ambiguous input.

Confidence Is Not Accuracy

One of the most dangerous properties of large language models is that their output reads with consistent confidence regardless of accuracy. A hallucinated sentence about a client's abuse history looks exactly like an accurately documented sentence about a client's abuse history. There are no uncertainty markers, no hedging, no footnotes. The model does not know what it does not know, and it does not signal doubt.

This is fundamentally different from human documentation errors. When a clinician is uncertain, they typically say so: "Client appeared to be..." or "It is unclear whether..." AI-generated hallucinations do not hedge.

Why Clinical Documentation Is Especially High-Stakes

For most industries, the primary risk of AI hallucination is embarrassment or wasted time. In clinical practice, the risks are qualitatively different.

Licensing liability. Your signature on a note is a professional attestation of its accuracy. Signing a note that contains fabricated clinical content, even if you did not write it, is a documentation compliance issue with potential licensing consequences. The tool does not hold the license. You do.

Legal exposure. Progress notes are legal documents. They can be subpoenaed in litigation, custody disputes, disability claims, and criminal proceedings. A note that contains fabricated content about a client's history or mental state is a false document with your signature on it.

Continuity of care. When a client transfers to a new provider, that provider reads your notes. If your notes contain hallucinated content, the new clinician may make clinical decisions based on history that does not exist, symptoms that were not present, or severity that was not accurate.

Client harm. In serious cases, hallucinated documentation can directly harm clients. False records of suicidal ideation could lead to unnecessary hospitalization. Fabricated disclosures could trigger mandatory reporting obligations based on events that did not occur. Inflated severity could affect insurance coverage or treatment authorization.

What to Look For When Evaluating AI Documentation Tools

If you are evaluating or currently using an AI documentation tool, these are the questions that should guide your assessment.

What does the tool use as input?

The quality of AI output is constrained by the quality and specificity of input. A tool that generates notes from a 30-second recording or a handful of bullet points is producing a much higher proportion of inference and fill-in content than a tool that works from detailed structured input.

Ask: where does this note come from? What did I actually provide, and what did the AI infer?

Does the tool use open-ended generation or structured fill?

There are two fundamentally different architectures for AI documentation tools.

Open-ended generation uses a language model to write a full note from scratch, based on some input. The model decides what to include, what level of detail to add, and what clinical language to use. Hallucination risk is highest with this architecture.

Structured fill (sometimes called template-first or placeholder-fill) uses a predefined template with specific fields. The AI fills only those fields, using only what was provided as input. There is no invitation for the model to invent supplementary content, because the output shape is fixed.

The template-first approach does not eliminate AI errors, but it structurally constrains where errors can appear. If your template has a field for "risk assessment," the AI fills that field with what you provided about risk, not with a synthesized risk statement it constructed from patterns. If a field is empty because you did not provide data for it, it stays empty.
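As a rough sketch of the difference (the field names and prompt wording here are illustrative, not any vendor's actual implementation), the two architectures ask the model very different questions:

```python
# Illustrative only: field names and prompt wording are hypothetical, and the
# model call itself is left out. The point is the shape of the task.

def open_ended_prompt(bullets: list[str]) -> str:
    # The model decides scope, detail, and clinical language -- the widest
    # opening for invented content.
    return "Write a complete therapy progress note based on:\n" + "\n".join(bullets)

def structured_fill_prompts(bullets: list[str], fields: list[str]) -> dict[str, str]:
    # One narrow instruction per template field; the instruction tells the
    # model to leave a field blank when the input says nothing about it.
    source = "\n".join(bullets)
    return {
        field: (
            f"Fill only the '{field}' field of the note, using only the input below. "
            f"If the input does not address '{field}', return an empty string.\n{source}"
        )
        for field in fields
    }

bullets = ["Discussed work stress", "Practiced a breathing exercise", "No risk indicators raised"]
print(open_ended_prompt(bullets))
print(structured_fill_prompts(bullets, ["subjective", "interventions", "risk"])["risk"])
```

In the open-ended case the model is invited to compose a narrative; in the structured case each field gets a narrow instruction, and input that says nothing about a field is meant to produce an empty field rather than a synthesized one.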

What is the review and editing workflow?

Any responsible AI documentation tool should require you to review and explicitly confirm the generated note before signing. Tools that make signing too easy, obscure the difference between your input and the generated output, or make it cumbersome to edit the generated content are increasing your hallucination exposure.

The question to ask: before I sign this note, is it obvious to me what the AI wrote versus what I provided?

Does the tool have a BAA?

For any tool handling Protected Health Information (PHI), a Business Associate Agreement is required under HIPAA. This is separate from the hallucination question, but it matters for the same underlying reason: your legal and ethical obligations do not transfer to the vendor. If the vendor does not offer a BAA, the tool is not compliant for clinical use, regardless of its other features.

Is there a clear audit trail?

Can you see, in the documentation system, what input you provided and what the AI generated? This matters for two reasons: it helps you catch hallucinations during review, and it protects you in the event your documentation practices are scrutinized.

The Template-First Approach: A Structural Safeguard

One of the most effective architectural choices against AI hallucination in documentation is requiring that note structure and content boundaries be defined before generation begins.

The logic is straightforward. If the AI knows that a note has exactly these fields, and each field should be filled only from what the clinician provided, the surface area for hallucination shrinks dramatically. The model is not being asked to construct a narrative. It is being asked to fill placeholders. Those are very different tasks.

A template-first system does not mean rigid, generic notes. Templates can be as customized as the clinician wants: specific sections for their discipline, specific language for their documentation style, specific fields for the elements they always capture. The customization happens at the template level. The AI's job is to populate the template from the clinician's actual session notes, not to compose freely.

This is the approach NotuDocs takes. The positioning is explicit: "Your notes, your template. AI just fills the blanks." The template defines what the note will contain. The clinician's notes provide the content. The AI fills the placeholders. There is no room for the AI to decide that a note about anxiety "should" also mention sleep disturbances, because the template does not have a field for that, and the session notes do not mention it.

Does this mean NotuDocs cannot hallucinate? No AI tool can make that claim. But the architecture makes hallucination harder by constraining the generation task. Instead of "write a clinical note about this session," the model receives "fill this specific field with information from these notes." The scope is narrower. The opportunities for invention are fewer.

Practical Steps for Clinicians Using Any AI Tool

Regardless of which tool you use, these practices reduce your hallucination risk.

Treat every AI-generated note as a first draft, not a final product. The time saving from AI documentation comes from reducing the time you spend writing from scratch, not from eliminating your review obligation. A quick read-through of a well-structured draft is significantly faster than writing from scratch and still catches most errors.

Know what you actually said in the session. This sounds obvious, but it requires that you have some form of contemporaneous capture: session anchors, bullet points, or structured shorthand jotted during or immediately after the session. If you cannot remember whether you assessed risk in the session, you cannot catch a note that claims you did.

Audit your notes periodically. Pull five random notes from the past month and compare them against your session memory or any notes you took. Look for content that you do not specifically remember documenting. Look for language that does not sound like your observations. If you find patterns, investigate your documentation workflow.

Read the clinical content, not just the structure. Hallucinated content often appears in the substance of the note, not in the formatting. A note that has all the right sections in the right order can still contain fabricated clinical material within those sections. Read the words, not just the template.

Document your input. Keep a record of what you provided to the AI tool: the bullet points, the session summary, the structured input. This protects you if a note is ever questioned, because you can demonstrate the gap (or absence of gap) between your input and the output.
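If you keep your input alongside the generated note, even a crude comparison can support the audit habit described above. The sketch below is a heuristic, not a hallucination detector, and the stop-word list and threshold are invented for illustration: it flags sentences in a draft note that share few content words with what you actually provided.

```python
import re

# A rough self-audit heuristic: flag sentences in a generated note that share
# almost no content words with the input you actually provided. Illustrative
# only; the stop-word list and threshold are arbitrary choices.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "with", "for",
             "client", "session", "reported", "was", "were", "is", "are"}

def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def flag_unsupported(note: str, my_input: str, threshold: float = 0.3) -> list[str]:
    provided = content_words(my_input)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        words = content_words(sentence)
        if not words:
            continue
        # Fraction of this sentence's content words that appear in my input.
        overlap = len(words & provided) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

my_bullets = "Work stress. Practiced breathing exercise. No risk indicators."
draft_note = ("Client reported significant work stress. "
              "Client reported difficulty sleeping and physical tension. "
              "Practiced a breathing exercise in session.")

for sentence in flag_unsupported(draft_note, my_bullets):
    print("REVIEW:", sentence)
```

A flagged sentence is not proof of fabrication; it is a prompt to read that sentence against your memory of the session before you sign.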

A Note on Professional Responsibility

The adoption of AI in clinical documentation is moving faster than the regulatory and professional frameworks designed to govern it. Licensing boards are beginning to address it, professional associations have started publishing guidance, and malpractice carriers are updating their policies. But the landscape is still forming.

In the meantime, the most important principle to carry into any AI-assisted documentation practice is this: AI is a tool, and you are the professional. The tool can make you faster. It can reduce the cognitive overhead of converting session observations into formal documentation. It can help you write more consistent notes with less effort.

It cannot assess your client. It cannot exercise clinical judgment. And it cannot take responsibility for what appears in the record under your signature.

Understanding AI hallucination is not about being skeptical of technology. It is about using technology in a way that protects your clients, your professional obligations, and your license. That means choosing tools with responsible architectures, reviewing everything before signing, and staying informed as the standards in this space develop.


