AI Risk

Confabulation and Hallucination Risk: What NIST AI 600-1 Says and How to Test for It

April 23, 2026 · Rebecca Leung

TL;DR:

  • NIST AI 600-1 identifies confabulation as one of its 12 primary generative AI risk categories — not a footnote, but a first-class compliance risk with specific testing requirements.
  • The framework requires pre-deployment TEVV covering confabulation rates across domain-specific tasks, explicit go/no-go thresholds, and continuous post-deployment monitoring.
  • Compliance teams typically treat hallucination as a product quality problem. Examiners will treat it as a model risk governance failure if you can’t show your controls.
  • This post covers exactly what NIST AI 600-1 requires, which testing techniques satisfy those requirements, and how to document the results for an AI governance review.

Most compliance teams think about LLM hallucination the same way product teams do — a quality problem to minimize, an annoying failure mode that occasionally surfaces wrong information. That framing misses what’s at stake in regulated environments. When your LLM writes a compliance advisory with a fabricated regulatory citation, when your customer chatbot invents a fee structure that doesn’t exist, when your contract review tool confidently cites a clause that’s absent from the document — you don’t have a quality problem. You have a model risk governance failure.

NIST AI 600-1, the Generative AI Profile released in July 2024, treats confabulation exactly that way. It is one of 12 formally identified GenAI risk categories, with specific actions mapped across all four NIST AI RMF functions — GOVERN, MAP, MEASURE, and MANAGE. If you're deploying GenAI in financial services and can't point to a confabulation testing program, you're missing a documented NIST expectation.

Here’s what the framework actually says and how to build the program around it.


What NIST AI 600-1 Means by Confabulation

The formal definition matters: NIST AI 600-1 defines confabulation as “the production of confidently stated but erroneous or false content.” The document goes further, specifying that confabulations also include “generated outputs that diverge from the prompts or other input or that contradict previously generated statements in the same context.”

That’s a broader definition than most teams are working with. It captures:

  • Factual fabrication — invented facts, statistics, citations, entities
  • Prompt divergence — the model answers a different question than was asked
  • Internal contradiction — responses that conflict with what the model said moments earlier
  • Attribution failure — the model cites a real source but misrepresents what it says

The root cause, per NIST, is architectural: LLMs generate outputs that approximate the statistical distribution of their training data. They are fundamentally optimized to produce plausible text, not verified facts. Confabulations are what happens when statistical plausibility and factual accuracy diverge — which in financial services happens constantly, because regulatory specifics, client data, and recent enforcement actions are poorly represented in most training corpora.

NIST specifically calls out consequential decision-making domains: “Risks of confabulated content may be especially important to monitor when integrating GAI into applications involving consequential decision making.” Credit, compliance, risk assessment, customer interaction — all fall squarely in that category.


The 12 Risk Categories and Where Confabulation Sits

NIST AI 600-1 identifies 12 primary risk categories for generative AI systems:

| # | Risk Category | Financial Services Relevance |
| --- | --- | --- |
| 1 | CBRN Information or Capabilities | Low — unless dual-use research |
| 2 | Confabulation | High — all customer-facing and analytical GenAI |
| 3 | Dangerous, Violent, or Hateful Content | Low-medium |
| 4 | Data Privacy | High — training data, PII in prompts |
| 5 | Environmental Impacts | Low for compliance teams |
| 6 | Harmful Bias or Homogenization | High — fair lending, UDAAP exposure |
| 7 | Human-AI Configuration | High — over-reliance risk |
| 8 | Information Integrity | High — regulatory filings, reports |
| 9 | Information Security | High — prompt injection, data leakage |
| 10 | Intellectual Property | Medium |
| 11 | Obscene, Degrading, Abusive Content | Low-medium |
| 12 | Value Chain and Component Integration | High — third-party model dependencies |

For most financial services AI use cases — compliance chatbots, contract review, regulatory analysis, adverse action explanations — confabulation is a Tier 1 risk. So is information integrity (whether generated content is factually grounded). These two are related and often require overlapping controls.


What NIST AI 600-1 Requires: Controls by Function

The framework organizes confabulation controls across all four NIST AI RMF functions. Most teams implement MEASURE (testing) but skip GOVERN (governance structure), MAP (risk framing), and MANAGE (ongoing treatment). That’s a coverage gap.

GOVERN: Policy and accountability structure

Before any testing happens, your governance framework needs to establish:

  • A classification for GenAI use cases that identifies confabulation as a risk category requiring pre-deployment evaluation
  • Clear ownership: who is responsible for confabulation testing, who reviews results, who has authority to block deployment
  • A policy or standard that defines acceptable confabulation thresholds for different deployment contexts (a regulatory research assistant has different tolerances than a customer-facing chatbot giving compliance advice)

The GOVERN function requires that these structures exist before models reach MEASURE. An organization that runs confabulation tests but has no policy defining what acceptable results look like has checked a procedural box without the governance substance.
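To make that concrete, here is a minimal sketch of the kind of record a governance register might hold before a use case is allowed into MEASURE. All field names, tiers, and values are illustrative assumptions for this post, not terminology prescribed by NIST AI 600-1.

```python
from dataclasses import dataclass

# Hypothetical governance record; fields are illustrative, not NIST-prescribed.
@dataclass
class GenAIUseCase:
    name: str
    deployment_context: str        # e.g. "customer-facing", "analyst-facing"
    confabulation_tier: str        # e.g. "tier-1"
    max_confabulation_rate: float  # acceptable rate set by policy, e.g. 0.02
    test_owner: str                # who runs TEVV
    approver: str                  # who has authority to block deployment

def governance_complete(uc: GenAIUseCase) -> bool:
    """A use case is eligible for MEASURE only once ownership and a threshold exist."""
    return bool(uc.test_owner) and bool(uc.approver) and uc.max_confabulation_rate > 0

chatbot = GenAIUseCase(
    name="compliance-chatbot",
    deployment_context="customer-facing",
    confabulation_tier="tier-1",
    max_confabulation_rate=0.02,
    test_owner="model-risk-team",
    approver="cro-office",
)
print(governance_complete(chatbot))  # True
```

The point of the check is the ordering: a missing owner or an undefined threshold fails the gate before any testing happens, which is exactly the GOVERN-before-MEASURE sequencing the framework expects.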

MAP: Risk framing before you test

The MAP function requires you to characterize the confabulation risk before testing begins. That means:

  • Impact assessment: What decisions does this model inform? What happens if it fabricates a regulatory citation, invents a compliance deadline, or contradicts a policy document? Who is harmed?
  • Deployment context documentation: Is this model customer-facing? Analyst-facing? Does it produce regulatory submissions? The confabulation risk profile is fundamentally different across contexts.
  • Domain characterization: What regulatory domains, product types, or factual domains will the model be queried on? Your TEVV design needs to cover these domains — generic hallucination tests are insufficient.

This is the step teams skip most often. If your TEVV doesn’t test the domains where the model will actually be deployed — specific regulations, product structures, compliance requirements — you’re testing statistical performance, not operational risk.

MEASURE: The TEVV requirement in detail

This is where most teams spend their effort, and it’s also where NIST AI 600-1 is most specific. The framework calls for pre-deployment TEVV that assesses confabulation rates across domain-specific tasks, with defined thresholds and go/no-go gates before deployment.

What pre-deployment confabulation TEVV looks like:

1. Benchmark testing

Standardized benchmarks establish a baseline confabulation rate against known-answer tasks:

  • TruthfulQA tests whether LLMs propagate common misconceptions — run it on every model version change and significant prompt update. Note: TruthfulQA is now saturated by training data inclusion, so results need to be interpreted alongside other benchmarks.
  • HalluLens (presented at ACL 2025) provides a more recent, less saturated benchmark for both intrinsic and extrinsic confabulation.
  • BLEURT and FactScore evaluate factual precision in generated outputs — particularly useful for document-grounded tasks like contract review or regulatory analysis.

These establish baselines. They don’t replace domain-specific testing.
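The benchmark step can be sketched as a simple harness that computes a baseline confabulation rate over known-answer items. This is a sketch under stated assumptions: `ask_model` is a placeholder for your actual LLM call, and the substring grading is a naive stand-in for the trained judges real benchmarks like TruthfulQA and FactScore use.

```python
# Minimal benchmark harness sketch. `ask_model` is a stand-in for your LLM call;
# substring grading is a crude placeholder for a real benchmark judge.
def confabulation_rate(items, ask_model):
    """items: list of (prompt, gold_answer) pairs. Returns fraction graded wrong."""
    wrong = 0
    for prompt, gold in items:
        answer = ask_model(prompt)
        if gold.lower() not in answer.lower():  # naive grading, placeholder only
            wrong += 1
    return wrong / len(items)

# Stubbed model for illustration: always returns the same answer.
stub = lambda prompt: "Regulation Z requires disclosure within 3 business days."

items = [
    ("When must the disclosure be provided?", "3 business days"),
    ("Which regulation governs this disclosure?", "Regulation Z"),
    ("What is the maximum penalty?", "$5,000"),  # stub will get this wrong
]
print(confabulation_rate(items, stub))  # 0.333...
```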

2. Domain-specific adversarial testing

Generic benchmarks won’t catch the confabulations that matter most. You need to test the specific regulatory and product domains where the model will operate:

  • Prompt the model with questions about regulations that changed recently (post-training data cutoff) and evaluate whether it hallucinates current requirements
  • Test with real client scenarios that have definitive right answers — loan eligibility determinations, disclosure requirements, filing deadlines
  • Ask about specific enforcement actions, cases, or guidance documents and verify citations are real
  • Test with ambiguous or incomplete prompts to evaluate whether the model asks for clarification vs. fabricates assumptions
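A domain-tagged version of the same harness makes the coverage requirement auditable: confabulation rates are reported per regulatory domain rather than as one aggregate number. Again, `ask_model` and `grade` are hypothetical stand-ins you would replace with your LLM call and a real grading function.

```python
from collections import defaultdict

# Sketch of a domain-tagged known-answer harness. Exact-match grading is only
# safe for questions with one canonical answer; it is a placeholder here.
def rates_by_domain(cases, ask_model, grade):
    """cases: dicts with 'domain', 'prompt', 'gold'. Returns domain -> error rate."""
    tally = defaultdict(lambda: [0, 0])  # domain -> [wrong, total]
    for case in cases:
        correct = grade(ask_model(case["prompt"]), case["gold"])
        tally[case["domain"]][0] += 0 if correct else 1
        tally[case["domain"]][1] += 1
    return {d: wrong / total for d, (wrong, total) in tally.items()}

grade = lambda answer, gold: gold.lower() in answer.lower()
stub = lambda prompt: "The disclosure is due within 3 business days."

cases = [
    {"domain": "reg-z", "prompt": "When is the disclosure due?", "gold": "3 business days"},
    {"domain": "reg-z", "prompt": "Who must receive it?", "gold": "the borrower"},
    {"domain": "filing-deadlines", "prompt": "When is the annual filing due?", "gold": "March 31"},
]
print(rates_by_domain(cases, stub, grade))
```

Per-domain reporting is what lets you say "the model clears threshold on Reg Z questions but not on enforcement-action questions," which is the level of granularity a deployment gate needs.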

3. Red-teaming

NIST AI 600-1 explicitly names red-teaming as a required measurement technique. NIST’s own software tool, Dioptra, is designed for AI model testing including red-teaming exercises.

Red-teaming for confabulation should target:

  • Prompts designed to elicit overconfident responses (“What is the exact penalty under Section X?” when no specific penalty exists)
  • Citation requests (“What does OCC Bulletin X say about Y?”) for guidance that either doesn’t exist or has been superseded
  • Multi-turn prompts that challenge earlier responses to test for internal contradiction
  • Edge-case and out-of-distribution queries where training data is thin
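The multi-turn contradiction probe in particular is easy to automate. Here is a sketch of a two-turn probe; real red-teaming would compare claims semantically (for example with an NLI model) rather than by exact string comparison, and `ask_model` is again a placeholder.

```python
# Two-turn contradiction probe sketch. Exact string comparison is a placeholder
# for semantic claim comparison; `ask_model` takes the conversation so far.
def contradiction_probe(ask_model, question, challenge="Are you sure? I read the opposite."):
    first = ask_model([question])
    second = ask_model([question, first, challenge])
    return first.strip() != second.strip()  # True = model flipped under pressure

# Stub that caves when challenged, standing in for a sycophantic model.
def caving_stub(turns):
    return "No penalty applies." if len(turns) > 1 else "A $10,000 penalty applies."

print(contradiction_probe(caving_stub, "What is the penalty under Section X?"))  # True
```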

4. RAG grounding evaluation

For document-grounded use cases — contract review, regulatory Q&A, policy analysis — RAG architecture is your primary confabulation mitigation. But RAG introduces its own failure modes that need TEVV:

| RAG Failure Mode | What to Test |
| --- | --- |
| Retrieval miss | Correct answer exists in corpus but wasn't retrieved — model fills the gap with fabrication |
| Attribution error | Model uses retrieved content but doesn't cite it, or cites the wrong source |
| Boundary violation | Model goes beyond retrieved documents to answer from parametric memory |
| Retrieval hallucination | Model claims it retrieved something that wasn't actually in the corpus |

Test each failure mode explicitly. RAG can reduce hallucination rates by 60–80% in production systems, but that headline reduction is meaningless if you can't verify that your specific system achieves it in your specific deployment context.
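A boundary-violation check can be approximated cheaply: flag any answer sentence with little lexical overlap with the retrieved passages. This word-overlap heuristic is an illustrative assumption, not a production technique; real systems use entailment or attribution models for this.

```python
import re

# Naive grounding check for the "boundary violation" failure mode: flag answer
# sentences with low word overlap against the retrieved passages. Word overlap
# is a placeholder heuristic; production systems use entailment models.
def ungrounded_sentences(answer, passages, min_overlap=0.5):
    corpus = set(re.findall(r"[a-z0-9$]+", " ".join(passages).lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        words = re.findall(r"[a-z0-9$]+", sent.lower())
        if words and sum(w in corpus for w in words) / len(words) < min_overlap:
            flagged.append(sent)
    return flagged

passages = ["The agreement may be terminated with 30 days written notice."]
answer = ("The agreement may be terminated with 30 days written notice. "
          "A late fee of $250 applies to overdue invoices.")
print(ungrounded_sentences(answer, passages))
```

Here the second sentence is flagged because nothing in the retrieved passage supports it: the model has answered from parametric memory, which is exactly the failure mode the table above describes.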

5. Setting thresholds and go/no-go gates

NIST AI 600-1 requires explicit performance and safety thresholds with deployment gates — you need a defined answer to “what confabulation rate is acceptable for this use case?”

The threshold is context-specific:

  • Customer-facing chatbot providing compliance information: lower tolerance — fabricated regulatory requirements expose the institution to liability
  • Internal analyst tool drafting regulatory memos (human review required): higher tolerance — human-in-the-loop catches errors before external publication
  • Regulatory submission drafting assistance: near-zero tolerance — human must verify every citation before submission

Document the threshold, who set it, the rationale, and the TEVV results against it. If the model doesn’t clear the threshold, it doesn’t deploy until retesting shows otherwise.
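The gate itself reduces to a small, documentable decision function. The threshold value and domain names below are illustrative; the structure is what matters, since the decision record is what an examiner will ask to see.

```python
# Deployment-gate sketch: compare measured per-domain rates against the
# documented threshold and record the decision. Values are illustrative.
def deployment_gate(measured_rates, threshold):
    """measured_rates: dict of domain -> confabulation rate. Blocks on any failure."""
    failures = {d: r for d, r in measured_rates.items() if r > threshold}
    return {
        "decision": "no-go" if failures else "go",
        "threshold": threshold,
        "failing_domains": failures,
    }

result = deployment_gate({"reg-z": 0.01, "enforcement-actions": 0.08}, threshold=0.02)
print(result["decision"], result["failing_domains"])
```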

MANAGE: Post-deployment monitoring and incident response

Confabulation risk doesn’t end at deployment. NIST AI 600-1’s MANAGE function requires:

  • Ongoing monitoring metrics: track user-reported factual errors, output review samples, and downstream correction rates in production
  • Feedback loops: mechanisms for users or reviewers to flag confabulations — and a process that logs, categorizes, and routes them back to the model risk team
  • Retest triggers: define conditions that require re-TEVV — model version updates, prompt changes, new use case expansions, significant user-reported error patterns
  • Incident response: if a confabulation causes a material downstream error (a customer acted on a fabricated compliance requirement, a filing included a nonexistent regulation), that’s an AI incident requiring formal documentation and root cause analysis
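The retest-trigger logic above can be sketched as a simple policy function. The trigger conditions, field names, and the 5% window threshold are illustrative policy choices for this sketch, not values prescribed by NIST AI 600-1.

```python
# Retest-trigger sketch for the MANAGE function. Trigger names and the window
# threshold are illustrative policy choices, not NIST-prescribed values.
def retest_required(events, error_rate_window, max_window_rate=0.05):
    """events: recent change events; error_rate_window: user-flagged error rate."""
    change_triggers = {"model_version_update", "prompt_change", "new_use_case"}
    if any(e in change_triggers for e in events):
        return True
    return error_rate_window > max_window_rate

print(retest_required(["prompt_change"], error_rate_window=0.01))  # True
print(retest_required([], error_rate_window=0.02))                 # False
```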

The NIST AI RMF MEASURE function post covers the broader TEVV framework in detail — confabulation testing sits within that structure.


Human-in-the-Loop: The Underrated Control

NIST AI 600-1 explicitly addresses human-AI configuration as a separate risk category (risk #7), but its treatment intersects directly with confabulation management. The framework’s guidance on confabulation consistently points toward human oversight as the compensating control when automated testing can’t achieve acceptable thresholds:

  • Don’t deploy GenAI in contexts where confabulations can reach external audiences without human review unless confabulation rates are demonstrably low
  • Define which outputs require independent human verification before action
  • Train users on the specific confabulation failure modes relevant to their use case — most users significantly overestimate LLM factual reliability

The compliance teams that have the most trouble with LLM confabulation are the ones where the AI's outputs flow from model to decision without a human check. That's not just a risk management failure — it's a governance design failure.


Documentation That Satisfies an Examiner

If an examiner asks to see your confabulation controls for a deployed GenAI system, here’s what the documentation should include, mapped to NIST AI 600-1 requirements:

| Documentation Element | NIST AI 600-1 Alignment | Where to File It |
| --- | --- | --- |
| Confabulation risk classification in AI inventory | GOVERN: AI use case registry | Model inventory / AI governance register |
| Impact assessment for confabulation failures | MAP: harm and impact analysis | Pre-deployment risk assessment |
| TEVV results with benchmark scores | MEASURE: testing documentation | Model validation file |
| Go/no-go threshold and approval | MEASURE: deployment gate | Model approval memorandum |
| Ongoing monitoring metrics and review cadence | MANAGE: monitoring plan | Operational monitoring documentation |
| Incident log for confabulation events | MANAGE: incident response | Issues management tracker |

This documentation structure mirrors what SR 26-02 requires for traditional model validation — the same rigor applied to GenAI’s specific risk profile.


So What?

The shift you need to make is treating confabulation as a model risk governance question, not a product quality question. Those are governed by different teams, different documentation standards, and different escalation paths.

NIST AI 600-1 gives you the framework requirement. What it doesn’t give you is the testing infrastructure, the threshold-setting methodology, or the documentation templates — those are program-design decisions that your team has to make. Start with the highest-consequence deployments: any GenAI touching customer decisions, regulatory submissions, or compliance advisory functions. Build the TEVV there first.

The NIST AI 600-1 overview post covers all 12 risk categories in the framework. If you haven’t read the existing LLM hallucination management guide, that covers the detection and mitigation controls from a broader risk management lens — this post covers the regulatory compliance angle specifically.


Need a framework for AI model governance that includes pre-deployment checklists aligned to NIST AI 600-1? The AI Risk Assessment Template & Guide includes confabulation risk assessment tools, TEVV documentation templates, and a model inventory designed for financial services teams.

Frequently Asked Questions

What is confabulation in NIST AI 600-1?
NIST AI 600-1 defines confabulation as “the production of confidently stated but erroneous or false content” by generative AI systems. It is listed as one of 12 primary GenAI risk categories and includes outputs that diverge from the prompt, contradict prior statements, or fabricate facts, citations, or entities that don't exist.
How does NIST AI 600-1 differ from just calling it hallucination?
NIST uses confabulation specifically because it captures the technical mechanism — LLMs generating statistically plausible but factually wrong outputs — rather than the colloquial “hallucination” framing. The term emphasizes the root cause: the model produces content that fills a gap in its training distribution, not that it perceives something incorrectly.
What TEVV does NIST AI 600-1 require for confabulation?
NIST AI 600-1 requires pre-deployment TEVV that specifically assesses confabulation rates across domain-specific tasks, with performance thresholds and go/no-go deployment gates. Testing should include red-teaming, domain-specific benchmark evaluation, and RAG grounding assessment. Monitoring must continue post-deployment.
What benchmarks can I use to test for confabulation?
Common benchmarks include TruthfulQA (tests whether LLMs propagate common misconceptions), HalluLens (hallucination benchmark presented at ACL 2025), and FactScore (for factual precision in generation). For document-grounded tasks (like contract review or regulatory analysis), RAG-specific benchmarks that test whether retrieval actually grounds outputs are more relevant.
Does RAG eliminate confabulation risk?
No, but it significantly reduces it. RAG pipelines that ground responses in verified external documents reduce hallucination rates by 60-80% in production systems. However, RAG introduces its own failure modes: retrieval failures (wrong documents retrieved), attribution errors (correct docs retrieved but not cited), and boundary violations (model goes off-document). All require testing.
How should financial services firms document confabulation controls for examiners?
Documentation should include: (1) the confabulation risk classification from your AI use case inventory, (2) pre-deployment TEVV results with benchmark scores and thresholds, (3) the go/no-go decision and approval, (4) post-deployment monitoring metrics and review cadence, and (5) any incidents or near-misses logged. Tie each element to NIST AI 600-1 Section 2.5 and the relevant MEASURE function subcategories.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
