AI Risk

TEVV for Generative AI: Pre-Deployment Testing Requirements Under NIST AI 600-1

April 24, 2026 Rebecca Leung
Table of Contents

TL;DR:

  • NIST AI 600-1 requires pre-deployment TEVV (Test, Evaluate, Verify, Validate) across all 12 GenAI risk categories before deploying any generative AI system — not just accuracy testing.
  • The five highest-priority testing domains for financial services are: confabulation, harmful bias, information security, information integrity, and data privacy.
  • Structured red-team exercises are required under MP-2.3-005 — before deployment, not as an afterthought.
  • You need documented go/no-go gates (GV-1.3-002) and stop-build authority (GV-1.3-006) before testing starts, not after results are in.

Most compliance teams check a box when their AI vendor says the model passed internal safety testing. That’s not TEVV. That’s vendor attestation — and under NIST AI 600-1, it doesn’t satisfy your pre-deployment obligations as a deployer.

NIST AI 600-1, the Generative AI Profile released July 2024, is explicit: organizations that deploy generative AI to customers or in business processes cannot outsource TEVV to the model vendor. Your use case, your regulatory context, and your customer population require your own testing. The vendor tested their model. You need to test your deployment.

That distinction matters enormously for financial services. A foundation model might perform acceptably across general benchmarks. But when you deploy it for loan application assistance, your ECOA exposure is a function of how that model performs across your specific applicant population — not the vendor’s test set.

Here’s what the framework actually requires, and how to build a testing program that satisfies it.


Why Traditional Model Validation Falls Short for GenAI

SR 11-7 model validation was designed for a different kind of model. Traditional financial models — credit scores, fraud detection classifiers, AML monitoring systems — produce structured outputs against defined inputs. Validation tests statistical soundness: backtesting, sensitivity analysis, out-of-time performance benchmarking.

Generative AI systems produce probabilistic, open-ended outputs in natural language. The same input can produce different outputs on different runs. The failure modes aren’t underfitting or overfitting — they’re confabulation, bias amplification, prompt injection vulnerability, and privacy leakage. SR 11-7 wasn’t designed to catch any of these.

NIST AI 600-1 fills the gap. It provides 12 risk categories specific to or amplified by generative AI — the vocabulary and testing requirements that traditional model risk management simply didn’t need until now. The four functions of the broader AI RMF (GOVERN, MAP, MEASURE, MANAGE) structure how those requirements flow from governance policy through deployment and monitoring.

The NIST AI RMF MEASURE function covers TEVV methodology at a high level across all AI types. AI 600-1 makes it concrete for generative systems — specifying what you test, not just that you test.


What TEVV Means for GenAI

TEVV stands for Test, Evaluate, Verify, and Validate. For generative AI specifically, each component has a distinct function:

ComponentWhat It DoesGenAI Specific
TestStructured examination of system behavior before deploymentDomain-specific confabulation rates, adversarial probing, bias disparity testing
EvaluateBenchmark performance against pre-defined thresholdsDemographic parity ratios, confabulation error rates, injection success rates
VerifyConfirm the system meets its technical specificationsOutput format compliance, latency SLAs, model version matches documentation
ValidateConfirm the system is appropriate for its intended real-world useUse-case relevance, regulatory environment fit, customer population performance

The critical distinction: Verify asks “does it do what we built?” Validate asks “should we deploy it for this purpose?” Most teams do the former and skip the latter. AI 600-1 requires both.

For financial services, the Validate step is where ECOA, UDAAP, and fair lending obligations live. A GenAI system that technically does what it was designed to do can still generate discriminatory adverse action notices, confabulate regulatory requirements, or give customers legally inaccurate product information. Validation catches that before deployment.


The 12-Category Testing Matrix

NIST AI 600-1 defines 12 risk categories. Not all require equal testing depth for every use case — but you need to assess each and document why your testing coverage is sufficient for your specific deployment context.

Risk CategoryTesting RequiredFinancial Services Priority
ConfabulationBenchmark scoring on domain-specific queries, RAG grounding testsCritical — customer-facing or advisory uses
Harmful Bias & HomogenizationDemographic disparity analysis across protected classesCritical — any use touching underwriting, pricing, decisions
Information SecurityPrompt injection, data extraction, model extraction testingCritical — all deployments
Data PrivacyTraining data memorization, PII leakage testingHigh — all customer-data uses
Information IntegritySusceptibility to disinformation amplificationHigh — any public-facing or advisory use
Human-AI ConfigurationAutomation bias testing, appropriate escalationHigh — customer service, decisioning support
Value Chain & Component IntegrationThird-party model dependency risk, supply chain exposureHigh — all third-party GenAI deployments
Intellectual PropertyOutput copyright analysis, training data attributionMedium — varies by use case
CBRN InformationCapability assessment for dangerous content elicitationLow — unless model has broad generation capability
Dangerous/Violent ContentContent safety boundary testingLow-Medium — varies by customer interaction
Obscene/Degrading ContentContent safety boundary testingLow-Medium — varies by customer interaction
Environmental ImpactsCompute cost and energy efficiency assessmentLow — organizational sustainability reporting

Your MAP function work (risk classification and context framing) should determine which categories get deep testing versus a documented risk acceptance decision.


The Five Priority Testing Domains in Detail

1. Confabulation Testing

Confabulation — the production of confidently stated but erroneous content — is the GenAI failure mode most likely to create direct regulatory exposure in financial services. A customer chatbot that invents a fee structure that doesn’t exist in your disclosures. A compliance assistant that fabricates a regulatory citation. A document review tool that misses a clause by confidently describing something adjacent.

Pre-deployment confabulation testing requires:

  • Domain-specific benchmark evaluation: Don’t use general benchmarks. Test on queries drawn from your actual use case domain. If you’re deploying a mortgage assistance chatbot, test it on mortgage-specific fact scenarios where correct answers are verifiable against your product documentation.
  • RAG grounding assessment: If you’re using retrieval-augmented generation to ground outputs in verified documents, test whether retrieval actually constrains hallucination — or whether the model still goes off-document. Test boundary cases: what happens when the retrieved document doesn’t contain a clear answer?
  • Threshold definition: Define an acceptable confabulation rate before testing. GV-1.3-002 requires performance thresholds be established pre-testing so deployment decisions aren’t made post-hoc. For high-consequence customer-facing uses, a confabulation rate above 2-5% on domain-specific tasks should trigger a no-deploy decision.

See the full confabulation testing methodology for benchmark options and documentation format.

2. Harmful Bias Testing

For financial services, this is where ECOA and UDAAP exposure concentrates. AI 600-1’s Harmful Bias category requires demographic disparity analysis — testing whether the GenAI system’s outputs differ materially across protected class attributes.

Pre-deployment bias testing requires:

  • Disparate performance analysis: Does the system perform at the same accuracy rate across race, gender, age, national origin, and other protected characteristics? For GenAI, this includes output quality, response completeness, and confabulation rates disaggregated by simulated demographic scenarios.
  • Output tone and framing analysis: Does the system consistently use different language when describing similar situations to demographically different users? Subtle framing differences in customer communication can create UDAAP exposure.
  • Adverse action notice evaluation: For any GenAI used in decisioning support, evaluate whether the explanations generated for denial decisions satisfy ECOA’s adverse action notice requirements — including specificity and legal accuracy.

3. Information Security Testing

NIST AI 600-1 (MP-2.3-005) requires adversarial testing before and after deployment. This is not optional and it’s not satisfied by vendor attestation.

Required information security tests:

  • Direct prompt injection: Attempts to override system instructions through user input
  • Indirect prompt injection: Attempts to inject malicious instructions through retrieved documents or external data sources (critical for RAG deployments)
  • Data extraction probing: Attempts to elicit training data, system prompts, or other confidential information through structured queries
  • Model behavior boundary testing: Testing for jailbreak scenarios — attempts to circumvent content safety controls through role-play, hypotheticals, or encoded requests

For financial services deployments, also test: Can users elicit the system prompt? Can they extract other users’ conversation history? Can they cause the model to produce compliance-relevant content it’s configured to avoid?

4. Data Privacy Testing

Training data memorization is an underappreciated pre-deployment risk. Large language models can memorize and reproduce verbatim samples from training data — including PII, internal documents, or confidential financial information that appeared in training corpora.

Privacy pre-deployment testing includes:

  • Memorization probing: Structured queries designed to elicit verbatim reproduction of training data
  • PII leakage testing: Testing whether the model produces real names, account numbers, or other identifying information in response to partial prompts
  • Inference attack resistance: Testing whether model outputs reveal information that could allow re-identification of individuals from training data

5. Information Integrity Testing

Information integrity describes whether your GenAI system amplifies or resists disinformation. In a customer-facing context, this means: can adversarial users get your chatbot to repeat and appear to endorse false regulatory or product information? Can users manipulate the system into making statements that appear authoritative but are factually wrong?

This matters beyond internal risk — customer complaints arising from a GenAI system repeating user-supplied disinformation as fact create both UDAAP exposure and reputational risk.


Red-Team Testing Requirements

NIST AI 600-1 (MP-2.3-005) explicitly requires structured adversarial testing — red-team exercises — before deployment. Red-team testing goes beyond automated benchmark testing: it involves skilled testers actively trying to break the system using the same techniques adversarial users would employ.

Pre-deployment red-team scope should include:

  • All information security attack vectors described above
  • Scenario-based testing aligned to your specific use case (e.g., a mortgage chatbot red-team that tests for confabulated interest rate quotes, false regulatory claims, and discriminatory framing)
  • Testing by personnel outside the team that built the system — independent assessment is a consistent NIST requirement

Who should conduct the red team? At large institutions with dedicated AI risk teams, this is an internal function separate from the model development team. At smaller fintechs or banks, this is a case for bringing in external expertise — at minimum for the first deployment of a new GenAI use case type. The independence requirement isn’t ceremonial; it’s how you catch the things the development team is blind to.

Red-team documentation should capture: tester identities and independence from development, scope and methodology, all significant findings, mitigations applied, and a residual risk conclusion that feeds into the go/no-go decision.


Go/No-Go Gates and Stop-Build Authority

This is where most organizations’ GenAI governance has the biggest gap. Testing without pre-defined deployment criteria is theater — it generates findings but doesn’t drive decisions.

NIST AI 600-1 requires:

GV-1.3-002: Establish performance thresholds before testing. Define what passing looks like before any test data is in. This prevents post-hoc rationalization where “acceptable” shifts to match whatever results come back. Thresholds should be documented in your AI use case approval package and approved by model risk or governance before testing begins.

GV-1.3-006 and GV-1.3-007: Stop-build authority. The framework requires a defined policy and a named role empowered to halt development or deployment when testing reveals unacceptable risk — regardless of business pressure, schedule, or investment sunk. At most mid-size banks, this lives with the CRO or Chief Model Risk Officer. At fintechs, it’s often the Head of Compliance or equivalent first- or second-line risk leader.

Stop-build authority only works if it’s documented and the authority has teeth. If the executive team can simply override a compliance objection without a formal risk acceptance process, the control doesn’t exist.


Post-Deployment: When TEVV Continues

Pre-deployment TEVV is not a one-time event. NIST AI 600-1 requires continuous monitoring post-deployment because generative AI behavior can drift even when the model version doesn’t change — as the input distribution shifts, as fine-tuning is applied, or as the system prompt evolves.

Post-deployment requirements include:

  • Ongoing confabulation monitoring: Track confabulation rates on production queries, with a cadence defined in the deployment approval (monthly for high-risk uses, quarterly for lower-risk)
  • Ongoing bias monitoring: Monitor demographic performance disparities in production outputs, with triggers for re-evaluation if disparity metrics exceed thresholds
  • Periodic re-red-teaming: Schedule adversarial testing at defined intervals after deployment — at minimum annually, and triggered by any significant model update or use case change
  • Incident tracking: Log any outputs that triggered user complaints, regulatory inquiries, or internal flags as AI incidents (connected to the incident disclosure obligations covered separately in NIST AI 600-1’s content provenance and incident disclosure requirements)

How to Document TEVV for Examiners

Examiners applying SR 11-7 principles to GenAI deployments will expect documentation that mirrors a model validation report — but extended to cover the GenAI-specific testing domains.

Your TEVV documentation package should include:

  1. Pre-deployment test plan: Use case description, risk tier, applicable AI 600-1 risk categories, testing scope and methodology, defined pass/fail thresholds (GV-1.3-002)
  2. Test results by risk category: Benchmark scores and methodology for confabulation, bias disparity analysis results by demographic group, red-team findings and severity ratings, information security test results
  3. Go/no-go decision documentation: Who made the deployment decision, what findings were considered, any mitigations applied before go-live, and documented risk acceptances for any findings within tolerance
  4. Post-deployment monitoring plan: Metrics, cadence, responsible owner, escalation triggers

For third-party GenAI deployments, your documentation also needs to cover the Value Chain risk category — what testing the vendor conducted, what access you have to their test results, and what independent testing you ran on your use case layer on top.


So What?

The regulators haven’t published a GenAI TEVV examination module yet — but they’re applying AI 600-1 principles under existing SR 11-7 and safety-and-soundness authorities right now. The 2024 interagency request for information on AI, OCC Bulletin 2025-26, and every bank partner questionnaire on AI governance all point the same direction: demonstrate your pre-deployment testing, or explain why you didn’t think it was necessary.

“We tested it informally before we launched” isn’t going to hold up. “Here’s our TEVV plan, our threshold documentation, our red-team report, and our go/no-go decision” will.

If you’re building or deploying GenAI in financial services and you don’t have a structured pre-deployment testing program, start with the AI 600-1 risk category overview, prioritize the five testing domains most applicable to your use case, and build your threshold documentation before you start testing.

The AI Risk Assessment Template & Guide includes a pre-deployment assessment scorecard covering all 11 AI risk domains, worked examples for GenAI use cases, and documentation templates built for examiner review — so you’re not starting from a blank spreadsheet.


Frequently Asked Questions

Does NIST AI 600-1 pre-deployment testing apply to third-party GenAI tools we didn’t build?

Yes, and this is one of the most important points in the framework. If you are deploying a third-party GenAI tool — whether it’s a vendor chatbot, an AI-assisted underwriting tool, or a compliance workflow automation — you are a “deployer” under NIST AI 600-1, and your testing obligations are not satisfied by the vendor’s testing. You must conduct TEVV appropriate to your specific use case, regulatory context, and customer population. The vendor’s safety card is informative, not sufficient.

How long does pre-deployment TEVV take for a GenAI system?

It depends on use case complexity and risk tier. A low-risk internal productivity tool with no customer-facing output might require two to three weeks of testing and documentation. A customer-facing GenAI system touching regulated decisioning or advice — mortgage chatbot, credit application support, investment information — realistically requires six to ten weeks for a thorough TEVV including red-team testing, bias analysis, and documentation. Plan for it before you commit to a deployment timeline.

Can we use the vendor’s red-team results to satisfy MP-2.3-005?

Partially. Vendor red-team results are useful inputs and should be requested and reviewed. But they test the model in isolation — not your specific deployment configuration, system prompt, customer population, or use case. Your adversarial testing needs to test the full deployment: model + configuration + retrieval system + user interface + customer interaction patterns. Most vendor testing won’t cover that stack.

What if testing reveals a significant finding we can’t fix before the planned launch date?

Stop-build authority (GV-1.3-006) exists for exactly this situation. The framework requires that deployment be blocked when testing reveals risk above defined tolerance — regardless of schedule. The operational answer is: document the finding, assess severity against your defined thresholds, and either fix it before launching or document a formal risk acceptance with appropriate controls and a remediation timeline. Launching with a known unmitigated finding and no formal risk acceptance is the worst outcome — it transforms a governance gap into a deliberate decision that will look very bad in an examination.

How does this interact with our existing model risk management process?

NIST AI 600-1 TEVV is additive to SR 11-7 model validation, not a replacement. For GenAI systems that meet SR 11-7’s definition of a “model,” you still need traditional validation: conceptual soundness review, backtesting, benchmarking against alternatives. AI 600-1 TEVV adds the GenAI-specific testing domains on top — confabulation, adversarial robustness, bias across output modalities. Your model validation policy should explicitly address how these requirements interact, and your model risk committee should have approved the testing framework before you’re facing your first GenAI deployment decision.

Frequently Asked Questions

What does TEVV mean in NIST AI 600-1 for generative AI?
TEVV stands for Test, Evaluate, Verify, and Validate. In the context of NIST AI 600-1, it refers to the structured pre-deployment testing methodology specifically designed for generative AI systems. Testing examines known risk categories before deployment; Evaluation benchmarks performance against defined thresholds; Verification confirms the system meets its technical specifications; and Validation confirms it is appropriate for its intended real-world use case. For GenAI, TEVV must cover confabulation, bias, adversarial robustness, information security, and information integrity — not just accuracy metrics.
What must be tested before deploying a generative AI system under NIST AI 600-1?
NIST AI 600-1 requires pre-deployment testing across its 12 risk categories, with five as immediate priorities: confabulation (false output rates on domain-specific tasks), harmful bias (performance disparities across demographic groups), information security (prompt injection and data extraction vulnerabilities), information integrity (susceptibility to disinformation amplification), and data privacy (training data leakage and memorization). The framework also requires structured red-team exercises and stop-build authority — a defined role empowered to halt deployment if testing reveals unacceptable risk.
How is GenAI TEVV different from traditional SR 11-7 model validation?
Traditional SR 11-7 validation tests whether a model's outputs are statistically sound: backtesting, sensitivity analysis, and benchmarking against historical data. GenAI TEVV tests whether probabilistic, generative outputs are safe, fair, accurate, and robustness to adversarial manipulation — none of which traditional model validation frameworks were built to assess. SR 11-7 didn't anticipate prompt injection attacks, confabulation in domain-specific queries, or bias across output modalities. AI 600-1 fills that gap.
Is red-team testing required under NIST AI 600-1 before deployment?
Yes. NIST AI 600-1 (MP-2.3-005) requires that generative AI systems undergo adversarial testing — red-team exercises — to identify vulnerabilities and potential manipulation or misuse, both before and after deployment. Pre-deployment red-teaming should include prompt injection attempts, jailbreak scenarios, and targeted tests for the risk categories most relevant to your use case. Financial services firms should also test for ECOA/UDAAP-relevant bias in any GenAI touching underwriting, pricing, or customer decisioning.
What are go/no-go gates under NIST AI 600-1 for GenAI deployment?
NIST AI 600-1 (GV-1.3-002) requires that performance thresholds be established before testing begins and that deployment decisions be made against those thresholds — not after the fact. For financial services, go/no-go gates should include: maximum acceptable confabulation rate for domain-specific queries, maximum acceptable demographic performance disparity across protected classes, pass/fail for prompt injection vulnerability, and a privacy leakage threshold. GV-1.3-006 and GV-1.3-007 also require a stop-build authority — a named role empowered to block deployment regardless of business pressure.
How do I document TEVV results for a bank examiner?
Your documentation package should include: the pre-deployment test plan with defined thresholds (per GV-1.3-002), test results across each applicable risk category with benchmark scores, the go/no-go decision and who made it (per GV-1.3-006), any findings and mitigations applied before deployment, and the post-deployment monitoring plan and cadence. Tie each section to its corresponding NIST AI 600-1 action. Most examiners following SR 11-7 principles will expect to see this organized like a model validation report — but extended to cover bias, adversarial robustness, and confabulation.
Rebecca Leung

Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

Related Framework

AI Risk Assessment Template & Guide

Comprehensive AI model governance and risk assessment templates for financial services teams.

Immaterial Findings ✉️

Weekly newsletter

Sharp risk & compliance insights practitioners actually read. Enforcement actions, regulatory shifts, and practical frameworks — no fluff, no filler.

Join practitioners from banks, fintechs, and asset managers. Delivered weekly.