AI Risk

NIST AI RMF MEASURE Function: TEVV, Bias Testing, and Metrics That Actually Matter

April 22, 2026 Rebecca Leung

TL;DR:

  • MEASURE is the NIST AI RMF’s “show your work” function — where MAP’s risk context becomes quantified evidence
  • TEVV (Test, Evaluation, Validation, Verification) is the structured methodology that powers MEASURE, covering everything from pre-deployment testing to continuous production monitoring
  • MEASURE 2 has 13 subcategories covering all seven trustworthy AI characteristics — each requires documented outputs, not just processes
  • For financial services, MEASURE maps directly to SR 11-7 model validation, but extends it to cover bias, privacy, and environmental risk in ways SR 11-7 doesn’t address

Most AI governance teams treat the NIST AI RMF like a compliance checklist. They build an inventory (GOVERN), do a risk classification (MAP), and then move on — skipping MEASURE because “we have model validation.”

That’s a problem. MEASURE is where governance stops being theoretical. It’s where you move from “we identified these AI risks during MAP” to “here’s our quantified evidence that those risks are within tolerance.” Without MEASURE, you have a risk program built on declarations. With it, you have one built on evidence.

Regulators increasingly know the difference.


What MEASURE Is — and Where It Sits

The NIST AI RMF has four functions: GOVERN (culture, accountability, policy), MAP (context framing, risk classification), MEASURE (quantification and analysis), and MANAGE (treatment, response, recovery). They’re not a linear sequence — the framework is designed as a continuous loop. But understanding the sequence helps.

MAP creates the context: what is this AI system’s purpose, who uses it, what are the potential harms, and which risks are categorized as significant? MEASURE takes that context and asks: how do we know if those risks are actually within tolerance? What evidence would convince a skeptical examiner?

The GOVERN function establishes the accountability structures and policies. MEASURE operationalizes them.

MEASURE is built around four main categories:

  • MEASURE 1 — Selecting appropriate methods and metrics for identified risks
  • MEASURE 2 — Evaluating all seven trustworthy AI characteristics across 13 subcategories
  • MEASURE 3 — Tracking emergent and ongoing risks
  • MEASURE 4 — Feeding measurement efficacy back into the program

TEVV: The Engine of MEASURE

TEVV stands for Test, Evaluation, Validation, and Verification. It’s the structured methodology NIST uses to operationalize MEASURE, and it runs across all four categories.

Here’s how each component maps to AI governance activities:

| TEVV Component | When It Happens | What It Covers |
| --- | --- | --- |
| Testing | Pre-deployment | Structured examination of model behavior using defined test sets, adversarial inputs, and edge cases |
| Evaluation | Ongoing (pre + post) | Performance assessment against baseline metrics, benchmark comparisons, and demographic analysis |
| Validation | Pre-deployment + major changes | Confirming the model is appropriate for its intended purpose and deployment context |
| Verification | Pre-deployment | Confirming the model meets its technical specifications and design requirements |

In financial services, TEVV maps onto the SR 11-7 model validation lifecycle: conceptual soundness review (Validation), outcome analysis (Evaluation), sensitivity analysis (Testing), and ongoing monitoring (Evaluation + Testing continuously). But TEVV extends SR 11-7 in two important directions: it explicitly covers bias evaluation (which SR 11-7 addresses only implicitly) and it includes environmental impact assessment (which SR 11-7 doesn’t address at all).

MEASURE 2.1 requires organizations to document the specific test sets, metrics, and TEVV tools they use. This is the artifact that examiners will ask for.


MEASURE 1: Choosing Methods That Match the Risk

Before any evaluation starts, you have to select appropriate measurement approaches for the risks you identified in MAP. That sounds obvious, but it’s where many programs go wrong.

MEASURE 1.1 requires selecting measurement approaches for significant AI risks — and documenting which risks cannot yet be measured with available techniques. That second part matters. “We identified this risk but have no reliable way to quantify it currently” is an acceptable answer in NIST AI RMF, as long as it’s documented. An undocumented unmeasurable risk is a governance gap. A documented one is a limitation disclosure.
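A lightweight register makes that distinction auditable. Here is a minimal Python sketch; the field names and risk entries are illustrative, not NIST-prescribed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskMeasurement:
    """One MEASURE 1.1 entry: a risk carried over from MAP, plus how (or whether) it is measured."""
    risk_id: str
    description: str
    measurement_approach: Optional[str]  # None = no reliable technique available yet
    limitation_note: str = ""            # required whenever measurement_approach is None

# Illustrative entries, not real risk assessments
register = [
    RiskMeasurement("R-001", "Disparate impact in credit decisions",
                    "Four-fifths rule on approval rates, quarterly"),
    RiskMeasurement("R-002", "LLM hallucination in customer chat",
                    None, "No reliable frequency metric yet; mitigated via human review"),
]

# Governance gap check: an unmeasurable risk without a limitation note is a finding.
gaps = [r.risk_id for r in register if r.measurement_approach is None and not r.limitation_note]
print(gaps)  # → [] (every unmeasurable risk carries a documented limitation)
```

The gap check is the point: it turns "documented limitation disclosure" into something a reviewer can verify mechanically.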

MEASURE 1.2 requires regularly assessing whether your chosen metrics are still appropriate and whether controls are still effective. A metric that was valid at deployment may stop being valid as the model’s operating context changes. This feeds directly into your model risk tiering review cycle.

MEASURE 1.3 requires involvement of independent experts and domain specialists in evaluations — particularly for high-risk AI systems. This is the NIST AI RMF’s version of SR 11-7’s independent validation requirement. For financial institutions already running formal model validation, this is table stakes. For fintechs without dedicated model risk teams, it means commissioning external validation for your highest-risk models at least annually.


MEASURE 2: The 13 Trustworthy Characteristics

This is the core of MEASURE — where the seven trustworthy AI characteristics (valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed) get quantified and documented. MEASURE 2 has 13 subcategories:

| Subcategory | What It Requires | Output |
| --- | --- | --- |
| 2.1 | Document test sets, metrics, and TEVV tools | TEVV methodology documentation |
| 2.2 | Human subject evaluations meeting protection requirements and population representativeness | Evaluation protocol + IRB compliance (if applicable) |
| 2.3 | Performance measurement under deployment-like conditions | Pre-deployment test results |
| 2.4 | Monitor system functionality and behavior during production | Ongoing monitoring dashboard/log |
| 2.5 | Demonstrate validation and reliability, document generalizability limitations | Validation report with scope limitations |
| 2.6 | Regular safety risk evaluation, demonstrating safe operation within risk tolerance | Safety evaluation log |
| 2.7 | Document security and resilience evaluation | Adversarial testing + red team results |
| 2.8 | Examine transparency and accountability risks | Accountability documentation, audit trails |
| 2.9 | Document model explanation, validation, and output interpretation | Explainability artifacts (SHAP values, feature importance, etc.) |
| 2.10 | Examine and document privacy risks | Privacy impact assessment |
| 2.11 | Document fairness and bias evaluation results | Bias testing documentation with disaggregated results |
| 2.12 | Assess environmental impact and sustainability | Carbon/energy metrics (where material) |
| 2.13 | Evaluate TEVV metrics and process effectiveness | Meta-evaluation of the measurement program itself |

For financial services, the highest-stakes subcategories are 2.3, 2.4, 2.9, and 2.11.

MEASURE 2.3 and 2.4: Pre-Deployment vs. Production

MEASURE 2.3 covers testing under deployment-like conditions before go-live. For a credit decisioning model, this means running the model against a holdout dataset that reflects the actual demographic distribution of your expected borrower population — not a sanitized development set. The results need to be documented and reviewed before deployment approval.

MEASURE 2.4 is the continuous post-deployment monitoring requirement. This is where drift detection and continuous monitoring live. Minimum expectations: track accuracy and error rates monthly, flag deviations beyond defined thresholds, and document interventions. For high-risk models (credit, fraud, underwriting), quarterly comprehensive reviews are the baseline expectation in SR 11-7-aligned programs.
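Drift checks like this are straightforward to automate. Below is a minimal sketch of a population stability index (PSI) check against a validation-time baseline; the data, bucket count, and thresholds are illustrative, and PSI conventions vary by institution:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (expected) and a production (actual) score sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, cuts[0], cuts[-1])   # keep out-of-range scores in the end buckets
    e_frac = np.histogram(expected, bins=cuts)[0] / len(expected)
    a_frac = np.histogram(actual, bins=cuts)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)          # avoid log(0) on empty buckets
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(600, 50, 10_000)            # score distribution at validation time
production = rng.normal(615, 55, 10_000)          # shifted production distribution
psi = population_stability_index(baseline, production)
print(f"PSI = {psi:.3f}")                         # log monthly; alert past the defined threshold
```

Wiring the output to a defined threshold, with documented interventions when it trips, is what turns this from a metric into a MEASURE 2.4 artifact.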

MEASURE 2.9: Explainability Isn’t Optional

Regulators in financial services have been explicit: AI models used in consumer lending must produce explanations specific enough to satisfy Regulation B adverse action notice requirements. MEASURE 2.9 requires documenting your explanation methodology and validating that outputs are accurate — meaning the explanations actually reflect what the model considered.

The documentation artifact here matters. “We use SHAP values” is not documentation. “We use SHAP values, our explanation validation confirms they correlate with actual model outputs at r=0.94, and here’s our mapping from SHAP to adverse action notice language” is documentation.
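That fidelity check can be scripted. The sketch below uses a linear surrogate's additive attributions as a stand-in for SHAP values so it runs without the shap library; the validation logic, correlating reconstructed explanations with actual model outputs and documenting the coefficient, is the same either way:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))

def model_score(X):
    """Stand-in for a deployed model with a mildly nonlinear term."""
    return 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2 + 0.1 * X[:, 3]

# Additive per-feature attributions from a linear surrogate fit to the model
# (a stand-in for SHAP values; real programs would use their actual explainer).
design = np.c_[X, np.ones(len(X))]
coefs, *_ = np.linalg.lstsq(design, model_score(X), rcond=None)
attributions = X * coefs[:4]                 # one contribution per feature per row
reconstructed = attributions.sum(axis=1) + coefs[4]

# MEASURE 2.9 fidelity check: do the explanations track actual model outputs?
r = np.corrcoef(reconstructed, model_score(X))[0, 1]
print(f"explanation fidelity r = {r:.2f}")   # document the number, not just the method
```

A fidelity coefficient below whatever floor you pre-define is itself a finding: it means the explanations feeding adverse action notices don't reflect what the model actually did.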

MEASURE 2.11: Bias Testing Requirements

MEASURE 2.11 is one of the most operationally demanding subcategories. It requires documenting fairness and bias evaluation results — not just running the tests, but capturing what you found and what you did about it.

NIST identifies five harm types that bias testing should address:

| Harm Type | What It Means | Financial Services Example |
| --- | --- | --- |
| Allocational | System allocates resources or opportunities unfairly | Credit denials disparately impacting protected classes |
| Representational | System stereotypes or demeans certain groups | Marketing AI targeting based on inferred demographics |
| Quality of service | System performs differently by group | Fraud models with higher false positive rates for certain ethnicities |
| Stereotyping | System reinforces harmful generalizations | Customer service AI treating certain names differently |
| Erasure | System renders certain groups invisible | Underwriting models trained on non-representative data |

Required documentation under MEASURE 2.11:

  • Demographic parity analysis (overall approval/denial rates by protected class)
  • Equal opportunity analysis (true positive rates by protected class)
  • Disparate impact testing against the four-fifths (80%) rule threshold
  • Intersectional analysis (not just race OR gender, but race AND gender)
  • Results by geographic region if operating in multiple markets
  • Methodology for any disparities found to be statistically significant
  • Remediation actions taken, if any
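The first items on that list reduce to simple rate comparisons. A minimal sketch of the four-fifths rule and an equal opportunity gap, using illustrative counts and labels rather than real data:

```python
import numpy as np

def four_fifths_check(approvals):
    """Adverse impact ratio: lowest group approval rate / highest group approval rate.
    approvals maps group -> (n_approved, n_applicants)."""
    rates = {g: a / n for g, (a, n) in approvals.items()}
    ratio = min(rates.values()) / max(rates.values())
    return ratio, ratio >= 0.8               # four-fifths (80%) rule threshold

def equal_opportunity_gap(y_true, y_pred, group):
    """Largest spread in true positive rates across groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)

# Illustrative counts and labels, not real data
approvals = {"group_a": (620, 1000), "group_b": (455, 1000)}
ratio, passes = four_fifths_check(approvals)

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0])
group  = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
gap = equal_opportunity_gap(y_true, y_pred, group)

print(f"adverse impact ratio {ratio:.2f} (passes four-fifths: {passes}); TPR gap {gap:.2f}")
```

The documentation requirement is satisfied by the recorded numbers plus the remediation decision, not by the existence of the script; intersectional analysis means running the same comparisons on combined group labels (e.g. race-and-gender cells).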

This extends and complements what’s required under SR 11-7 for AI model validation. SR 11-7 requires ongoing monitoring; MEASURE 2.11 specifies the types of bias analysis required within that monitoring.

For detailed testing methodologies — disparate impact analysis, counterfactual fairness, calibration testing — the bias testing guide covers the statistical methods in depth.


MEASURE 3: Tracking Emergent Risks

Risks don’t stay static. A model that passed all MEASURE 2 tests at deployment may develop new failure modes as data distributions shift, as new use patterns emerge, or as the broader operating environment changes.

MEASURE 3.1 requires ongoing identification and tracking of existing and emergent risks. In practice, this means maintaining a living risk log for each AI system — distinct from the model inventory — that captures newly observed behaviors, near-misses, edge cases encountered in production, and any changes in the deployment context that could affect risk.

MEASURE 3.2 requires documenting approaches for risks that are difficult to assess with current techniques. Hallucination in large language models is the obvious current example: we know it’s a risk, but reliable measurement techniques are still maturing. Documenting “we cannot yet reliably quantify hallucination frequency; we mitigate through human-in-the-loop review for all high-stakes outputs” is the right answer.

MEASURE 3.3 requires feedback processes that let end users report problems and appeal outcomes. For consumer-facing AI in financial services, this links directly to complaint management and adverse action appeal processes. If your credit model denies someone, and they appeal, that appeal process generates data that feeds MEASURE 3.1’s emergent risk tracking.


MEASURE 4: Does Your Measurement Program Actually Work?

MEASURE 4 is the meta-level — assessing whether the measurement activities themselves are valid and effective.

MEASURE 4.1 requires that measurement approaches be informed by domain experts. For financial services AI, that means the people defining your metrics should include model risk managers, compliance officers, and business line owners — not just data scientists.

MEASURE 4.2 requires that deployment context results be informed by relevant stakeholders. If your credit model is deployed in a community bank context, community feedback on model performance should be captured and reviewed — not just internal metrics.

MEASURE 4.3 requires documenting performance improvements or declines based on field data. This is the learning loop: what did production data tell you that your pre-deployment testing didn’t? How are you incorporating that into model development and validation cycles?


MEASURE in Financial Services: Practical Implementation

Here’s a realistic 90-day build-out for a financial services team standing up MEASURE compliance for the first time:

Days 1–30: Method Selection and Documentation

  • Complete MEASURE 1.1: For each significant AI risk identified in MAP, document the measurement approach and accept/document unmeasurable risks
  • Inventory existing test sets and performance metrics; identify gaps
  • Define performance thresholds with documented course-correction triggers

Days 31–60: Build Out MEASURE 2 Documentation

  • Prioritize subcategories by regulatory exposure: start with 2.3 (pre-deployment testing), 2.4 (production monitoring), 2.9 (explainability), and 2.11 (bias testing)
  • Run initial bias evaluation using disaggregated methods; document results
  • Confirm explainability artifacts are accurate (validate explanations against actual model behavior)
  • Schedule MEASURE 1.3 independent review for any Tier 1 or Tier 2 AI models

Days 61–90: Implement MEASURE 3 and 4

  • Stand up the AI risk incident log for MEASURE 3.1
  • Document limitations for risks in MEASURE 3.2 (hallucination, complex GenAI behaviors)
  • Ensure complaint/feedback channels feed MEASURE 3.3
  • Run a meta-evaluation of your measurement program against MEASURE 4.1–4.3 criteria

The “So What” for Your Exam

Regulators aren’t yet citing NIST AI RMF subcategory numbers in examination findings. But they’re asking questions that map directly to MEASURE:

  • “What testing did you do before this model went live?” → MEASURE 2.3
  • “How are you monitoring this model in production?” → MEASURE 2.4
  • “Can you explain the output for this specific customer?” → MEASURE 2.9
  • “Did you test for disparate impact?” → MEASURE 2.11
  • “How do you know your monitoring metrics are the right ones?” → MEASURE 1.1 and 4.3

The organizations that can answer all five of those questions with documented artifacts are the ones that come out of model risk exams with findings marked “informational” rather than MRAs.

The AI Risk Assessment Template includes a model monitoring dashboard template and bias assessment guide built around SR 11-7 and NIST AI RMF alignment — if you need a starting point for the measurement artifacts MEASURE requires.


FAQ

What’s the connection between NIST AI RMF MEASURE and NIST AI 600-1?

NIST AI 600-1 (the Generative AI Profile) adds specificity to MEASURE for GenAI systems. It introduces GenAI-specific risk categories — confabulation, data privacy, homogenization, value chain impacts — that require measurement approaches beyond what MEASURE 2’s general subcategories cover. If you’re deploying any LLM-based systems, you’ll need AI 600-1 alongside the core MEASURE playbook.

How often should we run MEASURE 2.11 bias evaluations?

At minimum annually for all AI models used in consumer decision-making. For high-risk models (credit underwriting, fraud scoring with consumer impact, collection prioritization), quarterly is the defensible frequency given SR 11-7’s model monitoring expectations. For models with known demographic sensitivity, some organizations run monthly disaggregated evaluation as a condition of continued deployment approval.

Do we need to document MEASURE 2.12 (environmental impact) for financial services AI?

Currently no financial services regulator in the US requires environmental impact assessment for AI systems specifically. But MEASURE 2.12 is still worth including in documentation for larger-scale GenAI deployments — particularly if your institution has public ESG commitments. Documenting it demonstrates comprehensive governance even where it’s not yet required.

Can we use our existing model validation process to satisfy MEASURE?

Partially. SR 11-7 model validation covers the core of MEASURE 2.1–2.5. But most existing SR 11-7 validation processes don’t systematically address MEASURE 2.10 (privacy), 2.11 (bias with the NIST-specified harm categories), or 2.12 (environmental impact). If you’re using an SR 11-7 framework as your baseline, layer MEASURE 2.10–2.12 on top of it rather than rebuilding from scratch.

What should we do when a MEASURE evaluation produces results outside acceptable thresholds?

MEASURE doesn’t prescribe the response — that’s MANAGE’s job. But MEASURE 2.6 requires safety evaluation documentation that includes course-correction suggestions for when systems exceed acceptable limits. In practice: pre-define your escalation triggers (e.g., “if demographic parity drops below 0.80, trigger independent review”), document them in your TEVV methodology, and make sure the MANAGE function has a response protocol ready before the threshold is breached, not after.
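Pre-defining triggers can be as simple as a declarative table that both the MEASURE and MANAGE processes read from. A hypothetical sketch (metric names, thresholds, and actions are illustrative):

```python
# Hypothetical escalation triggers, pre-defined in the TEVV methodology so that
# MANAGE has a response protocol before any threshold is breached.
TRIGGERS = {
    "demographic_parity_ratio": {"min": 0.80, "action": "independent fairness review"},
    "monthly_accuracy":         {"min": 0.92, "action": "model risk escalation"},
    "psi":                      {"max": 0.25, "action": "drift investigation"},
}

def escalations(metrics):
    """Return the pre-agreed actions fired by the current metric snapshot."""
    fired = []
    for name, rule in TRIGGERS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if ("min" in rule and value < rule["min"]) or ("max" in rule and value > rule["max"]):
            fired.append(rule["action"])
    return fired

print(escalations({"demographic_parity_ratio": 0.78, "monthly_accuracy": 0.95, "psi": 0.10}))
# → ['independent fairness review']
```

Keeping the thresholds in one declarative table means the TEVV documentation and the production alerting can never silently diverge.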

Frequently Asked Questions

What is the MEASURE function in the NIST AI RMF?

MEASURE is the third function in the NIST AI RMF (after GOVERN and MAP). It applies quantitative, qualitative, or mixed-method tools to analyze, assess, benchmark, and monitor AI risk. MEASURE has four main categories: selecting appropriate methods (1), evaluating trustworthy AI characteristics across 13 subcategories (2), tracking emergent risks (3), and feeding measurement results back into the program (4).

What is TEVV in the NIST AI RMF?

TEVV stands for Test, Evaluation, Validation, and Verification. Under the NIST AI RMF MEASURE function, TEVV describes the structured process of testing AI system behavior before and after deployment: Testing (structured pre-deployment examination), Evaluation (ongoing performance assessment against baseline), Validation (confirming the system is appropriate for its intended purpose), and Verification (confirming the system meets its technical specifications).

What does NIST AI RMF MEASURE 2.11 require for bias testing?

MEASURE 2.11 requires organizations to document fairness and bias evaluation results. This includes identifying harm types (allocational, representational, quality of service, stereotyping, and erasure harms), analyzing disparities across and within demographic groups including intersecting groups, using pre-processing data transformations, in-processing model adjustments, and post-processing techniques as appropriate, and employing disaggregated evaluation methods by race, age, gender, ethnicity, ability, and region.

How does the NIST AI RMF MEASURE function relate to SR 11-7 model validation?

MEASURE aligns closely with SR 11-7 model validation requirements. MEASURE 2.1–2.5 map directly to SR 11-7's conceptual soundness, ongoing monitoring, and outcomes analysis requirements. MEASURE 2.11 (bias testing) extends SR 11-7 for AI-specific fairness concerns. SR 11-7 requires independent validation; MEASURE 1.3 explicitly requires independent expert involvement. Together they form a defensible model risk governance structure.

What metrics should financial services firms track under the NIST AI RMF MEASURE function?

Financial services firms should track: accuracy, false positive/negative rates, and prediction latency (performance); demographic parity and equal opportunity rates by protected class (fairness); distribution shift and population stability index (drift); feature importance stability over time (model integrity); explainability coverage (what percent of decisions can be explained at required granularity); and incident counts by severity tier. Thresholds for each should be defined during MAP and documented with course-correction triggers.

What is the difference between MEASURE 2.3 and MEASURE 2.4 in the NIST AI RMF?

MEASURE 2.3 covers qualitative or quantitative performance measurement conducted under deployment-like conditions — essentially pre-deployment testing with realistic inputs. MEASURE 2.4 covers monitoring of actual system functionality and behavior during production — post-deployment surveillance once the model is live. Both are required; 2.3 happens before go-live, 2.4 is continuous afterward.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

Related Framework

AI Risk Assessment Template & Guide

Comprehensive AI model governance and risk assessment templates for financial services teams.
