AI Risk

NIST AI RMF MEASURE Function: TEVV, Bias Testing, and Metrics That Actually Matter

April 22, 2026 Rebecca Leung

TL;DR:

  • MEASURE is the NIST AI RMF’s “show your work” function — where MAP’s risk context becomes quantified evidence
  • TEVV (Test, Evaluation, Validation, Verification) is the structured methodology that powers MEASURE, covering everything from pre-deployment testing to continuous production monitoring
  • MEASURE 2 has 13 subcategories covering all seven trustworthy AI characteristics — each requires documented outputs, not just processes
  • For financial services, MEASURE maps directly to SR 11-7 model validation, but extends it to cover bias, privacy, and environmental risk in ways SR 11-7 doesn’t address

Most AI governance teams treat the NIST AI RMF like a compliance checklist. They build an inventory (GOVERN), do a risk classification (MAP), and then move on — skipping MEASURE because “we have model validation.”

That’s a problem. MEASURE is where governance stops being theoretical. It’s where you move from “we identified these AI risks during MAP” to “here’s our quantified evidence that those risks are within tolerance.” Without MEASURE, you have a risk program built on declarations. With it, you have one built on evidence.

Regulators increasingly know the difference.


What MEASURE Is — and Where It Sits

The NIST AI RMF has four functions: GOVERN (culture, accountability, policy), MAP (context framing, risk classification), MEASURE (quantification and analysis), and MANAGE (treatment, response, recovery). They’re not a linear sequence — the framework is designed as a continuous loop. But understanding the sequence helps.

MAP creates the context: what is this AI system’s purpose, who uses it, what are the potential harms, and which risks are categorized as significant? MEASURE takes that context and asks: how do we know if those risks are actually within tolerance? What evidence would convince a skeptical examiner?

The GOVERN function establishes the accountability structures and policies. MEASURE operationalizes them.

MEASURE is built around four main categories:

  • MEASURE 1 — Selecting appropriate methods and metrics for identified risks
  • MEASURE 2 — Evaluating all seven trustworthy AI characteristics across 13 subcategories
  • MEASURE 3 — Tracking emergent and ongoing risks
  • MEASURE 4 — Feeding measurement efficacy back into the program

TEVV: The Engine of MEASURE

TEVV stands for Test, Evaluation, Validation, and Verification. It’s the structured methodology NIST uses to operationalize MEASURE, and it runs across all four categories.

Here’s how each component maps to AI governance activities:

| TEVV Component | When It Happens | What It Covers |
| --- | --- | --- |
| Testing | Pre-deployment | Structured examination of model behavior using defined test sets, adversarial inputs, and edge cases |
| Evaluation | Ongoing (pre + post) | Performance assessment against baseline metrics, benchmark comparisons, and demographic analysis |
| Validation | Pre-deployment + major changes | Confirming the model is appropriate for its intended purpose and deployment context |
| Verification | Pre-deployment | Confirming the model meets its technical specifications and design requirements |

In financial services, TEVV maps onto the SR 11-7 model validation lifecycle: conceptual soundness review (Validation), outcome analysis (Evaluation), sensitivity analysis (Testing), and ongoing monitoring (Evaluation + Testing continuously). But TEVV extends SR 11-7 in two important directions: it explicitly covers bias evaluation (which SR 11-7 addresses only implicitly) and it includes environmental impact assessment (which SR 11-7 doesn’t address at all).

MEASURE 2.1 requires organizations to document the specific test sets, metrics, and TEVV tools they use. This is the artifact that examiners will ask for.


MEASURE 1: Choosing Methods That Match the Risk

Before any evaluation starts, you have to select appropriate measurement approaches for the risks you identified in MAP. That sounds obvious, but it’s where many programs go wrong.

MEASURE 1.1 requires selecting measurement approaches for significant AI risks — and documenting which risks cannot yet be measured with available techniques. That second part matters. “We identified this risk but have no reliable way to quantify it currently” is an acceptable answer in NIST AI RMF, as long as it’s documented. An undocumented unmeasurable risk is a governance gap. A documented one is a limitation disclosure.
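A lightweight register makes that distinction auditable. Here is a minimal Python sketch; the field names and risk entries are illustrative, not NIST-prescribed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskMeasurement:
    """One MEASURE 1.1 entry: a risk carried over from MAP, plus how (or whether) it is measured."""
    risk_id: str
    description: str
    measurement_approach: Optional[str]  # None = no reliable technique available yet
    limitation_note: str = ""            # required whenever measurement_approach is None

# Illustrative entries, not real risk assessments
register = [
    RiskMeasurement("R-001", "Disparate impact in credit decisions",
                    "Four-fifths rule on approval rates, quarterly"),
    RiskMeasurement("R-002", "LLM hallucination in customer chat",
                    None, "No reliable frequency metric yet; mitigated via human review"),
]

# Governance gap check: an unmeasurable risk without a limitation note is a finding.
gaps = [r.risk_id for r in register if r.measurement_approach is None and not r.limitation_note]
print(gaps)  # → [] (every unmeasurable risk carries a documented limitation)
```

The gap check is the point: it turns "documented limitation disclosure" into something a reviewer can verify mechanically.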

MEASURE 1.2 requires regularly assessing whether your chosen metrics are still appropriate and whether controls are still effective. A metric that was valid at deployment may stop being valid as the model’s operating context changes. This feeds directly into your model risk tiering review cycle.

MEASURE 1.3 requires involvement of independent experts and domain specialists in evaluations — particularly for high-risk AI systems. This is the NIST AI RMF’s version of SR 11-7’s independent validation requirement. For financial institutions already running formal model validation, this is table stakes. For fintechs without dedicated model risk teams, it means commissioning external validation for your highest-risk models at least annually.


MEASURE 2: The 13 Trustworthy Characteristics

This is the core of MEASURE — where the seven trustworthy AI characteristics (valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed) get quantified and documented. MEASURE 2 has 13 subcategories:

| Subcategory | What It Requires | Output |
| --- | --- | --- |
| 2.1 | Document test sets, metrics, and TEVV tools | TEVV methodology documentation |
| 2.2 | Human subject evaluations meeting protection requirements and population representativeness | Evaluation protocol + IRB compliance (if applicable) |
| 2.3 | Performance measurement under deployment-like conditions | Pre-deployment test results |
| 2.4 | Monitor system functionality and behavior during production | Ongoing monitoring dashboard/log |
| 2.5 | Demonstrate validation and reliability, document generalizability limitations | Validation report with scope limitations |
| 2.6 | Regular safety risk evaluation, demonstrating safe operation within risk tolerance | Safety evaluation log |
| 2.7 | Document security and resilience evaluation | Adversarial testing + red team results |
| 2.8 | Examine transparency and accountability risks | Accountability documentation, audit trails |
| 2.9 | Document model explanation, validation, and output interpretation | Explainability artifacts (SHAP values, feature importance, etc.) |
| 2.10 | Examine and document privacy risks | Privacy impact assessment |
| 2.11 | Document fairness and bias evaluation results | Bias testing documentation with disaggregated results |
| 2.12 | Assess environmental impact and sustainability | Carbon/energy metrics (where material) |
| 2.13 | Evaluate TEVV metrics and process effectiveness | Meta-evaluation of the measurement program itself |

For financial services, the highest-stakes subcategories are 2.3, 2.4, 2.9, and 2.11.

MEASURE 2.3 and 2.4: Pre-Deployment vs. Production

MEASURE 2.3 covers testing under deployment-like conditions before go-live. For a credit decisioning model, this means running the model against a holdout dataset that reflects the actual demographic distribution of your expected borrower population — not a sanitized development set. The results need to be documented and reviewed before deployment approval.

MEASURE 2.4 is the continuous post-deployment monitoring requirement. This is where drift detection and continuous monitoring live. Minimum expectations: track accuracy and error rates monthly, flag deviations beyond defined thresholds, and document interventions. For high-risk models (credit, fraud, underwriting), quarterly comprehensive reviews are the baseline expectation in SR 11-7-aligned programs.
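Drift checks like this are straightforward to automate. Below is a minimal sketch of a population stability index (PSI) check against a validation-time baseline; the data, bucket count, and thresholds are illustrative, and PSI conventions vary by institution:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (expected) and a production (actual) score sample.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, cuts[0], cuts[-1])   # keep out-of-range scores in the end buckets
    e_frac = np.histogram(expected, bins=cuts)[0] / len(expected)
    a_frac = np.histogram(actual, bins=cuts)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)          # avoid log(0) on empty buckets
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(600, 50, 10_000)            # score distribution at validation time
production = rng.normal(615, 55, 10_000)          # shifted production distribution
psi = population_stability_index(baseline, production)
print(f"PSI = {psi:.3f}")                         # log monthly; alert past the defined threshold
```

Wiring the output to a defined threshold, with documented interventions when it trips, is what turns this from a metric into a MEASURE 2.4 artifact.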

MEASURE 2.9: Explainability Isn’t Optional

Regulators in financial services have been explicit: AI models used in consumer lending must produce explanations specific enough to satisfy Regulation B adverse action notice requirements. MEASURE 2.9 requires documenting your explanation methodology and validating that outputs are accurate — meaning the explanations actually reflect what the model considered.

The documentation artifact here matters. “We use SHAP values” is not documentation. “We use SHAP values, our explanation validation confirms they correlate with actual model outputs at r=0.94, and here’s our mapping from SHAP to adverse action notice language” is documentation.
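That fidelity check can be scripted. The sketch below uses a linear surrogate's additive attributions as a stand-in for SHAP values so it runs without the shap library; the validation logic, correlating reconstructed explanations with actual model outputs and documenting the coefficient, is the same either way:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))

def model_score(X):
    """Stand-in for a deployed model with a mildly nonlinear term."""
    return 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] ** 2 + 0.1 * X[:, 3]

# Additive per-feature attributions from a linear surrogate fit to the model
# (a stand-in for SHAP values; real programs would use their actual explainer).
design = np.c_[X, np.ones(len(X))]
coefs, *_ = np.linalg.lstsq(design, model_score(X), rcond=None)
attributions = X * coefs[:4]                 # one contribution per feature per row
reconstructed = attributions.sum(axis=1) + coefs[4]

# MEASURE 2.9 fidelity check: do the explanations track actual model outputs?
r = np.corrcoef(reconstructed, model_score(X))[0, 1]
print(f"explanation fidelity r = {r:.2f}")   # document the number, not just the method
```

A fidelity coefficient below whatever floor you pre-define is itself a finding: it means the explanations feeding adverse action notices don't reflect what the model actually did.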

MEASURE 2.11: Bias Testing Requirements

MEASURE 2.11 is one of the most operationally demanding subcategories. It requires documenting fairness and bias evaluation results — not just running the tests, but capturing what you found and what you did about it.

NIST identifies five harm types that bias testing should address:

| Harm Type | What It Means | Financial Services Example |
| --- | --- | --- |
| Allocational | System allocates resources or opportunities unfairly | Credit denials disparately impacting protected classes |
| Representational | System stereotypes or demeans certain groups | Marketing AI targeting based on inferred demographics |
| Quality of service | System performs differently by group | Fraud models with higher false positive rates for certain ethnicities |
| Stereotyping | System reinforces harmful generalizations | Customer service AI treating certain names differently |
| Erasure | System renders certain groups invisible | Underwriting models trained on non-representative data |

Required documentation under MEASURE 2.11:

  • Demographic parity analysis (overall approval/denial rates by protected class)
  • Equal opportunity analysis (true positive rates by protected class)
  • Disparate impact testing against the four-fifths (80%) rule threshold
  • Intersectional analysis (not just race OR gender, but race AND gender)
  • Results by geographic region if operating in multiple markets
  • Methodology for any disparities found to be statistically significant
  • Remediation actions taken, if any
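The first items on that list reduce to simple rate comparisons. A minimal sketch of the four-fifths rule and an equal opportunity gap, using illustrative counts and labels rather than real data:

```python
import numpy as np

def four_fifths_check(approvals):
    """Adverse impact ratio: lowest group approval rate / highest group approval rate.
    approvals maps group -> (n_approved, n_applicants)."""
    rates = {g: a / n for g, (a, n) in approvals.items()}
    ratio = min(rates.values()) / max(rates.values())
    return ratio, ratio >= 0.8               # four-fifths (80%) rule threshold

def equal_opportunity_gap(y_true, y_pred, group):
    """Largest spread in true positive rates across groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean() for g in np.unique(group)]
    return max(tprs) - min(tprs)

# Illustrative counts and labels, not real data
approvals = {"group_a": (620, 1000), "group_b": (455, 1000)}
ratio, passes = four_fifths_check(approvals)

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0])
group  = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
gap = equal_opportunity_gap(y_true, y_pred, group)

print(f"adverse impact ratio {ratio:.2f} (passes four-fifths: {passes}); TPR gap {gap:.2f}")
```

The documentation requirement is satisfied by the recorded numbers plus the remediation decision, not by the existence of the script; intersectional analysis means running the same comparisons on combined group labels (e.g. race-and-gender cells).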

This extends and complements what’s required under SR 11-7 for AI model validation. SR 11-7 requires ongoing monitoring; MEASURE 2.11 specifies the types of bias analysis required within that monitoring.

For detailed testing methodologies — disparate impact analysis, counterfactual fairness, calibration testing — the bias testing guide covers the statistical methods in depth.


MEASURE 3: Tracking Emergent Risks

Risks don’t stay static. A model that passed all MEASURE 2 tests at deployment may develop new failure modes as data distributions shift, as new use patterns emerge, or as the broader operating environment changes.

MEASURE 3.1 requires ongoing identification and tracking of existing and emergent risks. In practice, this means maintaining a living risk log for each AI system — distinct from the model inventory — that captures newly observed behaviors, near-misses, edge cases encountered in production, and any changes in the deployment context that could affect risk.

MEASURE 3.2 requires documenting approaches for risks that are difficult to assess with current techniques. Hallucination in large language models is the obvious current example: we know it’s a risk, but reliable measurement techniques are still maturing. Documenting “we cannot yet reliably quantify hallucination frequency; we mitigate through human-in-the-loop review for all high-stakes outputs” is the right answer.

MEASURE 3.3 requires feedback processes that let end users report problems and appeal outcomes. For consumer-facing AI in financial services, this links directly to complaint management and adverse action appeal processes. If your credit model denies someone, and they appeal, that appeal process generates data that feeds MEASURE 3.1’s emergent risk tracking.


MEASURE 4: Does Your Measurement Program Actually Work?

MEASURE 4 is the meta-level — assessing whether the measurement activities themselves are valid and effective.

MEASURE 4.1 requires that measurement approaches be informed by domain experts. For financial services AI, that means the people defining your metrics should include model risk managers, compliance officers, and business line owners — not just data scientists.

MEASURE 4.2 requires that deployment context results be informed by relevant stakeholders. If your credit model is deployed in a community bank context, community feedback on model performance should be captured and reviewed — not just internal metrics.

MEASURE 4.3 requires documenting performance improvements or declines based on field data. This is the learning loop: what did production data tell you that your pre-deployment testing didn’t? How are you incorporating that into model development and validation cycles?


MEASURE in Financial Services: Practical Implementation

Here’s a realistic 90-day build-out for a financial services team standing up MEASURE compliance for the first time:

Days 1–30: Method Selection and Documentation

  • Complete MEASURE 1.1: For each significant AI risk identified in MAP, document the measurement approach and accept/document unmeasurable risks
  • Inventory existing test sets and performance metrics; identify gaps
  • Define performance thresholds with documented course-correction triggers

Days 31–60: Build Out MEASURE 2 Documentation

  • Prioritize subcategories by regulatory exposure: start with 2.3 (pre-deployment testing), 2.4 (production monitoring), 2.9 (explainability), and 2.11 (bias testing)
  • Run initial bias evaluation using disaggregated methods; document results
  • Confirm explainability artifacts are accurate (validate explanations against actual model behavior)
  • Schedule MEASURE 1.3 independent review for any Tier 1 or Tier 2 AI models

Days 61–90: Implement MEASURE 3 and 4

  • Stand up the AI risk incident log for MEASURE 3.1
  • Document limitations for risks in MEASURE 3.2 (hallucination, complex GenAI behaviors)
  • Ensure complaint/feedback channels feed MEASURE 3.3
  • Run a meta-evaluation of your measurement program against MEASURE 4.1–4.3 criteria

The “So What” for Your Exam

Regulators aren’t yet citing NIST AI RMF subcategory numbers in examination findings. But they’re asking questions that map directly to MEASURE:

  • “What testing did you do before this model went live?” → MEASURE 2.3
  • “How are you monitoring this model in production?” → MEASURE 2.4
  • “Can you explain the output for this specific customer?” → MEASURE 2.9
  • “Did you test for disparate impact?” → MEASURE 2.11
  • “How do you know your monitoring metrics are the right ones?” → MEASURE 1.1 and 4.3

The organizations that can answer all five of those questions with documented artifacts are the ones that come out of model risk exams with findings marked “informational” rather than MRAs.

The AI Risk Assessment Template includes a model monitoring dashboard template and bias assessment guide built around SR 11-7 and NIST AI RMF alignment — if you need a starting point for the measurement artifacts MEASURE requires.


FAQ

What’s the connection between NIST AI RMF MEASURE and NIST AI 600-1?

NIST AI 600-1 (the Generative AI Profile) adds specificity to MEASURE for GenAI systems. It introduces GenAI-specific risk categories — confabulation, data privacy, homogenization, value chain impacts — that require measurement approaches beyond what MEASURE 2’s general subcategories cover. If you’re deploying any LLM-based systems, you’ll need AI 600-1 alongside the core MEASURE playbook.

How often should we run MEASURE 2.11 bias evaluations?

At minimum annually for all AI models used in consumer decision-making. For high-risk models (credit underwriting, fraud scoring with consumer impact, collection prioritization), quarterly is the defensible frequency given SR 11-7’s model monitoring expectations. For models with known demographic sensitivity, some organizations run monthly disaggregated evaluation as a condition of continued deployment approval.

Do we need to document MEASURE 2.12 (environmental impact) for financial services AI?

Currently no financial services regulator in the US requires environmental impact assessment for AI systems specifically. But MEASURE 2.12 is still worth including in documentation for larger-scale GenAI deployments — particularly if your institution has public ESG commitments. Documenting it demonstrates comprehensive governance even where it’s not yet required.

Can we use our existing model validation process to satisfy MEASURE?

Partially. SR 11-7 model validation covers the core of MEASURE 2.1–2.5. But most existing SR 11-7 validation processes don’t systematically address MEASURE 2.10 (privacy), 2.11 (bias with the NIST-specified harm categories), or 2.12 (environmental impact). If you’re using an SR 11-7 framework as your baseline, layer MEASURE 2.10–2.12 on top of it rather than rebuilding from scratch.

What should we do when a MEASURE evaluation produces results outside acceptable thresholds?

MEASURE doesn’t prescribe the response — that’s MANAGE’s job. But MEASURE 2.6 requires safety evaluation documentation that includes course-correction suggestions for when systems exceed acceptable limits. In practice: pre-define your escalation triggers (e.g., “if demographic parity drops below 0.80, trigger independent review”), document them in your TEVV methodology, and make sure the MANAGE function has a response protocol ready before the threshold is breached, not after.
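Pre-defining triggers can be as simple as a declarative table that both the MEASURE and MANAGE processes read from. A hypothetical sketch (metric names, thresholds, and actions are illustrative):

```python
# Hypothetical escalation triggers, pre-defined in the TEVV methodology so that
# MANAGE has a response protocol before any threshold is breached.
TRIGGERS = {
    "demographic_parity_ratio": {"min": 0.80, "action": "independent fairness review"},
    "monthly_accuracy":         {"min": 0.92, "action": "model risk escalation"},
    "psi":                      {"max": 0.25, "action": "drift investigation"},
}

def escalations(metrics):
    """Return the pre-agreed actions fired by the current metric snapshot."""
    fired = []
    for name, rule in TRIGGERS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if ("min" in rule and value < rule["min"]) or ("max" in rule and value > rule["max"]):
            fired.append(rule["action"])
    return fired

print(escalations({"demographic_parity_ratio": 0.78, "monthly_accuracy": 0.95, "psi": 0.10}))
# → ['independent fairness review']
```

Keeping the thresholds in one declarative table means the TEVV documentation and the production alerting can never silently diverge.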

Frequently Asked Questions

What is the MEASURE function in the NIST AI RMF?

MEASURE is the third function in the NIST AI RMF (after GOVERN and MAP). It applies quantitative, qualitative, or mixed-method tools to analyze, assess, benchmark, and monitor AI risk. MEASURE has four main categories: selecting appropriate methods (1), evaluating trustworthy AI characteristics across 13 subcategories (2), tracking emergent risks (3), and feeding measurement results back into the program (4).

What is TEVV in the NIST AI RMF?

TEVV stands for Test, Evaluation, Validation, and Verification. Under the NIST AI RMF MEASURE function, TEVV describes the structured process of testing AI system behavior before and after deployment: Testing (structured pre-deployment examination), Evaluation (ongoing performance assessment against baseline), Validation (confirming the system is appropriate for its intended purpose), and Verification (confirming the system meets its technical specifications).

What does NIST AI RMF MEASURE 2.11 require for bias testing?

MEASURE 2.11 requires organizations to document fairness and bias evaluation results. This includes identifying harm types (allocational, representational, quality of service, stereotyping, and erasure harms), analyzing disparities across and within demographic groups including intersecting groups, using pre-processing data transformations, in-processing model adjustments, and post-processing techniques as appropriate, and employing disaggregated evaluation methods by race, age, gender, ethnicity, ability, and region.

How does the NIST AI RMF MEASURE function relate to SR 11-7 model validation?

MEASURE aligns closely with SR 11-7 model validation requirements. MEASURE 2.1–2.5 map directly to SR 11-7's conceptual soundness, ongoing monitoring, and outcomes analysis requirements. MEASURE 2.11 (bias testing) extends SR 11-7 for AI-specific fairness concerns. SR 11-7 requires independent validation; MEASURE 1.3 explicitly requires independent expert involvement. Together they form a defensible model risk governance structure.

What metrics should financial services firms track under the NIST AI RMF MEASURE function?

Financial services firms should track: accuracy, false positive/negative rates, and prediction latency (performance); demographic parity and equal opportunity rates by protected class (fairness); distribution shift and population stability index (drift); feature importance stability over time (model integrity); explainability coverage (what percent of decisions can be explained at required granularity); and incident counts by severity tier. Thresholds for each should be defined during MAP and documented with course-correction triggers.

What is the difference between MEASURE 2.3 and MEASURE 2.4 in the NIST AI RMF?

MEASURE 2.3 covers qualitative or quantitative performance measurement conducted under deployment-like conditions — essentially pre-deployment testing with realistic inputs. MEASURE 2.4 covers monitoring of actual system functionality and behavior during production — post-deployment surveillance once the model is live. Both are required; 2.3 happens before go-live, 2.4 is continuous afterward.
Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.

Related Framework

AI Risk Assessment Template & Guide

Comprehensive AI model governance and risk assessment templates for financial services teams.
