NIST AI RMF MEASURE Function: TEVV, Bias Testing, and Metrics That Actually Matter
TL;DR:
- MEASURE is the NIST AI RMF’s “show your work” function — where MAP’s risk context becomes quantified evidence
- TEVV (Test, Evaluation, Validation, Verification) is the structured methodology that powers MEASURE, covering everything from pre-deployment testing to continuous production monitoring
- MEASURE 2 has 13 subcategories covering all seven trustworthy AI characteristics — each requires documented outputs, not just processes
- For financial services, MEASURE maps directly to SR 11-7 model validation — but it extends it for bias, privacy, and environmental risk in ways SR 11-7 doesn’t address
Most AI governance teams treat the NIST AI RMF like a compliance checklist. They build an inventory (GOVERN), do a risk classification (MAP), and then move on — skipping MEASURE because “we have model validation.”
That’s a problem. MEASURE is where governance stops being theoretical. It’s where you move from “we identified these AI risks during MAP” to “here’s our quantified evidence that those risks are within tolerance.” Without MEASURE, you have a risk program built on declarations. With it, you have one built on evidence.
Regulators increasingly know the difference.
What MEASURE Is — and Where It Sits
The NIST AI RMF has four functions: GOVERN (culture, accountability, policy), MAP (context framing, risk classification), MEASURE (quantification and analysis), and MANAGE (treatment, response, recovery). They’re not a linear sequence — the framework is designed as a continuous loop. But understanding the sequence helps.
MAP creates the context: what is this AI system’s purpose, who uses it, what are the potential harms, and which risks are categorized as significant? MEASURE takes that context and asks: how do we know if those risks are actually within tolerance? What evidence would convince a skeptical examiner?
The GOVERN function establishes the accountability structures and policies. MEASURE operationalizes them.
MEASURE is built around four main categories:
- MEASURE 1 — Selecting appropriate methods and metrics for identified risks
- MEASURE 2 — Evaluating all seven trustworthy AI characteristics across 13 subcategories
- MEASURE 3 — Tracking emergent and ongoing risks
- MEASURE 4 — Feeding measurement efficacy back into the program
TEVV: The Engine of MEASURE
TEVV stands for Test, Evaluation, Validation, and Verification. It’s the structured methodology NIST uses to operationalize MEASURE, and it runs across all four categories.
Here’s how each component maps to AI governance activities:
| TEVV Component | When It Happens | What It Covers |
|---|---|---|
| Testing | Pre-deployment | Structured examination of model behavior using defined test sets, adversarial inputs, and edge cases |
| Evaluation | Ongoing (pre + post) | Performance assessment against baseline metrics, benchmark comparisons, and demographic analysis |
| Validation | Pre-deployment + major changes | Confirming the model is appropriate for its intended purpose and deployment context |
| Verification | Pre-deployment | Confirming the model meets its technical specifications and design requirements |
In financial services, TEVV maps onto the SR 11-7 model validation lifecycle: conceptual soundness review (Validation), outcome analysis (Evaluation), sensitivity analysis (Testing), and ongoing monitoring (Evaluation + Testing continuously). But TEVV extends SR 11-7 in two important directions: it explicitly covers bias evaluation (which SR 11-7 addresses only implicitly) and it includes environmental impact assessment (which SR 11-7 doesn’t address at all).
MEASURE 2.1 requires organizations to document the specific test sets, metrics, and TEVV tools they use. This is the artifact that examiners will ask for.
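If you want that artifact structured rather than buried in a narrative document, a minimal sketch might look like the following. The field names and example values are illustrative assumptions, not a NIST-prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class TEVVRecord:
    """Illustrative MEASURE 2.1 documentation record for one model."""
    model_id: str
    test_sets: list       # named test/holdout sets, with versions
    metrics: dict         # metric name -> documented acceptance threshold
    tevv_tools: list      # libraries or platforms used in evaluation
    last_reviewed: str    # ISO date of the last methodology review

record = TEVVRecord(
    model_id="credit-score-v3",   # hypothetical model identifier
    test_sets=["holdout_2024Q4_v2", "adversarial_edge_cases_v1"],
    metrics={"auc": 0.78, "ks_statistic": 0.35, "disparate_impact_ratio": 0.80},
    tevv_tools=["scikit-learn", "internal fairness dashboard"],
    last_reviewed="2025-06-30",
)
print(record)
```

The point is less the format than the fact that every model has one of these, versioned, and producible on request.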
MEASURE 1: Choosing Methods That Match the Risk
Before any evaluation starts, you have to select appropriate measurement approaches for the risks you identified in MAP. That sounds obvious, but it’s where many programs go wrong.
MEASURE 1.1 requires selecting measurement approaches for significant AI risks — and documenting which risks cannot yet be measured with available techniques. That second part matters. “We identified this risk but have no reliable way to quantify it currently” is an acceptable answer in NIST AI RMF, as long as it’s documented. An undocumented unmeasurable risk is a governance gap. A documented one is a limitation disclosure.
MEASURE 1.2 requires regularly assessing whether your chosen metrics are still appropriate and whether controls are still effective. A metric that was valid at deployment may stop being valid as the model’s operating context changes. This feeds directly into your model risk tiering review cycle.
MEASURE 1.3 requires involvement of independent experts and domain specialists in evaluations — particularly for high-risk AI systems. This is the NIST AI RMF’s version of SR 11-7’s independent validation requirement. For financial institutions already running formal model validation, this is table stakes. For fintechs without dedicated model risk teams, it means commissioning external validation for your highest-risk models at least annually.
MEASURE 2: The 13 Trustworthy Characteristics
This is the core of MEASURE — where the seven trustworthy AI characteristics (valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed) get quantified and documented. MEASURE 2 has 13 subcategories:
| Subcategory | What It Requires | Output |
|---|---|---|
| 2.1 | Document test sets, metrics, and TEVV tools | TEVV methodology documentation |
| 2.2 | Human subject evaluations meeting protection requirements and population representativeness | Evaluation protocol + IRB compliance (if applicable) |
| 2.3 | Performance measurement under deployment-like conditions | Pre-deployment test results |
| 2.4 | Monitor system functionality and behavior during production | Ongoing monitoring dashboard/log |
| 2.5 | Demonstrate validation and reliability, document generalizability limitations | Validation report with scope limitations |
| 2.6 | Regular safety risk evaluation, demonstrating safe operation within risk tolerance | Safety evaluation log |
| 2.7 | Document security and resilience evaluation | Adversarial testing + red team results |
| 2.8 | Examine transparency and accountability risks | Accountability documentation, audit trails |
| 2.9 | Document model explanation, validation, and output interpretation | Explainability artifacts (SHAP values, feature importance, etc.) |
| 2.10 | Examine and document privacy risks | Privacy impact assessment |
| 2.11 | Document fairness and bias evaluation results | Bias testing documentation with disaggregated results |
| 2.12 | Assess environmental impact and sustainability | Carbon/energy metrics (where material) |
| 2.13 | Evaluate TEVV metrics and process effectiveness | Meta-evaluation of the measurement program itself |
For financial services, the highest-stakes subcategories are 2.3, 2.4, 2.9, and 2.11.
MEASURE 2.3 and 2.4: Pre-Deployment vs. Production
MEASURE 2.3 covers testing under deployment-like conditions before go-live. For a credit decisioning model, this means running the model against a holdout dataset that reflects the actual demographic distribution of your expected borrower population — not a sanitized development set. The results need to be documented and reviewed before deployment approval.
MEASURE 2.4 is the continuous post-deployment monitoring requirement. This is where drift detection and continuous monitoring live. Minimum expectations: track accuracy and error rates monthly, flag deviations beyond defined thresholds, and document interventions. For high-risk models (credit, fraud, underwriting), quarterly comprehensive reviews are the baseline expectation in SR 11-7-aligned programs.
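Here's a minimal sketch of the threshold-check logic behind that kind of monitoring. The metric names and threshold values are illustrative assumptions; your actual limits come from your own baseline validation, not from the RMF.

```python
# Compare the latest monthly production metrics against documented
# thresholds and flag anything that breaches them (MEASURE 2.4 sketch).
thresholds = {
    "auc": {"min": 0.74},                          # hypothetical floor from validation
    "false_positive_rate": {"max": 0.08},
    "population_stability_index": {"max": 0.25},   # common drift heuristic
}

latest = {"auc": 0.71, "false_positive_rate": 0.06, "population_stability_index": 0.31}

def check_thresholds(metrics: dict, limits: dict) -> list[str]:
    breaches = []
    for name, value in metrics.items():
        rule = limits.get(name, {})
        if "min" in rule and value < rule["min"]:
            breaches.append(f"{name}={value} below floor {rule['min']}")
        if "max" in rule and value > rule["max"]:
            breaches.append(f"{name}={value} above ceiling {rule['max']}")
    return breaches

for breach in check_thresholds(latest, thresholds):
    print("FLAG:", breach)   # in practice: log, alert, and open a finding
```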
MEASURE 2.9: Explainability Isn’t Optional
Regulators in financial services have been explicit: AI models used in consumer lending must produce explanations specific enough to satisfy Regulation B adverse action notice requirements. MEASURE 2.9 requires documenting your explanation methodology and validating that outputs are accurate — meaning the explanations actually reflect what the model considered.
The documentation artifact here matters. “We use SHAP values” is not documentation. “We use SHAP values, our explanation validation confirms they correlate with actual model outputs at r=0.94, and here’s our mapping from SHAP to adverse action notice language” is documentation.
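As a hedged illustration of that mapping step, the sketch below assumes per-applicant attributions have already been computed (with SHAP or a similar method) and that compliance has reviewed the feature-to-reason-code language. The feature names and reason wording are invented for the example.

```python
# Map the most negative per-applicant attributions to adverse action
# reason language (illustrative; assumes attributions already computed).
attributions = {    # feature -> contribution to this applicant's score
    "debt_to_income": -0.42,
    "months_since_delinquency": -0.18,
    "credit_utilization": -0.11,
    "tenure_at_address": 0.05,
}

reason_language = {   # hypothetical, compliance-reviewed mapping
    "debt_to_income": "Income insufficient relative to obligations",
    "months_since_delinquency": "Recent delinquency on an account",
    "credit_utilization": "Proportion of revolving balances to limits too high",
}

top_negative = sorted(
    (f for f in attributions if attributions[f] < 0),
    key=lambda f: attributions[f],
)[:4]   # Reg B practice commonly cites up to four principal reasons

for feature in top_negative:
    print(reason_language.get(feature, f"[no approved language for {feature}]"))
```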
MEASURE 2.11: Bias Testing Requirements
MEASURE 2.11 is one of the most operationally demanding subcategories. It requires documenting fairness and bias evaluation results — not just running the tests, but capturing what you found and what you did about it.
NIST identifies five harm types that bias testing should address:
| Harm Type | What It Means | Financial Services Example |
|---|---|---|
| Allocational | System allocates resources or opportunities unfairly | Credit denials disparately impacting protected classes |
| Representational | System stereotypes or demeans certain groups | Marketing AI targeting based on inferred demographics |
| Quality of service | System performs differently by group | Fraud models with higher false positive rates for certain ethnicities |
| Stereotyping | System reinforces harmful generalizations | Customer service AI treating certain names differently |
| Erasure | System renders certain groups invisible | Underwriting models trained on non-representative data |
Required documentation under MEASURE 2.11:
- Demographic parity analysis (overall approval/denial rates by protected class)
- Equal opportunity analysis (true positive rates by protected class)
- Disparate impact testing against the four-fifths (80%) rule threshold (see the sketch after this list)
- Intersectional analysis (not just race OR gender, but race AND gender)
- Results by geographic region if operating in multiple markets
- Statistical methodology and results for determining whether observed disparities are significant
- Remediation actions taken, if any
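A minimal sketch of the first three items (selection rates by group, the impact ratio, and the four-fifths check) using made-up counts. Equal opportunity analysis follows the same pattern on true positive rates, and intersectional analysis repeats it over group combinations rather than single attributes.

```python
# Four-fifths (80%) rule check on approval rates by group (illustrative counts).
approvals = {"group_a": (820, 1000), "group_b": (560, 900)}  # (approved, applicants)

rates = {g: approved / total for g, (approved, total) in approvals.items()}
reference = max(rates, key=rates.get)   # highest-selection-rate group

for group, rate in rates.items():
    ratio = rate / rates[reference]
    flag = "REVIEW" if ratio < 0.80 else "ok"
    print(f"{group}: selection rate {rate:.2%}, impact ratio {ratio:.2f} [{flag}]")
```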
This extends and complements what’s required under SR 11-7 for AI model validation. SR 11-7 requires ongoing monitoring; MEASURE 2.11 specifies the types of bias analysis required within that monitoring.
For detailed testing methodologies — disparate impact analysis, counterfactual fairness, calibration testing — the bias testing guide covers the statistical methods in depth.
MEASURE 3: Tracking Emergent Risks
Risks don’t stay static. A model that passed all MEASURE 2 tests at deployment may develop new failure modes as data distributions shift, as new use patterns emerge, or as the broader operating environment changes.
MEASURE 3.1 requires ongoing identification and tracking of existing and emergent risks. In practice, this means maintaining a living risk log for each AI system — distinct from the model inventory — that captures newly observed behaviors, near-misses, edge cases encountered in production, and any changes in the deployment context that could affect risk.
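One way to keep that log queryable rather than freeform is a lightweight entry structure like the sketch below. The fields are an illustrative assumption, not a NIST-specified schema.

```python
from dataclasses import dataclass

@dataclass
class EmergentRiskEntry:
    """Illustrative MEASURE 3.1 risk-log entry, kept separate from the model inventory."""
    system_id: str
    observed: str        # ISO date the behavior was first observed
    description: str     # what was seen: new behavior, near-miss, edge case
    source: str          # e.g. production monitoring, complaint, appeal, red team
    potential_harm: str  # which trustworthy characteristic it threatens
    status: str          # open / under review / escalated to MANAGE / closed

entry = EmergentRiskEntry(
    system_id="collections-prioritizer-v2",   # hypothetical system
    observed="2025-08-14",
    description="Spike in agent overrides for accounts in one region",
    source="production monitoring",
    potential_harm="fair with harmful bias managed",
    status="under review",
)
```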
MEASURE 3.2 requires documenting approaches for risks that are difficult to assess with current techniques. Hallucination in large language models is the obvious current example: we know it’s a risk, but reliable measurement techniques are still maturing. Documenting “we cannot yet reliably quantify hallucination frequency; we mitigate through human-in-the-loop review for all high-stakes outputs” is the right answer.
MEASURE 3.3 requires feedback processes that let end users report problems and appeal outcomes. For consumer-facing AI in financial services, this links directly to complaint management and adverse action appeal processes. If your credit model denies someone, and they appeal, that appeal process generates data that feeds MEASURE 3.1’s emergent risk tracking.
MEASURE 4: Does Your Measurement Program Actually Work?
MEASURE 4 is the meta-level — assessing whether the measurement activities themselves are valid and effective.
MEASURE 4.1 requires that measurement approaches be informed by domain experts. For financial services AI, that means the people defining your metrics should include model risk managers, compliance officers, and business line owners — not just data scientists.
MEASURE 4.2 requires that deployment context results be informed by relevant stakeholders. If your credit model is deployed in a community bank context, community feedback on model performance should be captured and reviewed — not just internal metrics.
MEASURE 4.3 requires documenting performance improvements or declines based on field data. This is the learning loop: what did production data tell you that your pre-deployment testing didn’t? How are you incorporating that into model development and validation cycles?
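A small sketch of that comparison, assuming you track the same metrics at validation time and in the field. The metric names and the materiality cut-off are illustrative assumptions, not prescribed values.

```python
# MEASURE 4.3 sketch: compare validation-time metrics to field metrics
# and note material movements for the next validation cycle.
validation_metrics = {"auc": 0.79, "approval_rate": 0.41, "override_rate": 0.03}
field_metrics      = {"auc": 0.75, "approval_rate": 0.36, "override_rate": 0.07}

MATERIAL_CHANGE = 0.03   # illustrative cut-off; set per model risk tier

for name, baseline in validation_metrics.items():
    delta = field_metrics[name] - baseline
    if abs(delta) >= MATERIAL_CHANGE:
        print(f"{name}: moved {delta:+.2f} vs. validation: document and feed back into the next cycle")
```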
MEASURE in Financial Services: Practical Implementation
Here’s a realistic 90-day build-out for a financial services team standing up MEASURE compliance for the first time:
Days 1–30: Method Selection and Documentation
- Complete MEASURE 1.1: For each significant AI risk identified in MAP, document the measurement approach, or record the risk as not yet measurable with available techniques
- Inventory existing test sets and performance metrics; identify gaps
- Define performance thresholds with documented course-correction triggers
Days 31–60: Build Out MEASURE 2 Documentation
- Prioritize subcategories by regulatory exposure: start with 2.3 (pre-deployment testing), 2.4 (production monitoring), 2.9 (explainability), and 2.11 (bias testing)
- Run initial bias evaluation using disaggregated methods; document results
- Confirm explainability artifacts are accurate (validate explanations against actual model behavior)
- Schedule MEASURE 1.3 independent review for any Tier 1 or Tier 2 AI models
Days 61–90: Implement MEASURE 3 and 4
- Stand up the AI risk incident log for MEASURE 3.1
- Document limitations for risks in MEASURE 3.2 (hallucination, complex GenAI behaviors)
- Ensure complaint/feedback channels feed MEASURE 3.3
- Run a meta-evaluation of your measurement program against MEASURE 4.1–4.3 criteria
The “So What” for Your Exam
Regulators aren’t yet citing NIST AI RMF subcategory numbers in examination findings. But they’re asking questions that map directly to MEASURE:
- “What testing did you do before this model went live?” → MEASURE 2.3
- “How are you monitoring this model in production?” → MEASURE 2.4
- “Can you explain the output for this specific customer?” → MEASURE 2.9
- “Did you test for disparate impact?” → MEASURE 2.11
- “How do you know your monitoring metrics are the right ones?” → MEASURE 1.1 and 4.3
The organizations that can answer all five of those questions with documented artifacts are the ones that come out of model risk exams with findings marked “informational” rather than MRAs.
The AI Risk Assessment Template includes a model monitoring dashboard template and bias assessment guide built around SR 11-7 and NIST AI RMF alignment — if you need a starting point for the measurement artifacts MEASURE requires.
FAQ
What’s the connection between NIST AI RMF MEASURE and NIST AI 600-1?
NIST AI 600-1 (the Generative AI Profile) adds specificity to MEASURE for GenAI systems. It introduces GenAI-specific risk categories — confabulation, data privacy, homogenization, value chain impacts — that require measurement approaches beyond what MEASURE 2’s general subcategories cover. If you’re deploying any LLM-based systems, you’ll need AI 600-1 alongside the core MEASURE playbook.
How often should we run MEASURE 2.11 bias evaluations?
At minimum annually for all AI models used in consumer decision-making. For high-risk models (credit underwriting, fraud scoring with consumer impact, collection prioritization), quarterly is the defensible frequency given SR 11-7’s model monitoring expectations. For models with known demographic sensitivity, some organizations run monthly disaggregated evaluation as a condition of continued deployment approval.
Do we need to document MEASURE 2.12 (environmental impact) for financial services AI?
Currently no financial services regulator in the US requires environmental impact assessment for AI systems specifically. But MEASURE 2.12 is still worth including in documentation for larger-scale GenAI deployments — particularly if your institution has public ESG commitments. Documenting it demonstrates comprehensive governance even where it’s not yet required.
Can we use our existing model validation process to satisfy MEASURE?
Partially. SR 11-7 model validation covers the core of MEASURE 2.1–2.5. But most existing SR 11-7 validation processes don’t systematically address MEASURE 2.10 (privacy), 2.11 (bias with the NIST-specified harm categories), or 2.12 (environmental impact). If you’re using an SR 11-7 framework as your baseline, layer MEASURE 2.10–2.12 on top of it rather than rebuilding from scratch.
What should we do when a MEASURE evaluation produces results outside acceptable thresholds?
MEASURE doesn’t prescribe the response — that’s MANAGE’s job. But MEASURE 2.6 requires safety evaluation documentation that includes course-correction suggestions for when systems exceed acceptable limits. In practice: pre-define your escalation triggers (e.g., “if demographic parity drops below 0.80, trigger independent review”), document them in your TEVV methodology, and make sure the MANAGE function has a response protocol ready before the threshold is breached, not after.
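One way to write those triggers down so they're unambiguous for MANAGE is a machine-checkable list like the sketch below, where the metrics, thresholds, responses, and owners are all hypothetical.

```python
# Pre-defined escalation triggers (illustrative), documented in the TEVV
# methodology so the MANAGE response is agreed before a breach happens.
escalation_triggers = [
    {"metric": "disparate_impact_ratio", "operator": "<", "threshold": 0.80,
     "response": "trigger independent fair lending review", "owner": "model risk"},
    {"metric": "auc", "operator": "<", "threshold": 0.74,
     "response": "pause model use pending revalidation", "owner": "model owner"},
]

def breached(value: float, operator: str, threshold: float) -> bool:
    return value < threshold if operator == "<" else value > threshold

observed = {"disparate_impact_ratio": 0.77, "auc": 0.76}
for trigger in escalation_triggers:
    if breached(observed[trigger["metric"]], trigger["operator"], trigger["threshold"]):
        print(f"ESCALATE: {trigger['metric']} breach -> {trigger['response']} ({trigger['owner']})")
```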
Related Template
AI Risk Assessment Template & Guide
Comprehensive AI model governance and risk assessment templates for financial services teams.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.