
Independent Evaluations

March 27, 2024
Earned Trust through AI System Assurance

Self-assessments (including impact or risk assessments) offer a different value proposition from independent evaluations, including audits. Both are important.226 Self-assessments will often be the starting point for independent evaluations.

Many commenters thought that entities developing and deploying AI should conduct self-assessments, ideally working from the NIST AI RMF.227 An entity’s own assessment of the trustworthiness of AI systems (in development or deployment) benefits from its access to relevant material.228 Moreover, internal evaluation practices will tend to improve management of AI risks by measuring practices against “established protocols designed to support an AI system’s trustworthiness.”229 The degree to which internal evaluations move the needle on AI system performance and impacts depends on how those evaluations are communicated within the AI actor entity and the extent to which management acts on them.

As a practical matter, internal evaluations are currently more mature and robust than independent evaluations, making them appropriate for many AI actors.230 According to one commenter, “combining AI assessments into existing accountability structures where possible has many advantages and should likely be the default model.”231

That said, self-assessments are unlikely to be sufficient. Independent evaluations have proven necessary in other domains and provide essential checks on management’s own assessments. Internal evaluations are often not made public; indeed, pressure on firms to open themselves to external scrutiny may well be counterproductive to the goal of rigorous self-examination.232 But entities evaluating themselves may be more forgiving than external evaluators. As one commenter posited, “[a]llowing developers to certify their own software is a clear conflict of interest.”233 Independence is crucial to sustain public trust in the accuracy and integrity of evaluation results and is foundational to auditing in other fields.234

There are many good reasons to push for independent evaluations, as well as a number of obstacles. Independent evaluations styled as audits will require audit and auditor criteria. To the extent that auditors could be held liable for false assurance, as they are in the financial sector, one commenter thought that audits of AI systems should hew as closely as possible to a binary yes-no inquiry.235 In the absence of consensus standards, the process may take the form of a multi-factored analysis.236 In either case, but especially in a multi-factored evaluation, disclosure of audit scope and methodology is critical to enable comprehension, comparison, and credibility.237 Transparency around the audit inquiry is all the more important when benchmarks are varied and not standardized, and when audits are diverse in scope and method.
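Because audit scope, methodology, and the form of findings (binary or multi-factored) vary, a structured disclosure can make results easier to comprehend and compare. The following sketch is illustrative only: the record format, field names, and example values are assumptions offered for discussion, not a format prescribed by this report or by any standard.

```python
# Minimal sketch of an audit disclosure record (illustrative; all field
# names and values are assumptions, not a prescribed or standard format).
from dataclasses import dataclass, field
from enum import Enum


class FindingForm(Enum):
    BINARY = "pass-fail"        # yes-no inquiry against consensus criteria
    MULTI_FACTORED = "graded"   # multi-factored analysis where no consensus standard exists


@dataclass
class AuditDisclosure:
    system_name: str
    audit_scope: str            # which components, claims, and deployment contexts were examined
    methodology: str            # how evidence was gathered and evaluated
    benchmarks: list = field(default_factory=list)  # benchmarks or criteria applied, if any
    finding_form: FindingForm = FindingForm.MULTI_FACTORED
    findings_summary: str = ""  # plain-language summary to aid comprehension and comparison


# Hypothetical example of how such a record might be populated.
disclosure = AuditDisclosure(
    system_name="example-screening-model",
    audit_scope="Claims about accuracy and bias in the intended deployment context",
    methodology="Document review, red-team testing, and statistical performance measurement",
    benchmarks=["internal fairness benchmark (non-standardized)"],
    finding_form=FindingForm.MULTI_FACTORED,
    findings_summary="Performance claims substantiated; two residual risks flagged for remediation",
)
```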

Based on our review of the record and the relevant literature, we think that the following should be part of an audit, although these recommendations are by no means exhaustive. The first element stands alone for audits fashioned as claim validation or substantiation exercises. Most of the elements below align with action items contained in the NIST AI RMF Playbook.238

Is the system fit for purpose in its intended, likely, or actual deployment context? Are the processes, controls, and performance of the system as claimed?

Has the system mitigated risks to a sufficient degree according to independent evaluators and/or appropriate benchmarks?

Is the data used in the system’s design, development, training, testing, and operation:

  • Of adequate provenance and quality;
  • Of adequate relevance and breadth; and
  • Governed by adequate data governance standards?

Are there adequate controls in the entity developing or deploying the system:

  • To ensure worker, consumer, community and other stakeholder perspectives were adequately solicited and incorporated in the development, deployment, post-deployment review, and/or modification process;
  • To ensure periodic monitoring and review of the system’s operation;
  • To ensure adequate remediation of any new risks; and
  • To ensure that there is internal review by a sufficiently empowered decisionmaker not directly involved in the system’s development or operation?

Was there appropriate and sufficient documentation throughout the lifecycle of the AI system and its components to enable an evaluator to answer the previous questions?

Has the developer or deployer made sufficient disclosure about the use of AI, and about training data, system characteristics, outputs, and limitations, to stakeholders, including in plain language?

Is the AI system sufficiently interpretable and explainable that stakeholders can interrogate whether its outputs are justified?

Is the developer or deployer adequately contributing to an adverse incident database?
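To make these elements easier to apply consistently across audits, they could also be expressed as a machine-readable checklist. The sketch below simply restates the questions above in a simple data structure; the groupings, key names, and helper function are assumptions for illustration, not a format this report prescribes.

```python
# Illustrative checklist mirroring the audit elements above; the structure
# and key names are assumptions, not a prescribed format.
AUDIT_CHECKLIST = {
    "fitness_and_claims": [
        "Is the system fit for purpose in its intended, likely, or actual deployment context?",
        "Are the processes, controls, and performance of the system as claimed?",
    ],
    "risk_mitigation": [
        "Has the system mitigated risks sufficiently, according to independent evaluators and/or appropriate benchmarks?",
    ],
    "data": [
        "Is the data of adequate provenance and quality?",
        "Is the data of adequate relevance and breadth?",
        "Is the data governed by adequate data governance standards?",
    ],
    "entity_controls": [
        "Were worker, consumer, community, and other stakeholder perspectives adequately solicited and incorporated?",
        "Is there periodic monitoring and review of the system's operation?",
        "Are newly identified risks adequately remediated?",
        "Is there internal review by an empowered decisionmaker not directly involved in the system's development or operation?",
    ],
    "documentation_and_disclosure": [
        "Was there appropriate and sufficient documentation throughout the AI system lifecycle?",
        "Has the developer or deployer made sufficient, plain-language disclosure about the use of AI, training data, system characteristics, outputs, and limitations?",
        "Is the system sufficiently interpretable and explainable that stakeholders can interrogate whether its outputs are justified?",
        "Is the developer or deployer adequately contributing to an adverse incident database?",
    ],
}


def open_items(responses):
    """Return checklist questions not yet answered affirmatively.

    `responses` maps question text to True/False audit findings.
    """
    return [
        question
        for questions in AUDIT_CHECKLIST.values()
        for question in questions
        if not responses.get(question, False)
    ]
```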

 


226 See, e.g., Holistic AI Comment at 4 (“While certifications function as public-facing documentation on, for example, a system’s level of reliability and thus safety, internal assessments help to improve a system at the R&D level, directly guiding better decision-making and best practices across the conceptualization, design, development, and management and monitoring of a system”); Id. at 5 (“[I]nternal assessments of performance according to clearly delineated criteria are necessary for internal purposes as much as for providing the documentation trail (e.g. logs, databases, registers) of evidence of system performance for external independent and impartial auditing”); Responsible AI Institute Comment at 4-5 (table showing tradeoffs among different types of evaluations).

227 See, e.g., IBM Comment at 3 (“All entities deploying an AI system should conduct an initial high-level assessment of the technology’s potential for harm. Such assessments should be based on the intended use-case application(s), the number and context of end-user(s) making use of the technology, how reliant the end-user would be on the technology, and the level of automation. … For those high-risk use cases, the assessment processes should be documented in detail, be auditable, and retained for a minimum period of time.”); Microsoft Comment at 5 (“In the context of accountability, the NIST AI RMF also highlights the value of two important practices for high-risk AI systems: impact assessments and red-teaming. Impact assessments have demonstrated value in a range of domains, including data protection, human rights, and environmental sustainability, as a tool for accountability.”); Workday Comment at 1.

228 Toby Shevlane et al., Model evaluation for extreme risks, arXiv (May 24, 2023), at 6. See also ARC Comment at 5 (Internal evaluations are necessary when entities cannot easily or securely provide sufficient access, but then “it is critical that AI labs conducting internal audits state publicly what dangerous capabilities they are evaluating their AI models for, how they are conducting those evaluations, and what actions they would take if they found that their AI models exhibited dangerous capabilities.”).

229 See PWC Comment at A3. See also Holistic AI Comment at 5 (“[I]nternal assessments of performance according to clearly delineated criteria are necessary for internal purposes as much as for providing the documentation trail (e.g. logs, databases, registers) of evidence of system performance for external independent and impartial auditing”); Responsible AI Institute Comment at 4 (certifications, audits, and assessments promote trust by enabling verification and can change internal processes).

230 For comments discussing the readiness of internal assessments vs. the immaturity of external assessment standards, see Information Technology Industry Council (ITI) Comment at 4-5; TechNet Comment at 3; BSA | The Software Alliance Comment at 2; Workday Comment at 1; U.S. Chamber of Commerce Comment at 2.

231 DLA Piper Comment at 9.

232 BSA | The Software Alliance Comment at 4 (noting that mandating public disclosure of internal assessments would change incentives for firms “and result in less thorough examinations that do not surface as many issues”); American Property Casualty Insurance Association Comment at 3 (public disclosure of internal assessments can inhibit full review).

233 IEEE Comment at 3-4.

234 Trail of Bits Comment at 2.

235 See, e.g., ForHumanity Comment at 7.

236 See, e.g., Global Partners Digital Comment at 4 (“HRIA methodologies must be adapted to best fit the needs of external stakeholders and must be responsive to the specific contexts” and “human rights due diligence or HRIAs critically require ensuring meaningful participation in the risk identification and comments about the impacts, its severity and likelihood, and development of harm prevention and mitigation measures from potentially affected groups and other relevant stakeholders in the context of implementation of the AI system under evaluation.”); Center for Democracy & Technology Comment at 26 (“[Human rights impact assessments] are intended to identify potential impacts of an AI system on human rights ranging from privacy and non-discrimination to freedom of expression and association.”).

237 See, e.g., Mozilla Open Source Audit Tooling (OAT) Comment at 7; ARC Comment at 5.

238 NIST AI RMF Playbook.