
Purpose of Evaluation

March 27, 2024
Earned Trust through AI System Assurance

AI system evaluations are useful to:

  • Improve internal processes and governance;179
  • Provide assurance to external stakeholders that AI systems and applications are trustworthy;180  and
  • Validate claims of trustworthiness.181

One purpose of an evaluation is claim validation. The goal of such an inquiry is to verify or validate claims made about the AI system, answering the question: Is the AI system performing as claimed with the stated limitations? The advantage of scoping an evaluation like this is that it is more amenable to binary findings, and there are often clear enforcement mechanisms and remedies to combat false claims in the commercial context under federal and state consumer protection laws.
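To make the idea concrete, a claim-validation check can often be reduced to measuring the system on held-out data and comparing the result against the published claim and its stated scope. The sketch below is illustrative only; the claim, the toy model, and the test data are hypothetical placeholders, not a prescribed methodology.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PerformanceClaim:
    """A vendor's published claim: the metric, its claimed floor, and the stated scope."""
    metric: str
    claimed_minimum: float
    stated_scope: str  # e.g., "short English-language text only"


def validate_claim(
    claim: PerformanceClaim,
    predict: Callable[[str], int],
    test_inputs: Sequence[str],
    test_labels: Sequence[int],
) -> dict:
    """Produce a binary finding: does measured accuracy meet the claimed floor?"""
    correct = sum(predict(x) == y for x, y in zip(test_inputs, test_labels))
    measured = correct / len(test_labels)
    return {
        "metric": claim.metric,
        "claimed_minimum": claim.claimed_minimum,
        "measured": round(measured, 4),
        "scope": claim.stated_scope,
        "claim_supported": measured >= claim.claimed_minimum,
    }


# Hypothetical usage: a toy classifier and a tiny labeled test set.
if __name__ == "__main__":
    claim = PerformanceClaim("accuracy", 0.90, "short English-language text only")
    toy_model = lambda text: int("urgent" in text.lower())
    inputs = ["Urgent: reply now", "Quarterly report attached", "URGENT invoice", "Team lunch Friday"]
    labels = [1, 0, 1, 0]
    print(validate_claim(claim, toy_model, inputs, labels))
```

Because the finding is a simple pass/fail against the stated claim and scope, this kind of check maps naturally onto the binary determinations that consumer protection enforcement relies on.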

Another type of evaluation examines the AI system according to a set of criteria independent of an AI actor’s claims. Such an evaluation might have a narrow aperture, focusing on the critical determination of how accurately a system performs its task or whether it produces unlawfully discriminatory outputs, for example.182 Or it might go broader, focusing on governance and system architecture, but only for a small subset of objectives, such as protecting intellectual property.183 In theory, an evaluation can also be comprehensive, looking at governance, architecture, and applications with respect to the management of all identified risks such as robustness, bias, privacy, intellectual property infringement, explainability, and efficacy.184

Commenters proposed various subjects for evaluations. The following is our synthesis of the most frequent mentions:

  • System performance and impact:
    • Verification of claims, including claims about accuracy, fairness, efficacy, robustness, and fitness for purpose.
    • Legal and regulatory compliance.
    • Protection for human and civil rights, labor, consumers, and children.
    • Data protection and privacy.
    • Environmental impacts.
    • Security.
  • Processes:
    • Risk assessment and management, continuous monitoring, mitigation, process controls, and adverse incident reporting.
    • Data management, including provenance, quality, and representativeness.
    • Communication and transparency, including documentation, disclosure, and explanation.
    • Human control and oversight of the AI system and outputs, as well as human fallback for individuals impacted by system outputs.
    • By-design efforts towards trustworthiness throughout the AI system lifecycle.
    • Incorporation of stakeholder participation.

We heard from many that evaluations must include perspectives from marginalized communities185 and reflect the “inclusion of a diverse range of interests and policy needs.”186 One commenter argued that frameworks for environmental impact assessments, which “mandate public participation ‘by design,’” should be considered in this context.187

All evaluations require measurement methodologies, which auditors are deploying in the field.188 There are technical questions about how to test for certain harms like unlawful discrimination, including how to design the evaluation and what test data to use. What counts as problematic discrimination is a normative question that will be determined by the relevant law and norms in the domain of application (e.g., housing, employment, financial). As discussed below, the pace of standards development may lag behind the need for evaluation, in which case those conducting necessary evaluations will have to earn trust on the basis of their criteria and methodology.
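One illustrative measurement methodology for discrimination testing is a selection-rate comparison across groups, such as the adverse impact ratio associated with the four-fifths rule used in the employment context. Whether that screen is the appropriate measure, and what threshold applies, depends on the governing law and the domain; the sketch below is a hypothetical illustration with made-up audit data, not a legal test.

```python
from collections import defaultdict
from typing import Dict, Sequence


def selection_rates(groups: Sequence[str], selected: Sequence[bool]) -> Dict[str, float]:
    """Per-group selection rate: the share of each group receiving the favorable outcome."""
    totals, favorable = defaultdict(int), defaultdict(int)
    for group, outcome in zip(groups, selected):
        totals[group] += 1
        favorable[group] += int(outcome)
    return {group: favorable[group] / totals[group] for group in totals}


def adverse_impact_ratio(groups: Sequence[str], selected: Sequence[bool]) -> float:
    """Ratio of the lowest to the highest group selection rate (four-fifths rule screen)."""
    rates = selection_rates(groups, selected)
    return min(rates.values()) / max(rates.values())


# Hypothetical audit data: group labels and whether the system selected each applicant.
if __name__ == "__main__":
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
    selected = [True, True, True, False, True, False, False, False]
    ratio = adverse_impact_ratio(groups, selected)
    print(f"adverse impact ratio = {ratio:.2f}  (common screening threshold: 0.80)")
```

The metric itself is straightforward; the harder questions the commenters raise are which groups, outcomes, and test data the evaluation should use, and what the applicable law treats as problematic.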

Commenters thought that the type of independent evaluation called for should be pegged to the risk level of the AI system.189 There was strong support for conducting such evaluations on an ongoing basis throughout the AI system lifecycle, including the design, development, and deployment stages.190 As entities develop AI systems or system components, and as entities then produce AI system outputs, every node in that chain should bear responsibility for assuring its part in relation to trustworthy AI. This is ideally how it works in the financial value chain, with organizations (e.g., payroll processors or securities market valuators) relying on, and in turn providing, audited financial statements and reports describing processes and controls. As one commenter stated, these communications “explicitly acknowledge the interrelationship between the controls of the service organization and the end user.”191

It is generally desirable for independent evaluations to use replicable methods,192 and to present the results in standardized formats so as to be easily consumed and acted upon.193 But given how vastly different deployments can be – for example, automated vehicles versus test scoring – some aspects of AI evaluations will have to be conducted differently depending on the sector.194 Evaluations of foundation models, where use cases may be diverse and unpredictable, have their own challenges. Moreover, trade secret protection for information that is evaluated may make replicability difficult.
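A standardized, machine-readable report format could make results easier to compare across evaluations while still carrying sector-specific criteria. The schema below is a hypothetical sketch, not a format adopted by any standards body; all field names and values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvaluationResult:
    """One evaluated criterion: what was measured, how, and the outcome."""
    criterion: str       # e.g., "accuracy", "adverse_impact_ratio"
    methodology: str     # reference to the test procedure used
    value: float
    threshold: float
    passed: bool


@dataclass
class EvaluationReport:
    """A standardized, machine-readable evaluation report (hypothetical schema)."""
    system_name: str
    system_version: str
    evaluator: str
    sector: str            # deployment context, e.g., "employment", "transportation"
    lifecycle_stage: str   # "design", "development", or "deployment"
    results: List[EvaluationResult] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical usage: the same envelope can carry very different sector-specific criteria.
report = EvaluationReport(
    system_name="resume-screener",
    system_version="2.3.1",
    evaluator="Example Audit LLC",
    sector="employment",
    lifecycle_stage="deployment",
    results=[
        EvaluationResult("accuracy", "held-out labeled sample", 0.91, 0.90, True),
        EvaluationResult("adverse_impact_ratio", "four-fifths rule screen", 0.78, 0.80, False),
    ],
)
print(report.to_json())
```

A common envelope of this kind would support comparison across engagements while the criteria and methodologies inside it remain sector-dependent, consistent with the commenters' observations above.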

It will take time for the evaluation infrastructure to mature as the methodologies and criteria emerge.195 One possible outcome of standardization, discussed below, would be a modular approach to evaluations, which would recognize parent standards (e.g., for examining specific processes, attributes, or risks) and then recognize additional standards as applicable to the product being audited to craft overall evaluations suitable for the relevant industry sector or type of model. Standardization efforts that are well funded and coordinated across sectors could achieve a baseline of common-denominator elements, supplemented by modules adapted for the application domain or for foundation models.
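The modular idea can be pictured as a common baseline of parent modules combined with modules selected for the application domain or model type. The configuration sketch below is purely illustrative; the module names are hypothetical placeholders, not recognized standards.

```python
# Baseline (parent) modules applied to every evaluation; names are hypothetical placeholders.
BASELINE_MODULES = ["governance", "data_management", "security", "transparency"]

# Additional modules recognized as applicable to particular domains or model types.
DOMAIN_MODULES = {
    "employment": ["disparate_impact_testing", "human_review_fallback"],
    "transportation": ["operational_safety", "robustness_under_distribution_shift"],
    "foundation_model": ["capability_evaluation", "misuse_red_teaming"],
}


def compose_evaluation(domains: list[str]) -> list[str]:
    """Combine the baseline modules with the modules applicable to each listed domain."""
    modules = list(BASELINE_MODULES)
    for domain in domains:
        for module in DOMAIN_MODULES.get(domain, []):
            if module not in modules:
                modules.append(module)
    return modules


print(compose_evaluation(["employment"]))
print(compose_evaluation(["foundation_model", "transportation"]))
```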



179 See, e.g., CAQ Comment at 6 (“Ultimately, the performance of robust risk assessment and development of processes and controls increases internal accountability and leads to improvements in the quality of information reported externally”); Ernst & Young Comment at 4 (“The value of verification schemes in the context of AI accountability can have both external and internal benefits for an organization. While they can contribute to promoting trust among external stakeholders such as customers, users and the public, they also play a role in identifying potential weaknesses in internal processes in organizations and strengthening those internal processes.”).

180 See, e.g., Unlearn.AI Comment at 1; Responsible AI Institute Comment at 4; Intel Comment at 3.

181 See, e.g., Trail of Bits Comment at 1 (Audits should assess performance against verifiable claims as opposed to accepted benchmarks); PWC Comment at A1 (“[T]rust in Artificial Intelligence (AI) systems and the data that feeds them may ultimately be achieved through a two-pronged system: (1) a management assertion on compliance with the applicable trustworthy AI standard or framework and (2) third-party assurance on management’s assertion.”).

182 See, e.g., Salesforce Comment at 5 (recommending that impact assessments be used to counter bias in hiring); AI Audit Comment at 3-4; U.S. Equal Employment Opportunity Commission, Testimony of Suresh Venkatasubramanian (Jan. 31, 2023) (recommending that entities using AI for hiring conduct mandatory “disparity assessments to determine how their systems might exhibit unjustified differential outcomes [and] mitigate these differential outcomes as far as possible with the result of this assessment and mitigation made available for review.”).

183 See, e.g., Association of American Publishers (AAP) Comment at 4-5 (“AI technologies should be audited as to whether the material used to create the training data sets was legitimately sourced, and whether appropriately licensed from or its use authorized by the copyright owner or rights holder.”).

184 Lumeris Comment at 3 (adding consideration of human fallback and governance); ForHumanity Comment at 6 (adding consideration of cybersecurity, lifecycle monitoring, human control); Holistic AI Comment at 4. See also Inioluwa Deborah Raji, Sasha Costanza-Chock, and Joy Buolamwini, “Change From the Outside: Towards Credible Third-Party Audits of AI Systems,” Missing Links in AI Policy (2022), at 8 (“AI audits can help identify whether AI systems meet or fall short of expectations, whether in terms of stated performance targets (such as prediction or classification accuracy) or in terms of other concerns such as bias and discrimination (disparate performance between various groups of people); data protection, privacy, safety and consent; transparency, explainability and accountability; adherence to standards, ethical principles and legal and regulatory requirements; or labor practices, energy use and ecological impacts.”).

185 See, e.g., ADL Comment at 7 (recommending consideration of “how civil society can advise in the fine-tuning of AI data sets to ensure that AI tools account for context specific to historically marginalized groups and immediate societal risks”).

186 Holistic AI Comment at 11 (“A body of interdisciplinary experts needs to collectively determine best practices, standards and regulations to ensure inclusion of a diverse range of interests and policy needs. This body should be composed of stakeholders beyond, for example, the big technology players of the private sector and large international NGOs; such stakeholders should include smaller technology companies and local civil society organizations given their frontline work with users.”); Global Partners Digital Comment at 7 (the “iterative evaluation” of AI systems must include “the participation of a wide range of stakeholders, including those that are impacted by the system deployment and not only those controlling the system.”); AI & Equality Comment at 6-7 (discussing stakeholder involvement); #ShePersisted Comment at 8-10 (women who are targeted by gender-based violence online should be represented in establishing evaluations for AI systems); Ada Lovelace Institute Comment at 5 (“The long history of environmental impact assessments (emerging under the US NEPA) in policy offers learnings for the potential for impact assessments for AI: frameworks for EIAs mandate public participation ‘by design’ to improve the legitimacy and quality of the EIA and to contribute to normative goals like democratic decision-making”). See also Wesley Hanwen Deng et al., Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice, CHI ’23, ACM Conference on Human Factors in Computing Systems (April 2023), at 1-18.

187 Ada Lovelace Comment at 5. See also Wesley Hanwen Deng et al., Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice, CHI ’23, ACM Conference on Human Factors in Computing Systems (April 2023), at 1-18 (showing difficulties in recruiting user auditors and conducting user-engaged audit reports).

188 See, e.g., O'Neil Risk Consulting & Algorithmic Auditing; Credo AI; Eticas.

189 See, e.g., Responsible AI Institute Comment at 4 (“Generally, the higher the probability and magnitude of potential harms associated with an AI use case, the more likely it is that a rigorous, independent audit will be appropriate”). See also supra Purpose of Evaluation Section.

190 See, e.g., Hitachi Comment at 9 (stressing the need to evaluate frequently); The Future Society Comment at 4; Global Partners Digital Comment at 4.

191 PWC Comment at A7. See also Palantir Comment at 10 (stressing process measures in the AI system development phase, including data collection practices, “access controls, logging, and monitoring for abuse”).

192 See Pattrn Analytics & Intelligence, Evaluating Recommender Systems in Relation to the Dissemination of Illegal and Harmful Content in the UK (July 2023), at 35.

193 See CAQ Comment at 8 (“We believe that a consistent report format is important as it allows users of the report to compare reports across different assurance engagements. Further, the Independent Accountants’ Report provides critical information to users, including the criteria, level of assurance, responsibilities of the auditor and entity management, and any limitations, among other information.”).

194 See, e.g., MITRE Comment at 5 (use “sector regulators” to “adopt and adapt accountability mechanisms tailored to specific AI use case”); Consumer Reports Comment at 28 (“[T]he type of audit that can be executed and the extent to which a researcher is able to assess a model is highly dependent on the information they have access to.”).

195 See Salesforce Comment at 4 (evaluation “tools need to be built on accepted AI definitions, thresholds, and norms that are not yet established in the United States.”).