
AI System Evaluations

March 27, 2024
Earned Trust through AI System Assurance

Transparency and disclosures regarding AI systems are primarily valuable insofar as they feed into accountability.172 One essential tool for converting information into accountability is critical evaluation of the AI system. The National Artificial Intelligence Advisory Committee (NAIAC), in its 2023 report, observed that “practices, standards, and frameworks for designing, developing, and deploying trustworthy AI are created in organizations in a relatively ad hoc way depending on the organization, sector, risk level, and even country.”173 We agree with its accompanying observation that it is problematic that “[r]egulations and standards are being proposed that require some form of audit or compliance, but without clear guidance accompanying them.”174

The RFC described different types of evaluation, including audits, impact and risk assessments, and pre-release certifications. Commenters were divided on whether independent audits are feasible now, before agreed-upon criteria exist for all aspects of AI system evaluation, and on whether audits should be mandated.175 Some comments reflected frustration with decades of technology self-regulation that has failed to meet societal expectations for risk management and accountability.176 At the same time, other commenters noted that audit practices (whether required or not) can devolve into rote checklist compliance, industry capture, and audit-washing.177


The scope and use of audits in accountability structures should depend on the risk level, deployment sector, maturity of relevant evaluation methodologies, and availability of resources to conduct the audits. Audits are probably appropriate for any high-risk application or model. At the very least, audits should be capable of validating claims made about system performance and limitations as well as governance controls. Where audits seek to assure a broader range of trustworthy AI attributes, they should ideally use replicable, standardized, and transparent methods. We recommend below that audits be required, regulatory authority permitting, for designated high-risk AI systems and applications and that government act to support a vigorous ecosystem of independent evaluation. We also recommend that audits incorporate the requirements in applicable standards that are recognized by federal agencies. Designating what counts as high risk outside of specific deployment or use contexts is difficult. Nevertheless, in draft guidance for federal agencies, OMB has designated presumptive categories of rights-impacting and safety-impacting AI systems, while providing for context-dependent exemptions.178 This is a promising approach to creating risk buckets for AI systems generally.



172 See, e.g., Generally Intelligent Comment at 4 (cautioning that disclosure requirements without consequence can be a “decoy”); Cordell Institute for Policy in Medicine & Law Comment at 2 (with reference to “[a]udits, assessments and certifications,” cautioning that “[m]ere procedural tools will fail to create meaningful trust and accountability without a backdrop of strong, enforceable consumer and civil rights protections.”); Mike Ananny and Kate Crawford, “Seeing Without Knowing: Limitations of the Transparency Ideal and its Application to Algorithmic Accountability,” New Media & Society, Vol. 20, Iss. 3, at 977-982 (December 13, 2016) (describing ten “[l]imits of the transparency ideal”: that “[t]ransparency can be disconnected from power,” “[t]ransparency can be harmful,” “[t]ransparency can intentionally occlude,” “[t]ransparency can create false binaries,” “[t]ransparency can invoke neoliberal models of agency,” “[t]ransparency does not necessarily build trust,” “[t]ransparency entails professional boundary work,” “[t]ransparency can privilege seeing over understanding,” “[t]ransparency has technical limitations,” and “[t]ransparency has temporal limitations”).

173 National Artificial Intelligence Advisory Committee, Report of the National Artificial Intelligence Advisory Committee (NAIAC), Year 1 (May 2023) at 28.

174 Id.

175 Compare Certification Working Group Comment at 21 (recommending mandating “accountability measures” and auditor and researcher access “for high capability AI systems (those that operate autonomously or semi-autonomously and pose substantial risk of harm, including physical, emotional, economic, or environmental harms)”) with The American Legislative Exchange Council Comment at 8 (“voluntary codes of conduct, industry-driven standards, and individual empowerment should be preferred over government regulation in emerging technology.”).

176 The AFL-CIO Technology Institute Comment at 5 (“Self-regulatory, self-certifying, or self-attesting accountability mechanisms are insufficient to provide the level of protection workers, consumers, and the public deserve. Certifications generally only determine whether the development of the AI product or service has followed a promised set of guidelines, typically established by the developer or company or industry body.”); Center for American Progress Comment at 16 (“In order to get private companies to conduct these assessments and audits, mechanisms must directly impact what developers care about most and be aligned with the for-profit incentives driving their rapid technological development. For these reasons, voluntary measures are insufficient. Government action (such as formal rulemaking, executive orders, and new laws) are clearly needed; we cannot allow the Age of AI to be another age of self-regulation.”).

177 Mozilla Comment at 6 (“[I]t is important to untangle incentives in the auditing ecosystem — only where the incentive structure is right and auditors are sufficiently independent (and have sufficient access) can there be more certainty that audits aren’t simply conducted for the purpose of ‘audit-washing’”); The Cordell Institute for Policy in Medicine & Law Comment at 2 (describing rules built only around transparency and bias mitigation as “AI half-measures” because they provide the appearance of governance but fail (when deployed in isolation) to promote human values or hold liable those who create and deploy AI systems that cause harm). See also Ellen P. Goodman and Julia Trehu, Algorithmic Auditing: Chasing AI Accountability, 39 Santa Clara High Tech. L.J. 289, 302 (2023) (coining the term “audit-washing” to describe the use of weak audit criteria to effectively misrepresent AI system characteristics, performance, or risks).

178 See OMB Draft Memo at 24-25.