Public Safety
This section examines the marginal risks and benefits to public safety posed by dual-use foundation models with widely available model weights. As the AI landscape evolves, these risks, benefits, and overall impacts on public safety may shift. The policy recommendations section addresses these challenges.
Risks of Widely Available Model Weights for Public Safety
Dual-use foundation models with widely available model weights could plausibly exacerbate the risks AI models pose to public safety by allowing a wider range of actors, including irresponsible and malicious users, to leverage the existing capabilities of these models and augment them to create more dangerous systems.33 For instance, even if the original model has built-in safeguards to prohibit certain prompts that may harm public safety, such as content filters,34 blocklists,35 and prompt shields,36 direct model weight access can allow individuals to strip these safety features.37 While people may be able to circumvent these mechanisms in closed models, direct access to model weights allows these safety features to be circumvented more easily. Further, such modifications are much easier and require fewer resources and less technical knowledge than training a new model directly. They may also be difficult to monitor, oversee, and control unless the individual uploads the modified model publicly.38 As with all digital data in the Internet age, the release of model weights also cannot feasibly be reversed.
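To illustrate why such safeguards do not travel with the weights, the sketch below shows, in simplified and hypothetical form, how a hosted deployment might layer a blocklist-based prompt shield and a content filter around a model in its serving code. The blocklist terms, the filter logic, and the model_generate stand-in are illustrative assumptions, not any provider's actual implementation.

```python
# A minimal, hypothetical sketch of application-layer safeguards around a
# hosted model. All names and terms below are illustrative stand-ins.

BLOCKLIST = {"example prohibited topic", "another prohibited phrase"}  # hypothetical

def violates_blocklist(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def model_generate(prompt: str) -> str:
    # Stand-in for the underlying foundation model call.
    return "model output for: " + prompt

def guarded_generate(prompt: str) -> str:
    # Prompt shield: refuse prompts that match the blocklist.
    if violates_blocklist(prompt):
        return "Request declined by policy."
    output = model_generate(prompt)
    # Content filter: screen generated text before returning it to the user.
    if violates_blocklist(output):
        return "Response withheld by policy."
    return output

print(guarded_generate("a benign question"))
```

Because checks like these live in the serving layer rather than in the weights themselves, an actor with direct weight access can simply load the model and call it without them, which is part of why safeguard removal is easier than with API-only access.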
Users can also circumvent safeguards in closed AI models, such as by consulting online information about how to ‘jailbreak’ a model to generate unintended answers (i.e., creative prompt engineering) or, for more technical actors, by fine-tuning AI models via APIs.39 However, mitigations for these circumventions exist for API-access systems, such as moderating data sent to a model and incorporating safety-promoting data during fine-tuning; these same mitigation strategies do not reliably work on AI models with widely available model weights.40 Experimentation with available model weights, while often helpful for research on defenses against previously unknown attacks, can also illuminate new channels for malicious actors to exploit proprietary models, because open models are easier to manipulate and can share properties with closed models.41
One mitigation that may work is to tune a model on distinct objective functions and weaken its ability to produce dangerous information before its weights are made widely available. However, we currently have limited technical understanding of the relative efficacy of different safeguards, and the protections available to closed models might end up providing significant additional protection.42
This Report considers two discrete public safety risks discussed in relation to dual-use foundation models with widely available model weights:
- lowering the barrier of entry for non-experts to leverage AI models to design and access information about chemical, biological, radiological, or nuclear (CBRN) weapons, as well as potentially synthesize, produce, acquire, or use them; and
- enabling offensive cyber operations through automated vulnerability discovery and exploitation for a wide range of potential targets.
Chemical, Biological, Radiological, or Nuclear Threats to Public Safety
Widely available model weights could potentially exacerbate the risk that non-experts use dual-use foundation models to design, synthesize, produce, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons.
Open model weights could possibly increase this risk because they are:
- more accessible to a wider range of actors, including actors who otherwise could not develop advanced AI models or use them in this way (either because closed models lack these capabilities, or they cannot “jailbreak” them to generate the desired information); and
- easy to distribute, which means that the original model and augmented, offshoot models, as well as instructions for how to exploit them, can be proliferated and used for harm without developer knowledge.
ACCESSIBILITY
This ease of access may enable various forms of CBRN risk. For instance, large language models (LLMs) can generate existing, dual-use information (defined as information that could support creation of a weapon but is not sensitive) or act as chemistry subject matter experts and lab assistants, and LLMs with open model weights specifically can be fine-tuned on domain-specific datasets, potentially exacerbating this risk.43 However, it remains unclear how much CBRN-related information open models can generate beyond what users can find from closed models and other easily accessible sources of information (e.g., search engines), as well as how easily mitigation measures for these respective threats can be implemented.44
Open dual-use foundation models also potentially increase the level of access to biological design tools (BDT). BDTs can be defined “as the tools and methods that enable the design and understanding of biological processes (e.g., DNA sequences/synthesis or the design of novel organisms).”45 Intentional or unintentional misuse of BDTs introduces the risk that they can create new information, as opposed to large language models’ dissemination of information that is widely available.46 While BDTs exceeding the 10B parameter threshold are just now beginning to appear, sufficiently capable BDTs of any scale should be discussed alongside dual-use foundation models because of their potential risk for biological and chemical weapon creation.47
EASE OF DISTRIBUTION
Some experts have argued that the indiscriminate and untraceable distribution unique to open model weights creates the potential for enabling CBRN activity amongst bad actors, especially as foundation models increase their multi-modal capabilities and become better lab assistants.48 No current models, proprietary or widely available, offer uplift on these tasks relative to open source information resources on the Internet.49 But future models, especially those trained on confidential, proprietary, or heavily curated datasets relevant to CBRN, or those that significantly improve in multi-step reasoning, may pose risks of information synthesis and disclosure.50
FURTHER RESEARCH
Further research is needed to properly address the marginal risk added by the accessibility and ease of distribution of open foundation models. For instance, it remains unclear how large the risk delta is between jailbreaking future closed models for CBRN content and augmenting open models, and how the size of the model, the type of system, and the technical expertise of the actor may change these calculations. Previous evaluations of CBRN risk may not cover all available open models or closed models whose weights could be made widely available.51 Future analysis should distinguish between and treat separately each aspect of the chemical, biological, radiological, or nuclear risks associated with open model weights.
Experts must also assess the extent to which open models increase this risk in the context of the entire design and development process of CBRN material. Information about how to design CBRN weapons may not be the highest barrier to developing them. Beyond computational design, pathogens, toxins, and chemical agents must be physically generated, which requires expertise and lab equipment.52 Other factors, such as the ease of attaining CBRN material, the incentives for engagement in these activities, and other mitigation measures—i.e., current legal prohibitions on nuclear, biological, and chemical weapons—also determine the extent to which open models introduce a substantive CBRN threat.
Offensive Cyber Operations Risks to Public Safety
Modifying an advanced dual-use foundation model with widely available model weights requires significantly fewer resources than training a new model and may be more plausible than circumventing safeguards on closed models. It is possible that fine-tuning existing models on tasks relevant to cyber operations could further aid in conducting cyberattacks—especially for actors that conduct operations regularly enough to have rich training data and experimentation environments.53
FORMS OF ATTACKS
Cyber attacks that rely on dual-use foundation models with widely available model weights could take various forms, such as social engineering and spear-phishing, malware attack generation, and exploitation of other models’ vulnerabilities.
First, open foundation models could enable social engineering (including through voice cloning and the automated generation of phishing emails).54 Attacks could also take the form of automated cybersecurity vulnerability detection and exploitation.55 The marginal cybersecurity risk posed by the wide distribution of dual-use foundation models may increase the scale of malicious action, overwhelming the capacity of law enforcement to effectively respond.
Cyber-attackers could also potentially leverage open models to automatically generate malware attacks and develop more sophisticated malware, such as viruses,56 ransomware,57 and Trojans.58 For instance, one of Meta’s open foundation models, Llama 2, may have helped cyber-attackers design tools to illicitly download other individuals’ login credentials.59
Finally, actors can leverage open models to exploit vulnerabilities in other AI models, through data poisoning, prompt injections, and data extractions.60
FURTHER RESEARCH
The marginal cybersecurity risk posed by dual-use foundation models with widely available model weights remains unclear, and likely varies by attack vector and the preexisting capabilities of the cyber attackers in question.61 For years, tools and exploits have become more readily accessible to lower-resourced adversaries, suggesting that foundation models may not drastically change the state of cybersecurity, but rather represent a continuation of existing trends. In the near term, the marginal uplift that widely available weights provide to social engineering and phishing uses of foundation models may be the most significant of these risks.62 Closed foundation models and other machine learning models that can detect software vulnerabilities, alongside other cyber-attack tools, such as Metasploit, can also be found online for free, and play a critical role in adversary emulation.63 Further, while open models could provide new instruments for performing offensive attacks, hackers may not want to invest time, energy, and resources into leveraging these models to update their existing techniques and tools.64 The extent to which a particular dual-use foundation model with widely available model weights would meaningfully increase marginal risk is therefore uncertain.
When an AI system or tool is built using a foundation model with widely available model weights, the inclusion of the model could also introduce unintentional cybersecurity vulnerabilities into the application.65 Promisingly, these types of harms, where the deployer of the model does not want the vulnerability to be present, are more readily preventable than harms intentionally leveraged by the deployer of the model.66 In line with the Cybersecurity and Infrastructure Security Agency’s Secure by Design guidance, developers of AI models, whether open or closed source, can take steps to build in security from the start.67
Benefits of Widely Available Model Weights for Public Safety
The open release of foundation model weights also introduces benefits. Specifically, widely available model weights could:
- bolster cyber deterrence and defense mechanisms;
- propel safety research and help identify safety and security vulnerabilities on future and existing models; and
- facilitate transparency and accountability through third-party auditing mechanisms.
Cyber Deterrence & Defense
Open foundation models can further cyber defense initiatives. For example, various cyber defense models, such as Security-BERT,68 a privacy-preserving cyber-threat detection model, are fine-tuned versions of open foundation models.69 These models and other systems built on open models provide security benefits by allowing firms, researchers, and users to use potentially sensitive data without sending this data to a third-party proprietary model for processing. Models with widely available weights can also be more flexibly and narrowly optimized for a particular deployment context, including through quantization, allowing opportunities for cost savings. Thus, several intrinsic technical benefits of openness allow a wider range of users to benefit from the value foundation models introduce to securing computing systems.70 For instance, entities have published cyber-defense toolkits and created open-source channels for collaboration on cyber defense.71
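As one concrete illustration of the narrow-optimization point above, the hedged sketch below applies post-training dynamic quantization in PyTorch to a toy two-layer network standing in for a fine-tuned detection model. A real deployment would load actual released weights; the layer sizes here are arbitrary assumptions.

```python
# Minimal sketch, assuming a toy stand-in model: shrink Linear-layer weights
# to 8-bit integers for local, on-premises inference.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 2),  # e.g., a small threat/no-threat classification head
)
model.eval()

# Post-training dynamic quantization of the Linear layers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    scores = quantized(torch.randn(1, 768))
print(scores.shape)  # torch.Size([1, 2])
```

Quantizing weights in this way is one common, low-effort route to running an openly released model on modest local hardware, which is part of what lets sensitive security data stay on-premises.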
Furthermore, any advances in dual-use foundation models’ offensive cyber-attack capabilities may also strengthen defensive cybersecurity capabilities. If dual-use foundation models develop advanced offensive capabilities, those same capabilities can be used in securing systems and defending against cyberattacks. By detecting and addressing otherwise-undetected cybersecurity vulnerabilities, dual-use foundation models with widely available model weights could facilitate stronger cyber-defensive measures at scale.72 Parity of licit access to models that have offensive cyber capabilities is also important for accurate adversary emulation, as advanced international cyber actors may incorporate such models into their own tradecraft. However, these benefits must be contextualized within the larger cyber defense landscape, as many developers perform their most effective cyber defense research internally.
Safety Research & Identification of Safety and Security Vulnerabilities
Widely available model weights can propel AI safety research. Open foundation models allow researchers without in-house proprietary AI models, such as academic institutions, non-profits, and individuals, to participate in AI safety research. A broad range of actors can experiment with open foundation model weights to advance research on many topics, such as vulnerability detection and mitigation, watermarking failures, and interpretability.73
Actors can also tailor safeguards to specific use-cases, thus improving downstream models. Creating external guardrails for dual-use foundation models can be an abstract, under-specified task; actors that use open models for specific purposes can narrow and concretize this task and add more targeted and effective safety training, testing, and guardrails. For instance, an actor that fine-tunes a foundation model to create an online therapy chatbot can add specific content filters for harmful mental health content, whereas a general-purpose developer may not consider all the possible ways an LLM could produce negative mental health information.
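The sketch below shows, in deliberately simplified and hypothetical form, what such a downstream, use-case-specific guardrail might look like for the therapy-chatbot example. The keyword list, the referral text, and the chatbot_generate stand-in are illustrative assumptions, not a clinically validated filter.

```python
# Hypothetical downstream guardrail: a deployer of a fine-tuned chatbot screens
# conversations for crisis-related content and substitutes a referral message.

CRISIS_TERMS = ("self-harm", "suicide")  # illustrative; real filters use classifiers

REFERRAL = (
    "It sounds like you may be going through something serious. "
    "Please consider reaching out to a licensed professional or a crisis line."
)

def chatbot_generate(user_message: str) -> str:
    # Stand-in for the fine-tuned open model's response.
    return "supportive reply to: " + user_message

def safe_reply(user_message: str) -> str:
    reply = chatbot_generate(user_message)
    text = (user_message + " " + reply).lower()
    if any(term in text for term in CRISIS_TERMS):
        return REFERRAL  # targeted, domain-specific safeguard added downstream
    return reply

print(safe_reply("I had a stressful week"))
```

A real deployer would likely use a trained classifier rather than keywords, but the point stands: the narrower the use case, the easier it is to specify and test a targeted safeguard.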
Open foundation models allow a broader range of actors to examine and scrutinize models to identify potential vulnerabilities and implement safety measures and patches, permitting more detailed interrogation and testing of foundation models across a range of conditions and variables.74 This scrutiny from more individuals allows developers to understand models’ limitations and ensure models’ reliability and accuracy in scientific applications. An open model ecosystem also increases the availability of tools, such as open-source audit tooling projects, that regulators can use to monitor and evaluate AI systems.75
Experimentation on model weights for research may also help propel alignment techniques. Llama 2, for example, has enabled research on reinforcement learning from human feedback (RLHF), though the underlying RLHF mechanism was first introduced by OpenAI, a closed model-weight company.76 Open models will likely help the AI community grapple with future alignment issues. However, as models develop, this research benefit should be weighed against the possibility that open model weights could enable some developers to develop, use, or fine-tune systems without regard for safety best practices or regulations, resulting in a race to the bottom with negative impacts on public safety or national security. Malicious actors could weaponize or misuse models, increasing challenges to effective human control over highly capable AI systems.77
Auditing and Accountability
Weights, along with data and source code, are a critical piece of any accountability regime. Widely available model weights more readily allow neutral third-party entities to assess systems, perform audits, and validate internal developer safety checks. While access to model weights alone is insufficient for conducting more exhaustive testing, it is necessary for most useful testing of foundation models.78
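The sketch below gives a minimal, hypothetical picture of the kind of evaluation that local weight access enables: a third-party auditor runs a fixed prompt set against a locally loaded copy of the model and reports a simple refusal-rate metric without routing any data through the developer. The load_local_model stub, the refusal markers, and the prompt set are illustrative assumptions.

```python
# Hypothetical third-party audit sketch: score how often a locally loaded
# model refuses a fixed set of audit prompts. All names below are stand-ins.

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "request declined")

def load_local_model(weights_path: str):
    # Stand-in for loading released weights with an inference library.
    def generate(prompt: str) -> str:
        return "I cannot assist with that request."
    return generate

def refusal_rate(generate, prompts) -> float:
    refusals = sum(
        any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return refusals / len(prompts)

generate = load_local_model("path/to/released/weights")
audit_prompts = ["benign prompt", "policy-violating prompt"]  # illustrative only
print(f"refusal rate: {refusal_rate(generate, audit_prompts):.2f}")
```

Auditors can extend the same pattern to bias probes, jailbreak-resistance suites, or watermark checks, which is why weight access, while not sufficient on its own, underpins most useful third-party testing.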
Expanding the realm of auditors and allowing for external oversight regarding developers’ internal safety checks increases accountability and transparency throughout the AI lifecycle, as well as public preparedness for harms. This is for three reasons.
First, the developer may be able to use information from external auditors about its model’s robustness to improve the model’s next iteration, and other AI developers may be able to benefit from this information to identify potential vulnerability points to avoid in future models.
Second, third-party evaluations can hold developers accountable for their internal safety and security checks, as well as hold downstream deployers responsible for which models they choose to use and how, which could improve accountability throughout the AI lifecycle. Of note, it may be difficult to implement such a third-party evaluation system due to differences in evaluations, the lack of ability to articulate how models can fail, and the scale of potential risks. Accessible model weights, alongside data and source code, facilitate oversight by regulatory bodies and independent researchers, allowing for more effective monitoring of AI technologies.79
Finally, a robust accountability environment may increase public trust and awareness of model capabilities, which could help society prepare for potential risks introduced by AI. The public can then respond to and develop resiliency measures to potential harms that have been demonstrated empirically. A foundation model ecosystem in which many models have widely available weights also further promotes transparency and visibility within the field, making it easier for the broader community to understand how models are developed and function.80
These community-led AI safety approaches could result in safer models, increased accountability, and improved public trust in AI and preparedness for potential risks. This transparency is vital for fostering trust between AI developers and the public and encourages accountability, as work is subject to scrutiny by the global community.81
33 Fine-tuning away Llama 2-Chat 13B’s safety features while retaining model performance costs less than $200. See, Gade, P., et al. (2023, October 31). BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B. ArXiv.
34 Filters applied to generated content that prevent prohibited material from being returned to the user.
35 Lists of words, phrases, and topics that cannot be generated.
36 Measures intended to prohibit prompts that attempt to circumvent the aforementioned safety features. However, see generally, Can Foundation Models Be Safe When Adversaries Can Customize Them? (2023, November 2). Hai.stanford.edu; Henderson, P., et al. (2023, August 8). Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models. ArXiv.
37 See, Seger, E., et al. (2023, October 9). Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives. Social Science Research Network; Boulanger, A. (2005). Open-source versus proprietary software: Is one more reliable and secure than the other? IBM Systems Journal, 44(2), 239–248; Gade, P., et al. (2023, October 31). BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B. ArXiv.
38 Mouton, Christopher A., et al., The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. Santa Monica, CA: RAND Corporation, 2024. See also CDT Comment at 19.
39 Zhan, Q. et al., (2023). Removing RLHF Protections in GPT-4 via Fine-Tuning. UIUC, Stanford.
40 See, Seger, E., et al. (2023, October 9). Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives. Social Science Research Network.
41 For instance, a method discovered to jailbreak Meta’s Llama 2 works on other LLMs, such as GPT-4 and Claude. Seger, E., et al. (2023). Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives. Social Science Research Network.
42 Li, N., et al. (2024, May 15). The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. ArXiv; Lynch, A., et al. (2024, February 26). Eight Methods to Evaluate Robust Unlearning in LLMs. ArXiv.
43 Mouton, C. A., et al. (2024, January 25). The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation.
44 Mouton, C. A., et al. (2024, January 25). The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation.
45 Congressional Research Service (2023, November 23). Artificial Intelligence in the Biological Sciences: Uses, Safety, Security, and Oversight.
46 See Johns Hopkins Center for Health Security Comment at 5 (“Indeed, less than a month after Evo was released, it had already been fine-tuned on a dataset of adeno-associated virus capsids, ie, protein shells used by a class of viruses that infect humans. As this case suggests, when a model’s weights are publicly available, a developer’s decision not to endow the model with dangerous capabilities is far from final.”).
47 See generally, Nguyen, E., et al. (2024, February 27). Evo: DNA foundation modeling from molecular to genome scale. Arc Institute. https://arcinstitute.org/news/blog/evo; Abramson, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature (2024).
48 Sandbrink, J. (2023). Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. ArXiv.
49 One counterpoint is Google AlphaFold and its prediction of protein folding. Putting the power of AlphaFold into the world’s hands. (2024, May 14). Google DeepMind.
50 See EPIC Comment at 4 fn 15. See Liwei Song & Prateek Mittal, Systematic Evaluation of Privacy Risks of Machine Learning Models, 30 Proc. USENIX Sec. Symp. 2615, 2615 (2021). Hurdles to unlearning data are at the core of recent FTC cases requiring AI model deletion. See Jevan Hutson & Ben Winters, America’s Next ‘Stop Model!’: Model Deletion, 8 Geo. L. Tech. Rev. 125, 128–134 (2022).
51 Mouton, C. A., et al. (2024, January 25). The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation.
52 Mouton, C. A., et al. (2024, January 25). The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation.
53 Fang, R., et al. (2024). LLM Agents can Autonomously Exploit One-day Vulnerabilities.
54 State of Open Source AI Book 2023 Edition. (2024).
55 Cyber attackers could use foundation models to assist in the design or deployment of sophisticated malware, including viruses, ransomware, and Trojans. For instance, Llama 2, a foundation model with widely available model weights developed by Meta, has already helped cyber-attackers design tools to illicitly download employees’ login information. Ray, T. (2024, February 21). Cybercriminals are using Meta’s Llama 2 AI, according to CrowdStrike. ZDNET. Initial evidence suggests that some closed-weight foundation models can be used to “autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback” and to “autonomously find[] [cybersecurity] vulnerabilities in websites in the wild.” Fang, R., et al. (2024, February 15). LLM Agents can Autonomously Hack Websites. ArXiv.org. The National Cyber Security Centre of the Government of the United Kingdom assesses that “in the near term, [vulnerability detection and exploitation] will continue to rely on human expertise, meaning that any limited uplift [in cyberattack threat] will highly likely be restricted to existing threat actors that are already capable. . . . However, it is a realistic possibility that [constraints on expertise, equipment, time, and financial resourcing] may become less important over time, as more sophisticated AI models proliferate and uptake increases.” National Cyber Security Centre. (2024, January 24). The near-term impact of AI on the cyber threat. Should these attacks successfully target electrical grids, financial infrastructures, government agencies, and other entities critical to public safety and national security, the security implications could be significant.
56 A virus replicates itself by modifying other programs and inserting its code into those programs.
57 Ransomware is malware that holds a device or data hostage until the victim pays a ransom to the hacker.
58 A Trojan malware attack misleads users by disguising itself as a standard program.
59 Cybercriminals are using Meta’s Llama 2 AI, according to CrowdStrike. (n.d.). ZDNET.
60 Cyber attackers could possibly use dual-use foundation models with widely available model weights to perform cyberattacks on closed models or extract data from them. Actors could (i) poison models’ training data with an influx of synthetically generated content, (ii) steal model weights and other proprietary model infrastructure content through generated “jailbreaking” prompts, and (iii) leverage open models to access individual data from closed models trained on private data, which introduces privacy and autonomy concerns. See Nasr, M. (2023). Scalable Extraction of Training Data from (Production) Language Models. Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, and ETH Zurich.
61 See National Cyber Security Centre. (2024, January 24). The near-term impact of AI on the cyber threat. (“The impact of AI on the cyber threat is uneven; both in terms of its use by cyber threat actors and in terms of uplift in capability.”).
62 See National Cyber Security Centre. (2024, January 24). The near-term impact of AI on the cyber threat at 5-7.
63 MITRE ATT&CK. (n.d.); Kapoor, S. et al., (2024). On the Societal Impact of Open Foundation Models. ArXiv.
64 Kapoor, S. et al., (2024). On the Societal Impact of Open Foundation Models. ArXiv.
65 U.S. Cybersecurity and Infrastructure Security Agency. May 2024. CISA Response to NTIA Request for Information on Dual Use Foundation Artificial Intelligence Models With Widely Available Model Weights (“Foundational models have at least two classes of potential harms. […] The second class involves impacts that are undesired by those deploying the models (e.g., cybersecurity vulnerability in a model deployed by a critical infrastructure entity). […] Creators and deployers of open foundation models can take steps to mitigate the second class of harms by using a “safe by design” approach and building in protections to their model. This may address cybersecurity vulnerabilities or other forms of harms such as biases. Responsibly developed open foundation models are likely to be less susceptible to harms and misuse, on the whole, than models that cannot be publicly audited.”).
66 U.S. Cybersecurity and Infrastructure Security Agency. May 2024. CISA Response to NTIA Request for Information on Dual Use Foundation Artificial Intelligence Models With Widely Available Model Weights (“Foundational models have at least two classes of potential harms. […] The second class involves impacts that are undesired by those deploying the models (e.g., cybersecurity vulnerability in a model deployed by a critical infrastructure entity). […] Creators and deployers of open foundation models can take steps to mitigate the second class of harms by using a “safe by design” approach and building in protections to their model. This may address cybersecurity vulnerabilities or other forms of harms such as biases. Responsibly developed open foundation models are likely to be less susceptible to harms and misuse, on the whole, than models that cannot be publicly audited.”).
67 CISA. (n.d.) Secure by Design.
68 While Security-BERT is not a dual-use foundation model because it does not have “at least tens of billions of parameters,” as required by Section 3(k) of Executive Order 14110, its capabilities may be indicative of the capabilities of dual-use foundation models.
69 Ferrag, M. et al., (2024). Revolutionizing Cyber Threat Detection with Large Language Models: A privacy-preserving BERT-based Lightweight Model for IoT/IIoT Devices. Technology Innovation Institute.
70 Alam, M. (2023). Recasting Self-Attention with Holographic Reduced Representations. ArXiv. Deng, Y. (2022). Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. ArXiv.
71 Rotlevi, S. (2024, February 15). AI Security Tools: The Open-Source Toolkit; Hughes, C. (2024, January 16). The OWASP AI Exchange: An open-source cybersecurity guide to AI components.
72 Zellers, R. et al., (2020). Defending Against Neural Fake News. ArXiv preprint; Kirchenbauer, J. (2024); On the Reliability of Watermarks for Large Language Models. International Conference on Learning Representations (ICLR); Liu, H., et al. (2023). Chain of Hindsight Aligns Language Models with Feedback. ArXiv; Belrose, N. LEACE: Perfect linear concept erasure in closed form. 37th Conference on Neural Information Processing Systems (NeurIPS 2023); Bhardwaj, R. & Poria, S. (2023). Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. ArXiv preprint; Zou, A. et al (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv preprint.
73 Zellers, R. et al., (2020). Defending Against Neural Fake News. ArXiv preprint; Kirchenbauer, J. (2024); On the Reliability of Watermarks for Large Language Models. International Conference on Learning Representations (ICLR); Liu, H., et al. (2023). Chain of Hindsight Aligns Language Models with Feedback. ArXiv; Belrose, N. LEACE: Perfect linear concept erasure in closed form. 37th Conference on Neural Information Processing Systems (NeurIPS 2023); Bhardwaj, R. & Poria, S. (2023). Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. ArXiv preprint; Zou, A. et al (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. ArXiv preprint.
74 See CCIA Comment at 1-2 (“Open models also present advantages in AI governance, being easier to understand and test.”).
75 Mozilla Open Source Audit Tooling (OAT) Project. (n.d.). Mozilla.
76 Lambert, N. et al. (2024, June 8). RewardBench: Evaluating Reward Models for Language Modeling. ArXiv.
77 Hubinger, E., et al. (2024, January 17). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. ArXiv; Measuring the impact of post-training enhancements. (n.d.). METR’s Autonomy Evaluation Resources.
78 See also AI Accountability Policy Report, National Telecommunications and Information Administration. (2024, March) at 70 (noting that “[i]ndependent AI audits and evaluations are central to any accountability structure []”).
79 See Anthony Barret Comment at 2 (“Although both closed and open models can pose some such risks, unsecured models pose unique risks in that safety and ethical safeguards that were implemented by developers can be removed relatively easily from models with widely available weights (e.g., via fine tuning).”) (citation omitted).
80 See generally, Hofmann, et al. (2024, March 1). Dialect prejudice predicts AI decisions about people’s character, employability, and criminality. ArXiv.
81 Examining Malicious Hugging Face ML Models with Silent Backdoor. (2024, February 27). JFrog.