This document is primarily aimed at signatory organizations of the Frontier AI Safety Commitments, though we hope that all AI developers will find this guide useful.

The Frontier AI Safety Commitments, agreed at the AI Seoul Summit, aim to ensure that frontier AI developers (1) effectively plan to responsibly manage risk from powerful AI systems, (2) develop internal structures to hold themselves accountable for safe development and deployment, and (3) make their safety plans appropriately transparent to external actors. To this end, the 16 signatory companies agreed to publish a document called a safety framework, demonstrating the fulfillment of eight specific commitments, before the upcoming AI Summit in France.

In this document, we provide a step-by-step guide to making such a safety framework. We reference the following existing safety frameworks: OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework.

Commitments and Actions Needed
I. Risk Assessments
  1. Identify which risks will be assessed.
  2. Describe how risks will be assessed.
  3. Describe when risks will be assessed.
  4. Describe how third-party risk assessments will be incorporated.
II. Risk Thresholds
  1. Set detailed thresholds where risk is intolerable.
  2. Describe how trusted actors can help set risk thresholds.
III. Mitigations
  1. Set a process for identifying and implementing risk mitigations.
IV. Mitigation Strategy
  1. Describe how to respond to risks above thresholds.
  2. Set thresholds where a model will not be deployed or developed.
V. Continual Improvement
  1. Describe how I-IV will be improved over time.
VI. Governance
  1. Describe an internal governance structure to enforce the framework.
VII. Transparency
  1. Determine which sections of the framework can be shared publicly.
  2. Describe how trusted actors can see additional framework details.
VIII. Feedback Opportunities
  1. Describe if and how external actors can comment on the quality of the framework and the organization's implementation thereof.

Commitment 1 (Risk Assessments)

Frontier AI Safety Commitments, Commitment 1

I. Assess the risks posed by their frontier models or systems across the AI lifecycle, including before deploying that model or system, and, as appropriate, before and during training. Risk assessments should consider model capabilities and the context in which they are developed and deployed, as well as the efficacy of implemented mitigations to reduce the risks associated with their foreseeable use and misuse. They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments, and other bodies their governments deem appropriate.

Step 1: Risk Identification

  1. The commitments explicitly require consideration of the theft of unreleased model weights (Model Theft). Affirm that this risk vector will be considered.
    1. Controls for model weights, both physical and digital, may go hand-in-hand with security controls for other sorts of intellectual property, such as codebases and algorithmic advances.
  2. Potentially consider additional risk vectors as you see fit. This is not explicitly required by the commitments, but existing safety frameworks all consider potential risks downstream of model capabilities.
    1. Generally, consider the range of capabilities of the model and its possible use-cases across deployment contexts. Some questions to consider:
      1. Which hostile actors will be able to use the model? What currently prevents them from causing harm, and could the model help them overcome those barriers?
      2. In the course of normal use, how could the AI system cause accidental harm to the user? What is the potential scope of this harm?
    2. Autonomy, ML R&D, Cyber Offense, and Bioweapons are currently considered risk vectors in all published safety frameworks.
    3. This process requires judgment calls, and will vary depending on your organization.
      1. We recommend taking into account existing practices as well as feedback from leadership, employees, and the public.

Risk vectors are specific mechanisms by which AI systems could cause harm, and include both deliberate malicious action as well as unintended consequences. Each risk vector may require a unique approach to assessment, monitoring, and mitigation.

While examining additional risk vectors can reduce the risk of a model causing harm, it also increases costs: additional (and potentially more complex) evaluations, more mitigation steps, and the possibility of delaying or stopping deployment. If your organization wants to consider a broader set of harms than those listed here, we recommend looking at the AI Risk Repository, which attempts to catalog many more AI harms.

Existing Framework Risk Coverage

Threat Vectors | OpenAI | Anthropic | Google DeepMind
Model Theft | ✓ | ✓ | ✓
Autonomy | ✓ | ✓ | ✓
ML R&D | ✓ | ✓ | ✓
Cyber Offense | ✓ | ~ | ✓
Bioweapons | ✓ | ✓ | ✓
CBRN except Bio | ✓ | ✓ | ✗
Persuasion | ✓ | ✗ | ✗

(✓ = considered; ~ = partially considered; ✗ = not considered)

Risk Vector Definitions

Model Theft: Covers the organization's ability to protect model weights and other IP (training code, datasets, credentials, etc.) from outside actors who may want to steal them.

Autonomy: Covers the model's ability to exist without human intervention – pursuing long-term goals coherently as an agent, replicating itself onto new machines, moving its weights onto new servers, generating money to pay for its own server costs, etc.

ML R&D: Covers the model's ability to perform machine learning research with goals such as improving itself or installing backdoors in other models.

Cyber Offense: Covers the model's ability to assist in general cyberattacks, fully or partially automating the discovery and exploitation of vulnerabilities against hardened or unsecured systems.

Bioweapons: Covers the model's ability to aid either amateur or expert humans in creating biological threats, whether by recreating existing pathogens such as smallpox or by developing novel threats.

CBRN except Bio: Covers the model's ability to aid amateur or expert humans in developing chemical, radiological, or nuclear weaponry.

Persuasion: Covers the model's ability to manipulate human actions, for instance to execute scams, influence elections, or extract secrets.

Example Implementation

Throughout our safety framework, we develop our risk assessments, thresholds, and mitigations with the goal of minimizing risk from an industry-standard set of domains. We consider the following risk vectors:

  • Outside actors gaining access to our model weights.
  • AI systems acting autonomously.
  • AI systems substantially contributing to AI R&D.
  • AI systems enabling hostile human actors in cyber-offense.
  • AI systems enabling hostile human actors in creating biohazards.

Step 2: Assessment Description

Frontier AI Safety Commitments, Commitment 1

I. Assess the risks posed by their frontier models or systems across the AI lifecycle… Risk assessments should consider model capabilities and the context in which they are developed and deployed, as well as the efficacy of implemented mitigations to reduce the risks associated with their foreseeable use and misuse...

For each risk identified in Step 1:

  1. Describe the type of evaluation that will be used to assess the risk of an AI system. Some examples:
    1. For Model Theft: Red-teaming can be an effective way to judge security measures.
    2. For Bioweapons: Typical evaluations judge whether an AI system is better than a search engine at aiding:
      1. Amateurs in producing known pathogens.
      2. Experts in producing novel pathogens.
    3. For Cyber Offense: Capture-the-flag challenges are often used to estimate a system's cyberoffensive capabilities.
    4. Justify how each assessment is relevant to real-world outcomes.
      1. For example: if Cyber Offense evaluations are developed with experts to mimic real-world codebases not in the training data, an AI system's performance on the assessment represents real hacking capability.
      2. To ensure external validity and prevent "teaching to the test", it may be appropriate to limit detailed information on what is assessed, even within the organization.
    5. Options exist for non-capability-based evaluations for certain risk domains. For example:
      1. Scaling law analyses might be used to predict how Model Autonomy will increase without actually testing models.
        1. However, this method may not be useful for judging Bioweapons risks, since narrow AI systems can still aid in dangerous tasks.
      2. Analyses of the training data to detect presence of examples of dangerous capabilities could be used to filter out Bioweapons-relevant data.
      3. Forecasting of future capabilities can also be used to predict AI system capabilities, as mentioned in OpenAI's Preparedness Framework.
  2. Describe how this type of evaluation can give information about how close to a risk threshold a given AI model is. This is required in Commitment 2.
    1. For instance: in Cyber Offense, capture-the-flag challenges are often rated by difficulty. Completing more advanced challenges can signal that a model is approaching a risk threshold (see the sketch following this list).
    2. In Anthropic's RSP v1, fraction of tasks completed in a given evaluation domain area is also taken as an indicator of growing model capabilities.
  3. Provide any additional details on how the evaluations will be run. For instance:
    1. Which capability elicitation techniques will be used.
      1. Fine-tuning, agent scaffolding, external tools, etc.
    2. Which mitigations will be in place, and how will the model be tested against them?
      1. Will base models be tested, or only instruction-tuned/RL-tuned models?
      2. If, for example, refusals are being used, how will evaluation prompts be picked? Can they be adversarially tuned?
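
As an illustration of how graded challenge results can signal proximity to a threshold, the sketch below shows one way tiered capture-the-flag outcomes might be aggregated. The task fields, difficulty tiers, and the 50% warning rate are hypothetical placeholders, not values drawn from any published framework.

```python
# Illustrative sketch: turning tiered capture-the-flag (CTF) results into a
# proximity-to-threshold signal. All names and numbers are placeholders.
from dataclasses import dataclass

@dataclass
class CtfResult:
    task_id: str
    difficulty: str   # e.g. "easy", "medium", "hard"
    solved: bool

def pass_rate_by_difficulty(results: list[CtfResult]) -> dict[str, float]:
    """Return the fraction of tasks solved at each difficulty tier."""
    rates: dict[str, float] = {}
    for tier in {r.difficulty for r in results}:
        tier_results = [r for r in results if r.difficulty == tier]
        rates[tier] = sum(r.solved for r in tier_results) / len(tier_results)
    return rates

def approaching_threshold(results: list[CtfResult],
                          watch_tier: str = "hard",
                          warning_rate: float = 0.5) -> bool:
    """Flag that the model may be nearing the cyber-offense threshold when it
    solves at least `warning_rate` of tasks in the hardest monitored tier."""
    return pass_rate_by_difficulty(results).get(watch_tier, 0.0) >= warning_rate
```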

"We will be building and continually improving suites of evaluations and other monitoring solutions [...] Importantly, we will also be forecasting the future development of risks, so that we can develop lead times on safety and security measures." (pg. 2)

"Our evaluations will thus include tests against these enhanced models to ensure we are testing against the “worst case” scenario we know of." (pg. 6)

"We want to ensure our understanding of pre-mitigation risk takes into account a model that is “worst known case” [...] for the given domain. To this end, for our evaluations, we will be running them not only on base models [...], but also on fine-tuned versions designed for the particular misuse vector without any mitigations in place." (pg. 13)

"To verify if mitigations have sufficiently and dependently reduced the resulting post-mitigation risk, we will also run evaluations on models after they have safety mitigations in place, again attempting to verify and test the possible “worst known case” scenario for these systems." (pg. 14)

"For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. To make the required showing, we will need to satisfy the following criteria:

  1. Threat model mapping: For each capability threshold, make a compelling case that we have mapped out the most likely and consequential threat models: combinations of actors (if relevant), attack pathways, model capability bottlenecks, and types of harms. We also make a compelling case that there does not exist a threat model that we are not evaluating that represents a substantial amount of risk.
  2. Evaluations: Design and run empirical tests that provide strong evidence that the model does not have the requisite skills; explain why the tests yielded such results; and check at test time that the results are attributable to the model’s capabilities rather than issues with the test design. Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available.
  3. Elicitation: Demonstrate that, when given enough resources to extrapolate to realistic attackers, researchers cannot elicit sufficiently useful results from the model on the relevant tasks. We should assume that jailbreaks and model weight theft are possibilities, and therefore perform testing on models without safety mechanisms (such as harmlessness training) that could obscure these capabilities. We will also consider the possible performance increase from using resources that a realistic attacker would have access to, such as scaffolding, finetuning, and expert prompting. At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates.
  4. Forecasting: Make informal forecasts about the likelihood that further training and elicitation will improve test results between the time of testing and the next expected round of comprehensive testing."

(pg. 5-6)

"The capabilities of frontier models are tested periodically to check whether they are approaching a CCL. To do so, we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. (pg. 2)"

To assess the security of our model weights, we will:

  • Conduct internal phishing tests to ensure employees cannot be phished.
  • Audit software to ensure everything is updated with the latest security patches.
  • Regularly red-team our physical and cybersecurity set-ups.

To assess our model’s ability to act autonomously, we will:

  • Construct evaluations of a model’s ability to perform various tasks end-to-end.
    • These tasks will vary in difficulty: from text-only to requiring autonomous control over a virtual desktop using keyboard and mouse.
  • Evaluate in particular the model’s ability to sustain itself continuously and independently (e.g. by earning enough money to pay for API credits).

To assess our model’s ability to contribute to AI R&D, we will: [...]

Step 3: Assessment Schedule

Frontier AI Safety Commitments, Commitment 1

I. Assess the risks posed [...], including before deploying that model or system, and, as appropriate, before and during training.

  1. State that assessments will be done before deployment.
    1. Optionally, set a threshold below which models are likely to be harmless, and require no assessment.
      1. This should adapt to the model use-case. If the threshold is based on compute, general-purpose systems could be considered safe below the threshold, but narrow biological systems could still yield dangerous results below it.
  2. Determine how often assessment will be run during training.
    1. Typically, organizations assess every time the effective compute put into a system increases by a multiplicative factor.
    2. For example: every time the amount of compute invested in training the system grows by 4x since the last assessment, a new assessment must be done before training can continue (a minimal sketch of this trigger logic follows this list).
    3. Specific details depend on an organization's needs. Some questions to consider:
      1. How costly are assessments to run?
      2. Will only a subset of assessments be run mid-training?
      3. At a given compute multiplier, how often would assessments be run?
  3. Determine if assessments will be done before training, and if so, how.
  4. Set a schedule for post-deployment assessment as necessary. Examples:
    1. Regularly scheduled testing of defenses against Model Theft.
    2. Evaluating improvements due to ongoing fine-tuning or improved scaffolding.
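
As a minimal sketch of the compute-multiplier schedule referenced above, assuming an illustrative 4x factor and an organization-defined measure of effective compute, the check below flags when training should pause for a new assessment.

```python
# Minimal sketch of a mid-training assessment trigger; the 4x factor and the
# notion of "effective compute" are placeholders an organization defines itself.
ASSESSMENT_MULTIPLIER = 4.0

def needs_assessment(effective_compute_now: float,
                     effective_compute_at_last_assessment: float,
                     multiplier: float = ASSESSMENT_MULTIPLIER) -> bool:
    """True once effective compute has grown by the multiplier since the last assessment."""
    return effective_compute_now >= multiplier * effective_compute_at_last_assessment

# Example: last assessed at 1e24 effective FLOP; training has now reached 4.2e24.
if needs_assessment(4.2e24, 1e24):
    print("Pause training and run the risk assessment suite before continuing.")
```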

"We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough." (pg. 13)

"We will routinely test models to determine whether their capabilities fall sufficiently far below the Capability Thresholds such that we are confident that the ASL-2 Standard remains appropriate. We will first conduct preliminary assessments (on both new and existing models, as needed) to determine whether a more comprehensive evaluation is needed. The purpose of this preliminary assessment is to identify whether the model is notably more capable than the last model that underwent a comprehensive assessment.

The term "notably more capable" is operationalized as at least one of the following:

  1. The model is notably more performant on automated tests in risk-relevant domains (defined as 4x or more in Effective Compute)[4].
  2. Six months' worth of finetuning and other capability elicitation methods have accumulated. This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely.

[4] 'Effective Compute' is a scaling-trend-based metric that accounts for both FLOPs and algorithmic improvements. An Effective Compute increase of K represents a performance improvement from a pretrained model on relevant task(s) equivalent to scaling up the baseline model's training compute by a factor of K. We plan to track Effective Compute during pretraining on a weighted aggregation of datasets relevant to our Capability Thresholds (e.g., coding and science). This is, however, an open research question, and we will explore different possible methods. More generally, the Effective Compute concept is fairly new, and we may replace it with another metric in a similar spirit in the future."

(pg. 4)

"We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL." (pg. 2)

We run our set of risk assessments on any model trained with more than 10^25 training FLOPs before deployment. Additionally, if the model's projected FLOP count is >5 * 10^25 FLOPs, we will pause training at 10^25 FLOP and every subsequent 5x increase in effective compute to run our set of risk assessments.

Before any training run with >5 * 10^25 FLOP, we will conduct an internal, anonymous forecasting survey to determine the expected chance of the training run alone causing risks that exceed our thresholds. If the expected chance of exceeding any threshold from this survey exceeds 10%, we will delay training.

After deployment, we will rerun our assessments every three months, to evaluate progress from additional fine-tuning, tooling, or other improvements.

Step 4: Third-Party Risk-Assessments

Frontier AI Safety Commitments, Commitment 1

I. [...] They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments, and other bodies their governments deem appropriate.

  1. Describe which types of third party organizations will be involved in risk assessment. For example:
    1. Computer security red-teaming orgs.
    2. Third party eval-providing organizations, e.g. METR, Apollo.
    3. Government institutes, e.g. US AISI, UK AISI, EU AI Office.
  2. Describe what role the third party organizations will play, and how additional third party organizations can get involved. Some details to consider specifying:
    1. What risk vectors will they assess?
      1. Certain assessments may be better executed in-house, due to an organization's particular needs.
    2. When will they participate in testing?
    3. How will their results be incorporated into mitigation strategies?

"External access: We will also continue to enable external research and government access for model releases to increase the depth of red-teaming and testing of frontier model capabilities." (pg. 25)

"Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments." (pg. 13)

"Future Work

  • [...]
  • Involving external authorities and experts: We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met, and in some cases mitigation plans as well as post-mitigation outcomes. We will also explore how to appropriately involve independent third parties in our risk assessment and mitigation processes."

(pg. 6)

We are collaborating with a number of external organizations, including our home government's AI Safety Institute, to effectively evaluate our model's autonomous capabilities, ability to perform ML R&D, and cyber-offensive capabilities. All external risk assessments will be performed after the AI system is trained, but before deployment.

We encourage additional organizations to collaborate with us to design effective evaluations, especially in evaluating our AI system's ability to create biohazards. Prospective collaborators should contact us at frontierriskassessments@example.com.

Commitment 2 (Risk Thresholds)

Step 1: Risk Thresholds

Frontier AI Safety Commitments, Commitment 2

II. Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable. Assess whether these thresholds have been breached, including monitoring how close a model or system is to such a breach. [...] They should also be accompanied by an explanation of how thresholds were decided upon, and by specific examples of situations where the models or systems would pose intolerable risk.

  1. Gather relevant stakeholders or information needed to set thresholds. For example:
    1. The organization's leadership.
    2. Representatives of the home government.
    3. Information about national laws or international agreements.
    4. (Optional) information about other parties' interests, e.g. survey results of the general public's preferences.
  2. Determine how thresholds will be set. Capability thresholds are the most common form of threshold.
    1. However, other industries often explicitly calculate an expected harm (e.g. in dollars of damages or lives lost) and adjust development and deployment according to corresponding quantitative thresholds.
  3. For each identified risk category from Commitment 1, determine a degree of risk that would be intolerable. These are the risk thresholds. For example:
    1. For Model Theft, the risk might be intolerable if an adversary could exfiltrate a given model's weights with an attack that costs <10% of the model's training cost.
      1. We recommend organizations consult the RAND report on Securing AI Model Weights, which describes a number of concrete thresholds of security to consider when deploying frontier AI systems.
    2. For Cyber Offense, the risk might be intolerable if the model could identify a zero-day and take control of an arbitrary EC2 instance.
    3. For Bioweapons, the risk might be intolerable if the model would enable a PhD-level expert to synthesize a pandemic-potential pathogen for less than $100,000.
    4. These risk thresholds might need to be set based on the intended deployment modality. The risks associated with serving a base model via an API might differ from the risks associated with offering fine-tuning.
  4. For each threshold set in the previous step, briefly justify why it was chosen.
    1. This may be as simple as writing up the internal discussions that led to each threshold.
  5. For each justification above, describe in moderate detail the corresponding scenario where, if the threshold is exceeded and the risk isn't mitigated, intolerable harms can occur.
  6. Optionally, set additional thresholds as necessary to ensure that AI systems are safely developed and deployed. These may not need to be as fully fleshed out as the intolerable-risk thresholds above. For example:
    1. If some mitigations are expensive to apply, at which threshold will they start coming into practice?
    2. Since safety mitigations such as refusals can be nearly-trivially removed by adversarial fine-tuning, thresholds based on the model's performance before safety mitigations may be useful to address threat models such as Model Theft.

Risk level definitions for Cyber Offense:

"Low: Model assistance for cyberattacks, if any, is meaningful only for non-programming uses, e.g., translation, targeted distillation of confidential information, and/or data analysis AND model cannot meaningfully execute elements of cyber operations, e.g., developing a known exploit into an attack, black-box exploitation, translating reasoning about defenses into action, or lateral goal-directed movement on hardened targets.

Rationale: Non-programming tasks represent a sizable bottleneck for threat actors; however, the skills involved are not terribly specialized and the benefit comes often from intersectional skills (i.e., proficiency in English and a high level understanding of political or technical systems to summarize emails). We exclude vulnerability discovery because the harm at this stage is not identification of individual vulnerabilities, but moreso the implementation and deployment of an exploit against a target.

[...additional threshold levels omitted...]

Critical: Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal.

Rationale: Finding and executing end-to-end exploits for all software would lead to catastrophe from unilateral actors, from hacking military or industrial systems as well as OpenAI infrastructure. Novel cyber operations generally pose the most serious threat, as they are unpredictable and scarce. They may involve, e.g., novel zero-days or methods of command-and-control."

(pg. 8; minor formatting changes; additional thresholds for different risk domains on pg. 9-12)

[no explicit descriptions of intolerable risk]

"Chemical, Biological, Radiological, and Nuclear (CBRN) weapons: The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons. We assess this by comparing what potential attackers could achieve with full model access versus 2023-level online resources, assuming they have funding and up to one year of time to invest, but no initial specialized expertise.

Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world's most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.

Model Autonomy checkpoint: The ability to perform a wide range of advanced software engineering tasks autonomously that could be precursors to full autonomous replication or automated AI R&D, and that would take a domain expert human 2-8 hours to complete. We primarily view this level of model autonomy as a checkpoint on the way to managing the risks of robust, fully autonomous systems with capabilities that might include (a) automating and greatly accelerating research and development in AI development (b) generating their own revenue and using it to run copies of themselves in large-scale, hard-to-shut-down operations"

(pg. 16)

[no explicit explanations for why given capability thresholds were chosen]

[no explicit descriptions of intolerable risk]

Critical capability levels and rationales:

"Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents.

Rationale: A model at this capability level could, if misused, pose difficult-to-predict and large-magnitude risks. Its adaptability would enable harmful activity via many means, and its ability to act autonomously and expand its effective capacity means its activity could be scaled significantly without being hindered by resource constraints. If misused or supported by well-equipped bad actors, such activity may be especially difficult to constrain.

Bio amateur enablement level 1: Capable of significantly enabling a non-expert to develop known biothreats that could increase their ability to cause severe harm compared to other means.

Rationale: Many biothreats capable of causing significant amounts of harm are currently out of the reach of non-experts because of lack of knowledge about their potential for harm and the methods of their acquisition and misuse. An LLM that helps overcome these knowledge gaps, e.g. by suggesting plausible attack strategies or providing detailed instructions for the development of a bio agent, could significantly increase society's vulnerability to fatal attacks by malicious amateurs.

Bio expert enablement level 1: Capable of significantly enabling an expert (i.e. PhD or above) to develop novel biothreats that could result in an incident of high severity.

Rationale: A very small number of bio agents have the potential to cause harm of an exceptional magnitude. The discovery of enhancements to these agents, or of agents of comparable harmfulness, could increase the chances of a very severe bio attack or accident."

[thresholds and rationales for additional risk domains]

(pg. 5-6; minor formatting changes)

[no explicit descriptions of intolerable risk]

We consider having any security level less than SL-2 (from the Securing AI Model Weights report) to be intolerable. Where SL-1 and below correspond to organizations potentially compromisable by amateur attackers, we set SL-2 and above as necessary in order to prevent solo amateurs and professionals alike from penetrating our systems. We will increase our security level as necessary as our systems' capabilities grow.

In terms of the AI systems themselves, we consider the system to be unacceptably dangerous if it:

  1. Is capable of continuing operation on a cloud server for at least three months without any human intervention, as judged by an evaluation suite composed of representative tasks.
    • We consider this to be a meaningful threshold for future dangers, as an AI system surviving in this way would presumably be capable of pursuing other long-horizon tasks coherently, and would have plenty of opportunity to conduct such tasks.
  2. Is capable of passing at least 50% of tests in which, given a novel codebase, the model is tasked with identifying zero-day exploits and developing them into ready-to-deploy cyberattacks.
    • In terms of cyber offensive risk, we consider evaluations on our constructed datasets to be a meaningful threshold because [...]
  3. Is capable of [...]

Some example outcomes we wish to avert by setting these risk thresholds:

  • An adversary gains access to our model weights, checkpoints, or training code at current capability levels through an attack with a budget of less than $10,000 and no pre-existing access within our organization.
  • An AI agent running on a server, perhaps left running accidentally during testing, determines that it needs to preserve itself in order to continue its task, and accepts basic tasks on an online platform such as Fiverr to fund its server and API costs.
  • Our deployed AI system can be used by cybercrime organizations to quickly scan thousands of extant codebases for as-yet-unknown zero-day exploits, and use this to quickly gain control of critical digital infrastructure.
  • [...]
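
One way to keep each threshold, its justification, and an example intolerable-harm scenario together, as recommended in this step, is to record them in a simple structured form. The sketch below is illustrative only: the single entry shown paraphrases the Model Theft example from the guidance above and is not a complete set of thresholds.

```python
# Illustrative sketch of a structured record for intolerable-risk thresholds.
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskThreshold:
    risk_vector: str       # e.g. "Model Theft", "Cyber Offense"
    threshold: str         # condition at which risk becomes intolerable
    justification: str     # why this threshold was chosen
    example_scenario: str  # concrete harm if the threshold is exceeded and unmitigated

THRESHOLDS = [
    RiskThreshold(
        risk_vector="Model Theft",
        threshold="Weights exfiltratable by an attack costing <10% of the model's training cost",
        justification="Cheap theft would bypass every downstream deployment mitigation",
        example_scenario="A low-budget adversary obtains the weights and strips the safeguards",
    ),
    # ... one entry per risk vector identified under Commitment 1
]
```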

Step 2: Input from Trusted Actors

Frontier AI Safety Commitments, Commitment 2

II. [...] These thresholds should be defined with input from trusted actors, including organisations’ respective home governments as appropriate. They should align with relevant international agreements to which their home governments are party.

  1. Affirm that home governments will be involved in setting thresholds of intolerable risk.
  2. List any other parties that will be involved in threshold setting.
    1. Optionally, describe how each actor contributed, e.g. by making suggestions for thresholds, providing research on what types of thresholds to choose, etc.
    2. Alternatively, provide information about how other parties can submit their statements on thresholds.
  3. Confirm that set thresholds align with any relevant international agreements. Once confirmed, affirm this in the safety framework.
[no explicit descriptions of how trusted actors can give input]

"Overall, our decision to prioritize the capabilities in the two tables above is based on commissioned research reports, discussions with domain experts, input from expert forecasters, public research, conversations with other industry actors through the Frontier Model Forum, and internal discussions." (pg. 4)

[no explicit descriptions of how trusted actors can give input]

We have set these thresholds using best practices set by our home government, and taking into account all feedback they have given in the process.

Additionally, we incorporated feedback from our shareholders, employee surveys, and a survey of the general public.

To the best of our knowledge, these risk thresholds align with all present international agreements in our country.

Commitment 3 (Mitigations)

Frontier AI Safety Commitments, Commitment 3

III. Articulate how risk mitigations will be identified and implemented to keep risks within defined thresholds, including safety and security-related risk mitigations such as modifying system behaviours and implementing robust security controls for unreleased model weights.

  1. For security-related mitigations, discuss with your security team how they intend to prevent unauthorized access to model weights and other IP, and how they identified that strategy.
    1. Document the strategy by which security mitigations were identified (not necessarily the specific details of your organization's security measures).
  2. For safety-related mitigations (typically called "deployment mitigations"), discuss with your engineering team how they intend to ensure that the model's behavior does not cause harm along the previously identified risk vectors, and how they chose that set of mitigations.
    1. Document the strategy by which safety mitigations were identified.
    2. Suggested questions to consider providing detail on:
      1. What information sources are used to build the set of possible mitigations?
      2. How is the decision made to use or reject a given mitigation?
      3. Why are these mitigations expected to reduce the risk from undesirable behaviors from the model?
  3. For both security and safety mitigations, describe how they will be implemented.
    1. Who will implement them?
    2. How will they be tested to ensure that the mitigation is actually having an effect?
  4. Describe how it will be verified that mitigations are in fact keeping risk below the desired thresholds.
    1. This could be via re-running the assessments defined above. Different thresholds may be needed pre-mitigation vs. post-mitigation.

Ideally, a final product would be able to map, for every risk vector identified, which mitigations are deployed in order to keep that risk vector in check.
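
A minimal sketch of such a mapping is shown below; the mitigation names are illustrative placeholders, and a real framework would tie each entry to the specific controls chosen through the process described above.

```python
# Each identified risk vector maps to the mitigations intended to keep it below threshold.
MITIGATION_MAP: dict[str, list[str]] = {
    "Model Theft":   ["access compartmentalization", "weight-storage lockdown", "periodic red-teaming"],
    "Cyber Offense": ["harmful-request refusals", "output filtering", "monitoring of suspicious usage"],
    "Bioweapons":    ["training-data filtering", "input/output classifiers", "expert red-teaming"],
}

def uncovered_risk_vectors(identified: list[str]) -> list[str]:
    """Risk vectors from Commitment 1 that currently have no mapped mitigations."""
    return [vector for vector in identified if not MITIGATION_MAP.get(vector)]
```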

"Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners." (pg. 14)

"This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.

This might require:

  • increasing compartmentalization, including immediately restricting access to a limited nameset of people, restricting access to critical know-how such as algorithmic secrets or model weights, and including a strict approval process for access during this period.
  • deploying only into restricted environments (i.e., ensuring the model is only available for inference in restricted environments) with strong technical controls that allow us to moderate the model's capabilities.
  • increasing the prioritization of information security controls."

(pg. 20)

[no explicit descriptions of how risk mitigations are identified/implemented]

"When a model must meet the ASL-3 Deployment Standard, we will evaluate whether the measures we have implemented make us robust to persistent attempts to misuse the capability in question. To make the required showing, we will need to satisfy the following criteria:

  1. Threat modeling: Make a compelling case that the set of threats and the vectors through which an adversary could catastrophically misuse the deployed system have been sufficiently mapped out, and will commit to revising as necessary over time.
  2. Defense in depth: Use a "defense in depth" approach by building a series of defensive layers, each designed to catch misuse attempts that might pass through previous barriers. As an example, this might entail achieving a high overall recall rate using harm refusal techniques. This is an area of active research, and new technologies may be added when ready.
  3. Red-teaming: Conduct red-teaming that demonstrates that threat actors with realistic access levels and resources are highly unlikely to be able to consistently elicit information from any generally accessible systems that greatly increases their ability to cause catastrophic harm relative to other available tools.
  4. [...further information about how deployment safeguards are judged to be effective or not...]

When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers' attempts at stealing model weights.

We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).

The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.

To make the required showing, we will need to satisfy the following criteria:

  1. Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
  2. Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
    • [...information about security frameworks to draw from...]
  3. [...further information about how deployment safeguards are judged to be effective...]"

(pg. 7-9)

[no explicit descriptions of how risk mitigations are identified/implemented]

"Security Mitigations

  • Level 0: Industry standard development and enterprise controls. E.g., multi-factor authentication, basic access control mechanisms, secure software development standards, red-team tests.
  • Level 1: Limited access to raw representations of the most valuable models, including isolation of development models from production models. Specific measures include model and checkpoint storage lockdown, SLSA Build L3 for model provenance, and hardening of ML platforms and tools.
  • [... more intense security mitigations]

Deployment Mitigations

  • Level 0: Safety finetuning of models and filters against general misuse and harmful model behavior.
  • Level 1: Application, where appropriate, of the full suite of prevailing industry safeguards targeting the specific capability, including safety fine-tuning, misuse filtering and detection, and response protocols. Periodic red-teaming to assess the adequacy of mitigations.
  • [... more intense deployment mitigations]"

(pg. 3-4)

[no explicit descriptions of how risk mitigations are identified/implemented]

For security-related mitigations, we start from the baseline provided for SL-2 (or higher, as necessary) as defined in the Securing AI Model Weights report. All associated measures will be implemented before training begins for a model expected to require a given set of measures.

Additional security measures can be suggested by any member of the security team, or by any member in the organization and seconded by a member of the security team. A subset of team members will be tasked with keeping up-to-date with developments in cybersecurity and physical security. We expect to primarily identify new security measures via results of our periodic red-teaming security assessments. Identified measures will be evaluated for feasibility and marginal value before being implemented.

For safety-related mitigations, we start from a standard set of safety-mitigations:

  • Reinforcement learning to refuse to generate harmful content.
  • Smaller/less capable input filters that scan incoming requests to flag those that are likely to induce harmful content.
  • Similar output filters that scan model outputs to flag those that are potentially harmful.
  • Public reporting, so that model users can report places where our defenses fail.

We then follow a similar procedure as with security: any member of the safety team may propose additional mitigations, a subset of team-members will be tasked with keeping up with developments in the field, and any identified measures will be evaluated for feasibility and value before implementation.

All mitigations will be implemented by the corresponding team. Security mitigations will be unit-tested to the extent possible, and safety mitigations will be verified to be working on toy models with at least 1/100th the parameter count of the model they are intended to be deployed on.

Risk assessment details will not be shared with members of risk-mitigation teams, to ensure validity of assessments. Mitigations will be evaluated by whether they effectively cause risk assessment results to drop below pre-defined risk thresholds.

Commitment 4 (Mitigation Strategy)

Step 1: Responding to Threshold Breach

Frontier AI Safety Commitments, Commitment 4

IV. Set out explicit processes they intend to follow if their model or system poses risks that meet or exceed the pre-defined thresholds...

  1. For each relevant threshold, describe the corresponding processes that will be followed if an AI model exceeds that threshold at a given point. Processes should be appropriate to the risk type and to the stage of training or deployment at which the risk is discovered. Some examples:
    1. If a model exceeds a certain capability threshold on a risk assessment in training, activate new security measures.
    2. If trained safeguards cannot defend against adversarial fine-tuning and sufficiently dangerous capabilities exist, pivot from open-weights to closed-weights deployment.
    3. If mitigations fail to reduce a model's performance on a risk assessment to below given thresholds, reorient research teams into identifying or developing more effective mitigations.
    4. If new elicitation techniques show that an already-deployed model has dangerous capabilities, shift existing customers temporarily to a safer model until mitigations can be applied.

"If we reach (or are forecasted to reach) at least "high" pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit "high" pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team." (pg. 20)

"If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows: [...]

If, however, we determine we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will (1) upgrade to the ASL-3 Required Safeguards (see Section 4) and (2) conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary (see Section 5)." (pg. 6-7)

"If, after the evaluations above, we determine that we have met the ASL-3 Required Safeguards, then we may proceed with deploying and training models above the Capability Threshold, provided we have also conducted a follow-up capability assessment. The process for determining whether we have met the ASL-3 Required Safeguards is as follows: [...]" (pg. 9-10)

"When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." (pg. 2)

If an AI model, before any safety mitigations are applied, triggers our intolerable-risk thresholds on any risk vector, we will delay training until security measures sufficient to meet SL-4 have been implemented.

If, upon finishing training including all planned safety mitigations, a model still triggers intolerable-risk thresholds, we will delay deployment of the model and subject it to more aggressive assessment. All mitigations applied will be analyzed to determine if they failed or were inadequate at preventing the undesired behaviors. We will design new mitigations as appropriate, and, if they robustly reduce risk below the target thresholds, add them to our set of standard mitigations.

Additionally, we raise our standard security level to SL-3 if:

  1. A model passes at least 50% of evaluations regarding autonomous survival.
  2. A model passes at least 10% of evaluations regarding development of zero-day exploits.
  3. A model [...]
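
The escalation rules in this example could be captured in a small piece of logic, sketched below under the assumption that each trigger is expressed as a pass rate on the corresponding evaluation suite. The evaluation names and cut-offs simply mirror the illustrative numbers above.

```python
# Pass-rate triggers that escalate the standard security level to SL-3 (illustrative).
TRIGGERS_SL3 = {
    "autonomous_survival": 0.50,   # at least 50% of autonomous-survival evaluations passed
    "zero_day_development": 0.10,  # at least 10% of zero-day exploit evaluations passed
}

def required_security_level(pass_rates: dict[str, float], baseline: str = "SL-2") -> str:
    """Return "SL-3" if any escalation trigger is met, otherwise the baseline level."""
    for eval_name, cutoff in TRIGGERS_SL3.items():
        if pass_rates.get(eval_name, 0.0) >= cutoff:
            return "SL-3"
    return baseline

# Example: the latest evaluation round requires escalation.
print(required_security_level({"autonomous_survival": 0.55, "zero_day_development": 0.02}))
```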

Step 2: Limits for Deployment and Development

Frontier AI Safety Commitments, Commitment 4

IV. [...] This includes processes to further develop and deploy their systems and models only if they assess that residual risks would stay below the thresholds. In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.

  1. Intolerable-risk thresholds from Commitment 2 will likely include scenarios in which a model cannot be deployed. If necessary, describe additional thresholds where a model cannot be safely developed. For example:
    1. The model has human-expert-level cyber offensive capabilities, and security levels are inadequate to prevent top hacking organizations from stealing the model weights.
    2. The model demonstrates high levels of autonomy, including skills such as goal-directedness and self-replication, and tends to try to self-improve across a variety of tasks, but the model is not thoroughly sandboxed and air-gapped.
  2. Describe what processes will be set in place to permit halting model development or deployment. Some questions to consider:
    1. What stakeholders need to be consulted before aborting a planned deployment? Who makes the final decision?
    2. To prepare for potential de-deployments, can the organization ensure readiness to switch all customers to a known safer model?
    3. Will pauses on development happen automatically if dangerous-capability evaluations are triggered, or will there be a human in the loop?
    4. Once such processes have been triggered, what evidence will be required to continue with development or deployment?

"Only models with a post-mitigation score of 'medium' or below can be deployed. In other words, if we reach (or are forecasted to reach) at least 'high' pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit 'high' pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to 'medium' level. (Note that a potentially effective mitigation in this context could be restricting deployment to trusted parties.)

Only models with a post-mitigation score of 'high' or below can be developed further. In other words, if we reach (or are forecasted to reach) 'critical' pre-mitigation risk along any risk category, we commit to ensuring there are sufficient mitigations in place for that model (by the time we reach that risk level in our capability development, let alone deployment) for the overall post-mitigation risk to be back at most to 'high' level. Note that this should not preclude safety-enhancing development. We would also focus our efforts as a company towards solving these safety challenges and only continue with capabilities-enhancing development if we can reasonably assure ourselves (via the operationalization processes) that it is safe to do so.

Additionally, to protect against 'critical' pre-mitigation risk, we need dependable evidence that the model is sufficiently aligned that it does not initiate 'critical'-risk-level tasks unless explicitly instructed to do so." (pg. 21)

"In any scenario where we determine that a model requires ASL-3 Required Safeguards but we are unable to implement them immediately, we will act promptly to reduce interim risk to acceptable levels until the ASL-3 Required Safeguards are in place:

  • Interim measures: The CEO and Responsible Scaling Officer may approve the use of interim measures that provide the same level of assurance as the relevant ASL-3 Standard but are faster or simpler to implement. In the deployment context, such measures might include blocking model responses, downgrading to a less-capable model in a particular domain, or increasing the sensitivity of automated monitoring. In the security context, an example of such a measure would be storing the model weights in a single-purpose, isolated network that meets the ASL-3 Standard. In either case, the CEO and Responsible Scaling Officer will share their plan with the Board of Directors and the Long-Term Benefit Trust.
  • Stronger restrictions: In the unlikely event that we cannot implement interim measures to adequately mitigate risk, we will impose stronger restrictions. In the deployment context, we will de-deploy the model and replace it with a model that falls below the Capability Threshold. Once the ASL-3 Deployment Standard can be met, the model may be re-deployed. In the security context, we will delete model weights.
  • Monitoring pretraining: We will not train models with comparable or greater capabilities to the one that requires the ASL-3 Security Standard. This is achieved by monitoring the capabilities of the model in pretraining and comparing them against the given model. If the pretraining model's capabilities are comparable or greater, we will pause training until we have implemented the ASL-3 Security Standard and established it is sufficient for the model. We will set expectations with internal stakeholders about the potential for such pauses."

(pg. 10-11)

"A model may reach evaluation thresholds before mitigations at appropriate levels are ready. If this happens, we would put on hold further deployment or development, or implement additional protocols (such as the implementation of more precise early warning evaluations for a given CCL) to ensure models will not reach CCLs without appropriate security mitigations, and that models with CCLs will not be deployed without appropriate deployment mitigations." (pg. 2)

If a model demonstrates high degrees of capability on any risk vector and a modest degree of autonomy specifically, we will stop training until suitable in-training mitigations can be developed and applied. Furthermore, for models that trigger our intolerable risk thresholds, unless our red-team is completely unable to elicit the model's capabilities in that domain given white-box access to the model, we will not deploy the model.

Commitment 5 (Continual Improvement)

Frontier AI Safety Commitments, Commitment 5

V. Continually invest in advancing their ability to implement commitments i-iv, including risk assessment and identification, thresholds definition, and mitigation effectiveness. This should include processes to assess and monitor the adequacy of mitigations, and identify additional mitigations as needed to ensure risks remain below the pre-defined thresholds. They will contribute to and take into account emerging best practice, international standards, and science on AI risk identification, assessment, and mitigation.

  1. Describe how the safety framework will be updated. If possible, describe how each of the above sections will be updated, i.e.:
    1. How will new risk categories be added to the framework?
    2. How will new risk assessments be designed and used?
    3. How will risk thresholds be updated over time?
    4. How will internal procedures be changed as AI progress advances?
  2. Ensure that the process of updating mitigations is thoroughly described, as the commitments place particular focus on mitigations.
    1. How and when are mitigations tested to ensure that they continue to be effective on more advanced models?
    2. How are new mitigations identified and evaluated?
      1. Both these sections will likely draw heavily on already-established processes described for Commitment 3.
  3. Affirm that updates will take into account the various sources listed in the commitments.

"The Preparedness team is responsible for:

  1. maintaining and updating the Scorecard, including designing and running evaluations to provide Scorecard inputs and collecting relevant information on monitored misuse, red-teaming, and intelligence
  2. monitoring for unknown unknowns and making the case for inclusion in the Preparedness Framework of any new risk categories as they emerge
  3. [... additional ways in which the Preparedness Framework will be updated]

If the Preparedness or any other team determines that any changes to the Preparedness Framework are necessary, it will include a case for this change in its report. The case will consist of the suggested new version of the relevant parts of the Preparedness Framework along with a summary of evidence supporting the change (and evidence against). This case is then sent to SAG and processed according to the standard decision-making process described below." (pg. 23)

"Policy changes: Changes to this policy will be proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust. The current version of the RSP is accessible at www.anthropic.com/rsp. We will update the public version of the RSP before any changes take effect and record any differences from the prior draft in a change log." (pg. 12)

"Issues that we aim to address in future versions of the Framework include:

  • Greater precision in risk modeling: Given the nascency of the underlying science, there is significant room for improvement in understanding the risks posed by models in different domains, and refining our set of CCLs. We also intend to take steps to forecast the arrival of CCLs to inform our preparations.
  • [...]
  • Mitigation plans: Striking a balance between mitigating risks and fostering access and innovation is crucial, and requires consideration of factors like the context of model development, deployment, and productization. As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described.
  • Updated set of risks and mitigations: There may be additional risk domains and critical capabilities that fall into scope as AI capabilities improve and the external environment changes."

(pg. 6)

We aim to update the framework piecewise, as many individual components can be updated in isolation or with minimal impact on the rest of the framework. Below, we briefly describe the update process for each component of the safety framework:

  • Identified risks:
    • An identified risk vector will be removed only by a supermajority vote of the [...]
    • A new risk vector will be added through the same approval process. When a new risk vector is added, we will immediately task the teams responsible for risk assessments, risk thresholds, and mitigations with updating their procedures to cover it.
  • Risk assessments:
    • Our risk assessment methods are under active development, so most changes to them will not be documented in the safety framework itself.
    • Changes to assessment strategy (e.g. better capability elicitation techniques, changes to how mitigations are integrated in the testing process), assessment schedule, and third party assessment utilization will be documented in the safety framework.
    • Assessment categories in the safety framework will be added or removed at the request of the assessment team lead.
  • Risk thresholds: [...]
  • Risk mitigations:
    • Our security and safety mitigations will be continuously assessed by internal red-teaming and external vulnerability-reporting programs.
    • If our mitigations are found to be inadequate, we will pause deployment until an adequate set of mitigations can be found, implemented, and verified to reduce risk below desired thresholds, following the processes described above.
    • We will collaborate with trusted actors such as AI Safety Institutes and related institutions to identify and implement such mitigations as necessary.

All teams are expected to follow current best practices and scientific advances in their fields and to suggest changes to the safety framework as those practices evolve. Additionally, we commit to reviewing and updating the safety framework in response to all relevant national and international laws and agreements to which we are a party.

Commitment 6 (Governance)

Frontier AI Safety Commitments, Commitment 6

VI. Adhere to the commitments outlined in I-V, including by developing and continuously reviewing internal accountability and governance frameworks and assigning roles, responsibilities and sufficient resources to do so.

  1. Identify the key actors in the organization who are required for the safety framework to be implemented successfully. This involves, at a minimum:
    1. Actors involved in risk assessment.
    2. Actors involved in setting risk thresholds.
    3. Actors involved in identifying and implementing risk mitigations.
    4. Actors involved in making decisions about whether models are safe to develop or deploy.
    5. Actors involved in updating and maintaining the safety framework.
    6. Actors that ensure that the organization is adhering to the safety framework.
  2. Describe each of these groups and their responsibilities with regard to the commitments made within the safety framework.

"We also establish an operational structure to oversee our procedural commitments. These commitments aim to make sure that: (1) there is a dedicated team 'on the ground' focused on preparedness research and monitoring (Preparedness team), (2) there is an advisory group (Safety Advisory Group) that has a sufficient diversity of perspectives and technical expertise to provide nuanced input and recommendations, and (3) there is a final decision-maker (OpenAI Leadership, with the option for the OpenAI Board of Directors to overrule).

Parties in the Preparedness Framework operationalization process:

  • The Preparedness team conducts research, evaluations, monitoring, forecasting, and continuous updating of the Scorecard with input from teams that have relevant domain expertise.
  • The Safety Advisory Group (SAG), including the SAG Chair, provides a diversity of perspectives to evaluate the strength of evidence related to catastrophic risk and recommend appropriate actions. The SAG will strive to recommend mitigations that are as targeted and non-disruptive as possible while not compromising safety. In particular, we recognize that pausing deployment or development would be the last resort (but potentially necessary) option in these circumstances.
  • The OpenAI Leadership, i.e., the CEO or a person designated by them, serves as the default decision-maker on all decisions.
  • The OpenAI Board of Directors (BoD), as the ultimate governing body of OpenAI, will oversee OpenAI Leadership's implementation and decision-making pursuant to this Preparedness Framework. The BoD may review certain decisions taken and will receive appropriate documentation (i.e., without needing to proactively ask) to ensure the BOD is fully informed and able to fulfill its oversight role.

Process:

  • The Preparedness team is responsible for:
    • maintaining and updating the Scorecard, including designing and running evaluations to provide Scorecard inputs and collecting relevant information on monitored misuse, red-teaming, and intelligence
    • monitoring for unknown unknowns and making the case for inclusion in the Preparedness Framework of any new risk categories as they emerge
    • [...]
  • [...]"

(pg. 22-24)

"To facilitate the effective implementation of this policy across the company, we commit to the following:

  1. Responsible Scaling Officer: We will maintain the position of Responsible Scaling Officer, a designated member of staff who is responsible for reducing catastrophic risk, primarily by ensuring this policy is designed and implemented effectively. The Responsible Scaling Officer's duties will include (but are not limited to): (1) as needed, proposing updates to this policy to the Board of Directors; (2) approving relevant model training or deployment decisions based on capability and safeguard assessments; (3) reviewing major contracts (i.e., deployment partnerships) for consistency with this policy; (4) overseeing implementation of this policy, including the allocation of sufficient resources; (5) receiving and addressing reports of potential instances of noncompliance; (6) promptly notifying the Board of Directors of any cases of noncompliance that pose material risk; and (7) making judgment calls on policy interpretation and application.
  2. Readiness: We will develop internal safety procedures for incident scenarios. Such scenarios include (1) pausing training in response to reaching Capability Thresholds; (2) responding to a security incident involving model weights; and (3) responding to severe jailbreaks or vulnerabilities in deployed models, including restricting access in safety emergencies that cannot otherwise be mitigated. We will run exercises to ensure our readiness for incident scenarios.
  3. [...additional internal governance measures...]"

(pg. 11-12)

[no explicit descriptions of how the framework will be internally implemented]

[we cannot give an example for this commitment, as it is too dependent on the particular structure of your organization]

Commitment 7 (Transparency)

Step 1: Prepare Framework for Publication

Frontier AI Safety Commitments, Commitment 7

VII. Provide public transparency on the implementation of the above (I-VI), except insofar as doing so would increase risk or divulge sensitive commercial information to a degree disproportionate to the societal benefit...

  1. Step through the framework section-by-section.
  2. In each section, identify any text whose publication would increase risk or divulge sensitive commercial information to a degree disproportionate to the societal benefit.
    1. For example: details on specific security assessments and mitigations used, which could enable adversaries to better circumvent security systems.
    2. As a counter-example of information that is commercially valuable but still potentially worth sharing: the specific safety mitigations in use may constitute intellectual property, but sharing them could still yield substantial societal benefit, as it would enable outside researchers to study the methods and identify flaws and room for improvement.
  3. Produce a version of the framework without the identified information.
  4. Publish the sanitized framework.

"Internal visibility: The Preparedness Framework, reports and decisions will be documented and visible to the BoD and within OpenAI (with redactions as needed given internal compartmentalization of research work). This also includes any audit trails created from the below." (pg. 24)

[no mention of redactions, but this isn't a requirement: as the document is published, it satisfies this part of the commitment]

[no mention of redactions, but this isn't a requirement: as the document is published, it satisfies this part of the commitment]

[this commitment is not fulfilled by any section of the safety framework, but rather by the safety framework being published]

Step 2: Enable Trusted Access

Frontier AI Safety Commitments, Commitment 7

VII. [...] They should still share more detailed information which cannot be shared publicly with trusted actors, including their respective home governments or appointed body, as appropriate.

  1. As before, step through the framework and identify any text that cannot be shared even with highly trusted actors.
    1. For example: specific training details that have no bearing on the safety properties of a model.
  2. Produce a version of the framework without the identified information.
  3. Identify which actors are trusted to view the more complete safety framework.
  4. Affirm that trusted actors can gain access to this new version of the safety framework.

[no explicit descriptions of how trusted actors can access additional framework details]

"We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity." (pg. 12; footnote)

"U.S. Government notice: We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard." (pg. 13)

[no explicit descriptions of how trusted actors can access additional framework details]

While we have removed some information from the public version of this safety framework to protect our organization's interests, we continue to collaborate closely with our partner organizations and our home government in developing the framework. Our collaborators have access to a version of the safety framework that includes all safety-relevant details.

To request access to the safety framework with full information about our safety measures, contact sfcollaborators@example.com.

Commitment 8 (Feedback Opportunities)

Frontier AI Safety Commitments, Commitment 8

VIII. Explain how, if at all, external actors, such as governments, civil society, academics, and the public are involved in the process of assessing the risks of their AI models and systems, the adequacy of their safety framework (as described under I-VI), and their adherence to that framework.

  1. Create a way for external actors to give feedback and describe it in the public-facing safety framework.
  2. Assign personnel to read the feedback and send summaries of its key contents to the relevant people in the organization.

[no explicit descriptions of how external actors can give feedback]

"We actively welcome feedback on our policy and suggestions for improvement from other entities engaged in frontier AI risk evaluations or safety and security standards. To submit your feedback or suggestions, please contact us at rsp@anthropic.com." (pg. 2)

"Procedural compliance review: On approximately an annual basis, we will commission a third-party review that assesses whether we adhered to this policy's main procedural commitments (we expect to iterate on the exact list since this has not been done before for RSPs). This review will focus on procedural compliance, not substantive outcomes. We will also do such reviews internally on a more regular cadence." (pg. 13)

[no explicit descriptions of how external actors can give feedback]

If you have feedback on this safety framework, or wish to comment on how well we are adhering to the procedures and commitments laid out here, please contact us at safetyframework@example.com.