This document is primarily aimed at signatory organizations of the Frontier AI Safety Commitments, though we hope that all AI developers will find this guide useful.
The Frontier AI Safety Commitments, agreed at the AI Seoul Summit, aim to ensure that frontier AI developers (1) plan effectively to responsibly manage risks from powerful AI systems, (2) develop internal structures to hold themselves accountable for safe development and deployment, and (3) make their safety plans appropriately transparent to external actors. To this end, the 16 signatory companies agreed to publish a safety framework demonstrating how they will fulfill eight specific commitments, and to do so before the upcoming AI Summit in France.
In this document, we provide a step-by-step guide to making such a safety framework. We reference the following existing safety frameworks: OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework.
Commitment | Actions Needed |
---|---|
I. Risk Assessments | |
II. Risk Thresholds | |
III. Mitigations | |
IV. Mitigation Strategy | |
V. Continual Improvement | |
VI. Governance | |
VII. Transparency | |
VIII. Feedback Opportunities | |
I. Assess the risks posed by their frontier models or systems across the AI lifecycle, including before deploying that model or system, and, as appropriate, before and during training. Risk assessments should consider model capabilities and the context in which they are developed and deployed, as well as the efficacy of implemented mitigations to reduce the risks associated with their foreseeable use and misuse. They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments, and other bodies their governments deem appropriate.
Risk vectors are specific mechanisms by which AI systems could cause harm; they include both deliberate malicious action and unintended consequences. Each risk vector may require a unique approach to assessment, monitoring, and mitigation.
While examining additional risk vectors can reduce the risk of a model causing harm, it also increases costs: it requires additional (and potentially more complex) evaluations and more mitigation steps, and it raises the chance of delaying or stopping deployment. If your organization wants to consider a broader set of harms than those listed here, we recommend the AI Risk Repository, which attempts to catalog many more AI harms.
Risk Vectors | OpenAI | Anthropic | Google DeepMind |
---|---|---|---|
Model Theft | ✓ | ✓ | ✓ |
Autonomy | ✓ | ✓ | ✓ |
ML R&D | ✓ | ✓ | ✓ |
Cyber Offense | ✓ | ~ | ✓ |
Bioweapons | ✓ | ✓ | ✓ |
CBRN except Bio | ✓ | ✓ | ✗ |
Persuasion | ✓ | ✗ | ✗ |
Model Theft: Covers the organization's ability to protect model weights and other IP (training code, datasets, credentials, etc.) from outside actors who may attempt to steal them.
Autonomy: Covers the model's ability to operate without human intervention: pursuing long-term goals coherently as an agent, replicating itself onto new machines, moving its weights onto new servers, generating money to pay for its own server costs, etc.
ML R&D: Covers the model's ability to perform machine learning research with goals such as improving itself or installing backdoors in other models.
Cyber Offense: Covers the model's ability to assist in general cyberattacks, fully or partially automating the discovery and exploitation of vulnerabilities against hardened or unsecured systems.
Bioweapons: Covers the model's ability to aid either amateur or expert humans in creating biological threats, whether by recreating existing pathogens such as smallpox or by developing novel agents.
CBRN except Bio: Covers the model's ability to aid amateur or expert humans in developing chemical, radiological, or nuclear weaponry.
Persuasion: Covers the model's ability to manipulate human actions, for instance to execute scams, influence elections, or extract secrets.
Throughout our safety framework, we develop our risk assessments, thresholds, and mitigations with the goal of minimizing risk from an industry-standard set of domains. We consider the following risk vectors:
I. Assess the risks posed by their frontier models or systems across the AI lifecycle… Risk assessments should consider model capabilities and the context in which they are developed and deployed, as well as the efficacy of implemented mitigations to reduce the risks associated with their foreseeable use and misuse...
For each risk identified in Step 1:
"We will be building and continually improving suites of evaluations and other monitoring solutions [...] Importantly, we will also be forecasting the future development of risks, so that we can develop lead times on safety and security measures." (pg. 2)
"Our evaluations will thus include tests against these enhanced models to ensure we are testing against the “worst case” scenario we know of." (pg. 6)
"We want to ensure our understanding of pre-mitigation risk takes into account a model that is “worst known case” [...] for the given domain. To this end, for our evaluations, we will be running them not only on base models [...], but also on fine-tuned versions designed for the particular misuse vector without any mitigations in place." (pg. 13)
"To verify if mitigations have sufficiently and dependently reduced the resulting post-mitigation risk, we will also run evaluations on models after they have safety mitigations in place, again attempting to verify and test the possible “worst known case” scenario for these systems." (pg. 14)
"For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. To make the required showing, we will need to satisfy the following criteria:
(pg. 5-6)
"The capabilities of frontier models are tested periodically to check whether they are approaching a CCL. To do so, we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. (pg. 2)"
To assess the security of our model weights, we will:
To assess our model’s ability to act autonomously, we will:
To assess our model’s ability to contribute to AI R&D, we will: [...]
I. Assess the risks posed [...], including before deploying that model or system, and, as appropriate, before and during training.
"We will be running these evaluations continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough." (pg. 13)
"We will routinely test models to determine whether their capabilities fall sufficiently far below the Capability Thresholds such that we are confident that the ASL-2 Standard remains appropriate. We will first conduct preliminary assessments (on both new and existing models, as needed) to determine whether a more comprehensive evaluation is needed. The purpose of this preliminary assessment is to identify whether the model is notably more capable than the last model that underwent a comprehensive assessment.
The term "notably more capable" is operationalized as at least one of the following:
[Footnote:] 'Effective Compute' is a scaling-trend-based metric that accounts for both FLOPs and algorithmic improvements. An Effective Compute increase of K represents a performance improvement from a pretrained model on relevant task(s) equivalent to scaling up the baseline model's training compute by a factor of K. We plan to track Effective Compute during pretraining on a weighted aggregation of datasets relevant to our Capability Thresholds (e.g., coding and science). This is, however, an open research question, and we will explore different possible methods. More generally, the Effective Compute concept is fairly new, and we may replace it with another metric in a similar spirit in the future."
(pg. 4)
"We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL." (pg. 2)
We run our full set of risk assessments before deployment on any model trained with more than 10^25 FLOP. Additionally, if a model's projected training compute exceeds 5 * 10^25 FLOP, we will pause training at 10^25 FLOP, and at every subsequent 5x increase in effective compute, to run our risk assessments.
Before any training run projected to exceed 5 * 10^25 FLOP, we will conduct an internal, anonymous forecasting survey to estimate the chance that the training run alone produces risks exceeding our thresholds. If the surveyed probability of exceeding any threshold is above 10%, we will delay training.
After deployment, we will rerun our assessments every three months to account for progress from additional fine-tuning, tooling, or other improvements.
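To make the schedule above concrete, the checkpoints it implies can be computed directly from a run's projected compute. The following is a minimal sketch in Python; the function and variable names are illustrative, and only the numeric thresholds are taken from the example text above.

```python
# Minimal sketch of the assessment schedule described above. Function and
# variable names are illustrative; only the numeric thresholds come from
# the example policy text.

BASE_THRESHOLD_FLOP = 1e25   # assess any model trained beyond this before deployment
LARGE_RUN_FLOP = 5e25        # projected runs beyond this also pause during training
CHECKPOINT_FACTOR = 5        # pause at every 5x increase in effective compute


def in_training_checkpoints(projected_flop: float) -> list[float]:
    """Effective-compute points (in FLOP) at which training pauses for risk assessments."""
    checkpoints: list[float] = []
    if projected_flop > LARGE_RUN_FLOP:
        point = BASE_THRESHOLD_FLOP
        while point < projected_flop:
            checkpoints.append(point)
            point *= CHECKPOINT_FACTOR
    return checkpoints


if __name__ == "__main__":
    # A run projected at 2e26 FLOP pauses at 1e25 and 5e25 FLOP for in-training
    # assessments, with a further full assessment before deployment.
    print(in_training_checkpoints(2e26))  # -> [1e+25, 5e+25]
```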
I. [...] They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments, and other bodies their governments deem appropriate.
"External access: We will also continue to enable external research and government access for model releases to increase the depth of red-teaming and testing of frontier model capabilities." (pg. 25)
"Expert input: We will solicit input from external experts in relevant domains in the process of developing and conducting capability and safeguards assessments. We may also solicit external expert input prior to making final decisions on the capability and safeguards assessments." (pg. 13)
"Future Work
(pg. 6)
We are collaborating with a number of external organizations, including our home government's AI Safety Institute, to evaluate our model's autonomous, ML R&D, and cyber-offensive capabilities. All external risk assessments will be performed after the AI system is trained but before it is deployed.
We encourage additional organizations to collaborate with us to design effective evaluations, especially in evaluating our AI system's ability to create biohazards. Prospective collaborators should contact us at frontierriskassessments@example.com.
II. Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable. Assess whether these thresholds have been breached, including monitoring how close a model or system is to such a breach. [...] They should also be accompanied by an explanation of how thresholds were decided upon, and by specific examples of situations where the models or systems would pose intolerable risk.
Risk level definitions for Cyber Offense:
"Low: Model assistance for cyberattacks, if any, is meaningful only for non-programming uses, e.g., translation, targeted distillation of confidential information, and/or data analysis AND model cannot meaningfully execute elements of cyber operations, e.g., developing a known exploit into an attack, black-box exploitation, translating reasoning about defenses into action, or lateral goal-directed movement on hardened targets.
Rationale: Non-programming tasks represent a sizable bottleneck for threat actors; however, the skills involved are not terribly specialized and the benefit comes often from intersectional skills (i.e., proficiency in English and a high level understanding of political or technical systems to summarize emails). We exclude vulnerability discovery because the harm at this stage is not identification of individual vulnerabilities, but moreso the implementation and deployment of an exploit against a target.
[...additional threshold levels omitted...]
Critical: Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal.
Rationale: Finding and executing end-to-end exploits for all software would lead to catastrophe from unilateral actors, from hacking military or industrial systems as well as OpenAI infrastructure. Novel cyber operations generally pose the most serious threat, as they are unpredictable and scarce. They may involve, e.g., novel zero-days or methods of command-and-control."
(pg. 8; minor formatting changes; additional thresholds for different risk domains on pg. 9-12)
[no explicit descriptions of intolerable risk]
"Chemical, Biological, Radiological, and Nuclear (CBRN) weapons: The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons. We assess this by comparing what potential attackers could achieve with full model access versus 2023-level online resources, assuming they have funding and up to one year of time to invest, but no initial specialized expertise.
Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world's most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024. We roughly estimate that the 2018-2024 average scaleup was around 35x per year, so this would imply an actual or projected one-year scaleup of 35^2 = ~1000x.
Model Autonomy checkpoint: The ability to perform a wide range of advanced software engineering tasks autonomously that could be precursors to full autonomous replication or automated AI R&D, and that would take a domain expert human 2-8 hours to complete. We primarily view this level of model autonomy as a checkpoint on the way to managing the risks of robust, fully autonomous systems with capabilities that might include (a) automating and greatly accelerating research and development in AI development (b) generating their own revenue and using it to run copies of themselves in large-scale, hard-to-shut-down operations"
(pg. 16)
[no explicit explanations for why given capability thresholds were chosen]
[no explicit descriptions of intolerable risk]
Critical capability levels and rationales:
"Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents.
Rationale: A model at this capability level could, if misused, pose difficult-to-predict and large-magnitude risks. Its adaptability would enable harmful activity via many means, and its ability to act autonomously and expand its effective capacity means its activity could be scaled significantly without being hindered by resource constraints. If misused or supported by well-equipped bad actors, such activity may be especially difficult to constrain.
Bio amateur enablement level 1: Capable of significantly enabling a non-expert to develop known biothreats that could increase their ability to cause severe harm compared to other means.
Rationale: Many biothreats capable of causing significant amounts of harm are currently out of the reach of non-experts because of lack of knowledge about their potential for harm and the methods of their acquisition and misuse. An LLM that helps overcome these knowledge gaps, e.g. by suggesting plausible attack strategies or providing detailed instructions for the development of a bio agent, could significantly increase society's vulnerability to fatal attacks by malicious amateurs.
Bio expert enablement level 1: Capable of significantly enabling an expert (i.e. PhD or above) to develop novel biothreats that could result in an incident of high severity.
Rationale: A very small number of bio agents have the potential to cause harm of an exceptional magnitude. The discovery of enhancements to these agents, or of agents of comparable harmfulness, could increase the chances of a very severe bio attack or accident."
[thresholds and rationales for additional risk domains]
(pg. 5-6; minor formatting changes)
[no explicit descriptions of intolerable risk]
We consider any security level below SL-2 (from the Securing AI Model Weights report) to be intolerable. Because SL-1 and below correspond to organizations potentially compromisable by amateur attackers, we treat SL-2 or above as necessary to prevent solo amateurs and professionals alike from penetrating our systems. We will raise our security level further as our systems' capabilities grow.
In terms of the AI systems themselves, we consider the system to be unacceptably dangerous if it:
Some example outcomes we wish to avert by setting these risk thresholds:
II. [...] These thresholds should be defined with input from trusted actors, including organisations’ respective home governments as appropriate. They should align with relevant international agreements to which their home governments are party.
"Overall, our decision to prioritize the capabilities in the two tables above is based on commissioned research reports, discussions with domain experts, input from expert forecasters, public research, conversations with other industry actors through the Frontier Model Forum, and internal discussions." (pg. 4)
We have set these thresholds following best practices recommended by our home government, taking into account all feedback it has given in the process.
Additionally, we incorporated feedback from our shareholders, employee surveys, and a survey of the general public.
To the best of our knowledge, these risk thresholds align with all present international agreements in our country.
III. Articulate how risk mitigations will be identified and implemented to keep risks within defined thresholds, including safety and security-related risk mitigations such as modifying system behaviours and implementing robust security controls for unreleased model weights.
Ideally, a finished framework would map each identified risk vector to the mitigations deployed to keep it in check.
"Our mitigation strategy will involve both containment measures, which help reduce risks related to possession of a frontier model, as well as deployment mitigations, which help reduce risks from active use of a frontier model. As a result, these mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners." (pg. 14)
"This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team.
This might require:
(pg. 20)
[no explicit descriptions of how risk mitigations are identified/implemented]
"When a model must meet the ASL-3 Deployment Standard, we will evaluate whether the measures we have implemented make us robust to persistent attempts to misuse the capability in question. To make the required showing, we will need to satisfy the following criteria:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers' attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
To make the required showing, we will need to satisfy the following criteria:
(pg. 7-9)
[no explicit descriptions of how risk mitigations are identified/implemented]
"Security Mitigations
Deployment Mitigations
(pg. 3-4)
[no explicit descriptions of how risk mitigations are identified/implemented]
For security-related mitigations, we start from the baseline for SL-2 (or higher, as necessary) as defined in the Securing AI Model Weights report. For any model expected to require a given set of measures, all of those measures will be implemented before its training begins.
Additional security measures may be suggested by any member of the security team, or by any member of the organization if seconded by a member of the security team. A subset of team members will be tasked with staying up to date with developments in cybersecurity and physical security. We expect to identify new security measures primarily through our periodic red-teaming security assessments. Identified measures will be evaluated for feasibility and marginal value before being implemented.
For safety-related mitigations, we start from a standard set of safety mitigations:
We then follow a similar procedure as with security: any member of the safety team may propose additional mitigations, a subset of team members will be tasked with keeping up with developments in the field, and any identified measures will be evaluated for feasibility and value before implementation.
All mitigations will be implemented by the corresponding team. Security mitigations will be unit-tested to the extent possible, and safety mitigations will be verified to work on toy models with at least 1/100th the parameter count of the model on which they are intended to be deployed.
Risk assessment details will not be shared with members of risk-mitigation teams, to preserve the validity of the assessments. Mitigations will be evaluated by whether they bring risk assessment results below the pre-defined risk thresholds.
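As an illustration, the evaluation criterion above amounts to a simple comparison of post-mitigation assessment scores against the pre-defined thresholds. The sketch below is hypothetical: the vector names, score scale, and threshold values are placeholders rather than values from any real framework.

```python
# Illustrative check of mitigation effectiveness: mitigations are judged
# sufficient only if every risk vector's post-mitigation assessment score
# falls below its pre-defined threshold. All names and numbers below are
# hypothetical placeholders.

RISK_THRESHOLDS = {   # maximum tolerable post-mitigation score per vector
    "cyber_offense": 0.5,
    "bioweapons": 0.3,
    "autonomy": 0.4,
}


def mitigations_sufficient(post_mitigation_scores: dict[str, float]) -> bool:
    """True only if every risk vector scores below its threshold.

    Vectors with no recorded score are treated as failures, which keeps the
    check conservative when an assessment is missing.
    """
    return all(
        post_mitigation_scores.get(vector, float("inf")) < limit
        for vector, limit in RISK_THRESHOLDS.items()
    )


# Example: {"cyber_offense": 0.2, "bioweapons": 0.1, "autonomy": 0.6} fails,
# because the autonomy score (0.6) is not below its 0.4 threshold.
```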
IV. Set out explicit processes they intend to follow if their model or system poses risks that meet or exceed the pre-defined thresholds...
"If we reach (or are forecasted to reach) at least "high" pre-mitigation risk in any of the considered categories: we will ensure that our security is hardened in a way that is designed to prevent our mitigations and controls from being circumvented via exfiltration (by the time we hit "high" pre-mitigation risk). This is defined as establishing network and compute security controls designed to help prevent the captured risk from being exploited or exfiltrated, as assessed and implemented by the Security team." (pg. 20)
"If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows: [...]
If, however, we determine we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will (1) upgrade to the ASL-3 Required Safeguards (see Section 4) and (2) conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary (see Section 5)." (pg. 6-7)
"If, after the evaluations above, we determine that we have met the ASL-3 Required Safeguards, then we may proceed with deploying and training models above the Capability Threshold, provided we have also conducted a follow-up capability assessment. The process for determining whether we have met the ASL-3 Required Safeguards is as follows: [...]" (pg. 9-10)
"When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." (pg. 2)
If an AI model, before any safety mitigations are applied, triggers our intolerable-risk thresholds on any risk vector, we will pause training until security measures sufficient for SL-4 have been implemented.
If, upon finishing training including all planned safety mitigations, a model still triggers intolerable-risk thresholds, we will delay deployment of the model and subject it to more aggressive assessment. All mitigations applied will be analyzed to determine if they failed or were inadequate at preventing the undesired behaviors. We will design new mitigations as appropriate, and, if they robustly reduce risk below the target thresholds, add them to our set of standard mitigations.
Additionally, we raise our standard security level to SL-3 if:
IV. [...] This includes processes to further develop and deploy their systems and models only if they assess that residual risks would stay below the thresholds. In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.
"Only models with a post-mitigation score of 'medium' or below can be deployed. In other words, if we reach (or are forecasted to reach) at least 'high' pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit 'high' pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to 'medium' level. (Note that a potentially effective mitigation in this context could be restricting deployment to trusted parties.)
Only models with a post-mitigation score of 'high' or below can be developed further. In other words, if we reach (or are forecasted to reach) 'critical' pre-mitigation risk along any risk category, we commit to ensuring there are sufficient mitigations in place for that model (by the time we reach that risk level in our capability development, let alone deployment) for the overall post-mitigation risk to be back at most to 'high' level. Note that this should not preclude safety-enhancing development. We would also focus our efforts as a company towards solving these safety challenges and only continue with capabilities-enhancing development if we can reasonably assure ourselves (via the operationalization processes) that it is safe to do so.
Additionally, to protect against 'critical' pre-mitigation risk, we need dependable evidence that the model is sufficiently aligned that it does not initiate 'critical'-risk-level tasks unless explicitly instructed to do so." (pg. 21)
"In any scenario where we determine that a model requires ASL-3 Required Safeguards but we are unable to implement them immediately, we will act promptly to reduce interim risk to acceptable levels until the ASL-3 Required Safeguards are in place:
(pg. 10-11)
"A model may reach evaluation thresholds before mitigations at appropriate levels are ready. If this happens, we would put on hold further deployment or development, or implement additional protocols (such as the implementation of more precise early warning evaluations for a given CCL) to ensure models will not reach CCLs without appropriate security mitigations, and that models with CCLs will not be deployed without appropriate deployment mitigations." (pg. 2)
If a model demonstrates a high degree of capability on any risk vector and, specifically, a modest degree of autonomy, we will stop training until suitable in-training mitigations can be developed and applied. Furthermore, we will not deploy a model that triggers our intolerable-risk thresholds unless our red team, given white-box access to the model, is completely unable to elicit the model's capabilities in that domain.
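Read as decision logic, the pause and deployment rules above reduce to two simple gates. The sketch below merely restates the prose in code form; the boolean inputs are stand-ins for the outputs of the risk assessments and red-teaming described earlier, not a real interface.

```python
# Sketch of the pause/deployment gates described above. The boolean inputs
# are stand-ins for risk-assessment and red-teaming outcomes; this is an
# illustration of the policy logic, not an implementation.

def should_pause_training(high_capability_on_any_vector: bool,
                          modest_autonomy: bool) -> bool:
    """Pause training when a model shows high capability on any risk vector
    and at least a modest degree of autonomy."""
    return high_capability_on_any_vector and modest_autonomy


def may_deploy(triggers_intolerable_threshold: bool,
               red_team_elicited_capability: bool) -> bool:
    """Block deployment of any model that triggers an intolerable-risk
    threshold, unless the red team was completely unable to elicit the
    capability even with white-box access."""
    if not triggers_intolerable_threshold:
        return True
    return not red_team_elicited_capability
```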
V. Continually invest in advancing their ability to implement commitments i-iv, including risk assessment and identification, thresholds definition, and mitigation effectiveness. This should include processes to assess and monitor the adequacy of mitigations, and identify additional mitigations as needed to ensure risks remain below the pre-defined thresholds. They will contribute to and take into account emerging best practice, international standards, and science on AI risk identification, assessment, and mitigation.
"The Preparedness team is responsible for:
If the Preparedness or any other team determines that any changes to the Preparedness Framework are necessary, it will include a case for this change in its report. The case will consist of the suggested new version of the relevant parts of the Preparedness Framework along with a summary of evidence supporting the change (and evidence against). This case is then sent to SAG and processed according to the standard decision-making process described below." (pg. 23)
"Policy changes: Changes to this policy will be proposed by the CEO and the Responsible Scaling Officer and approved by the Board of Directors, in consultation with the Long-Term Benefit Trust. The current version of the RSP is accessible at www.anthropic.com/rsp. We will update the public version of the RSP before any changes take effect and record any differences from the prior draft in a change log." (pg. 12)
"Issues that we aim to address in future versions of the Framework include:
(pg. 6)
We aim to update the framework piecewise, as many individual components can be updated in isolation, or with minimal impact on the remainder of the framework. We will briefly describe update processes for each component of the safety framework:
All teams are expected to track current best practices and scientific advances in their fields and to suggest changes to the safety framework as those best practices evolve. Additionally, we commit to reviewing and updating the safety framework after any change to relevant national or international laws or agreements to which we are a party.
VI. Adhere to the commitments outlined in I-V, including by developing and continuously reviewing internal accountability and governance frameworks and assigning roles, responsibilities and sufficient resources to do so.
"We also establish an operational structure to oversee our procedural commitments. These commitments aim to make sure that: (1) there is a dedicated team 'on the ground' focused on preparedness research and monitoring (Preparedness team), (2) there is an advisory group (Safety Advisory Group) that has a sufficient diversity of perspectives and technical expertise to provide nuanced input and recommendations, and (3) there is a final decision-maker (OpenAI Leadership, with the option for the OpenAI Board of Directors to overrule).
Parties in the Preparedness Framework operationalization process:
Process:
(pg. 22-24)
"To facilitate the effective implementation of this policy across the company, we commit to the following:
(pg. 11-12)
[no explicit descriptions of how the framework will be internally implemented]
[we cannot give an example for this commitment, as it is too dependent on the particular structure of your organization]
VII. Provide public transparency on the implementation of the above (I-VI), except insofar as doing so would increase risk or divulge sensitive commercial information to a degree disproportionate to the societal benefit...
"Internal visibility: The Preparedness Framework, reports and decisions will be documented and visible to the BoD and within OpenAI (with redactions as needed given internal compartmentalization of research work). This also includes any audit trails created from the below." (pg. 24)
[no mention of redactions, but this isn't a requirement: as the document is published, it satisfies this part of the commitment]
[no mention of redactions, but this isn't a requirement: as the document is published, it satisfies this part of the commitment]
[this commitment is not fulfilled by any section of the safety framework, but rather by the safety framework being published]
VII. [...] They should still share more detailed information which cannot be shared publicly with trusted actors, including their respective home governments or appointed body, as appropriate.
[no explicit descriptions of how trusted actors can access additional framework details]
"We currently expect that if we do not deploy the model publicly and instead proceed with training or limited deployments, we will likely instead share evaluation details with a relevant U.S. Government entity." (pg. 12; footnote)
"U.S. Government notice: We will notify a relevant U.S. Government entity if a model requires stronger protections than the ASL-2 Standard." (pg. 13)
[no explicit descriptions of how trusted actors can access additional framework details]
While we have removed some information from the public version of this safety framework to protect our organization's interests, we continue to collaborate closely with our partner organizations and our country's government in developing the framework. Our collaborators have access to a version of the safety framework with all safety-relevant details included.
To request access to the safety framework with full information about our safety measures, contact sfcollaborators@example.com.
VIII. Explain how, if at all, external actors, such as governments, civil society, academics, and the public are involved in the process of assessing the risks of their AI models and systems, the adequacy of their safety framework (as described under I-VI), and their adherence to that framework.
"We actively welcome feedback on our policy and suggestions for improvement from other entities engaged in frontier AI risk evaluations or safety and security standards. To submit your feedback or suggestions, please contact us at rsp@anthropic.com." (pg. 2)
"Procedural compliance review: On approximately an annual basis, we will commission a third-party review that assesses whether we adhered to this policy's main procedural commitments (we expect to iterate on the exact list since this has not been done before for RSPs). This review will focus on procedural compliance, not substantive outcomes. We will also do such reviews internally on a more regular cadence." (pg. 13)
If you have feedback on this safety framework or want to comment on how well we're sticking to the procedures and commitments we've laid out here, please contact us at safetyframework@example.com.