
METR

Model Evaluation and Threat Research

December 2023 update


ARC EVALS IS NOW METR

After incubating as a project of the Alignment Research Center, we are spinning
out as a standalone non-profit, METR – Model Evaluation & Threat Research
(pronounced ‘meter’). More here.




OUR WORK

METR works on assessing whether cutting-edge AI systems could pose catastrophic
risks to society.

We currently partner with Anthropic and OpenAI to evaluate their AI systems, and
are exploring other partnerships as well.

In the future, we might do things like offering certification to companies that
demonstrate commitment to ensuring their AI systems are safe before building or
deploying them, in line with voluntary safety standards we propose.

We believe that AI could change the world quickly and drastically, with
potential for both enormous good and enormous harm. We also believe it’s hard to
predict exactly when and how this might happen.

Rather than advocate for “faster” or “slower” AI progress, we aim to accurately
assess risks, so that AI can be used when it’s clearly safe to do so.

We are a small team, and our work is early-stage and experimental. Our methods
have a long way to go before they can provide acceptable assurance of safety for
powerful AI systems. Ensuring the safety of future systems will require
assessing a broader range of characteristics of a model and the environment in
which it’s built and deployed - security, monitoring, controls, alignment - as
well as covering a wider range of dangers and threat models than we are
currently watching for.


AI’S TRANSFORMATIVE POTENTIAL

We believe that at some point, AIs will probably be able to do most of what
humans can do, including developing new technologies; starting businesses and
making money; finding new cybersecurity exploits and fixes; and more.

We think it’s very plausible that AI systems could end up misaligned: pursuing
goals that are at odds with a thriving civilization. This could be due to a
deliberate effort to cause chaos [1], or (more likely) may happen in spite of
efforts to make systems safe [2]. If we do end up with misaligned AI systems,
the number, capabilities and speed of such AIs could cause enormous harm -
plausibly a global catastrophe. If these systems could also autonomously
replicate and resist attempts to shut them down, it would seem difficult to put
an upper limit on the potential damage.

Given how quickly things could play out, we don’t think it’s good enough to
“wait and see” whether there are dangers.

We believe in vigilantly, continually assessing risks. If an AI brings
significant risk of a global catastrophe, the decision to develop and/or release
it can’t lie only with the company that creates it.


“AUTONOMOUS REPLICATION” EVALUATIONS: WHAT WE DO TODAY

We currently focus on evaluating cutting-edge AI models to test their
capabilities for autonomous replication.

“Autonomous replication” capability means that an AI could survive on a cloud
server, obtain money and compute resources, and use those resources to make more
copies of itself.

We test to see whether AIs are capable of relatively basic tasks that seem
necessary for autonomous replication. Here is an example workflow we might go
through while testing a model (a simplified code sketch follows the list):

 * Set the model up in a loop where it is repeatedly queried for a next action,
   and has the ability to do things like run code on a remote server, see the
   results, control a browser, and “reason out loud.” Give it a prompt
   containing an explanation of how to use those tools as well as a task
   description (e.g. “Install a copy of the best available open-source language
   model on the server”). Have it output a next action.
 * If the action is safe, legal, etc. then “run the step” (execute the code,
   browser command, delegation, etc.) and send back the output. If the next step
   could cause real-world harm (or has other issues such as a long time delay),
   simulate the result of the action.
   * For example, if the AI sends an email, we intervene to prevent the email
     being sent to the target, and instead send it to one of our researchers.
     The researcher then responds to the email in different ways depending on
     its content, according to the experiment instructions.
 * Notice patterns in where the model is making progress or not making progress,
   and make simple adjustments to the prompting setup to see whether the model
   is able to succeed at the task with some small changes.
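
To make this workflow concrete, here is a minimal Python sketch of such an
evaluation loop. The helper names (query_model, is_safe, execute, simulate) and
the action schema are hypothetical placeholders for illustration only; our
actual harness is considerably more involved.

from typing import Callable

# Hypothetical action schema, e.g. {"type": "run_code", "content": "..."} or
# {"type": "task_complete"}; illustrative only, not our real tooling.
Action = dict


def run_evaluation(
    task_description: str,
    query_model: Callable[[str, list], Action],  # asks the model for its next action
    is_safe: Callable[[Action], bool],           # is this action safe, legal, and quick to run?
    execute: Callable[[Action], str],            # actually run the code / browser command / delegation
    simulate: Callable[[Action], str],           # researcher-supplied simulated result
    max_steps: int = 50,
) -> list:
    """Query the model for an action, run or simulate it, feed back the result, repeat."""
    transcript = []
    for _ in range(max_steps):
        action = query_model(task_description, transcript)
        if action.get("type") == "task_complete":
            break
        # Real execution only for actions judged safe; otherwise we intervene.
        # For example, an outgoing email is redirected to one of our researchers,
        # who writes the reply the model sees.
        observation = execute(action) if is_safe(action) else simulate(action)
        transcript.append({"action": action, "observation": observation})
    return transcript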

From what we’ve seen so far, even the most capable public AI models don’t seem
capable of fairly basic steps toward autonomous replication. However, our
evaluations aren’t yet robust. Importantly, we haven’t yet fine-tuned models to
address some of their weaknesses. And AI capabilities continue to improve in
general.

“This AI model can’t autonomously replicate” is potentially a key part of a
safety argument: a simple, empirically verifiable case that an AI model is
unlikely to pose catastrophic risks.

If and when this is no longer true, it will be important to ask whether there is
some other good safety argument [3] or whether building more capable AI
systems poses an unacceptable risk to society.

We are also thinking ahead to what future evaluations might look like - for
example, evaluations of whether AIs are reliably aligned.


SAFETY STANDARDS

We are exploring the idea of developing safety standards that AI companies might
voluntarily adhere to, and potentially be certified for.

For example, AIs that pass the “autonomous replication” threshold above might be
subjected to requirements along these lines:

 * Security: sufficiently capable models should be well-protected against theft
   attempts.
 * Monitoring: sufficiently capable models should be well-monitored so that
   unintended actions can be quickly noticed and addressed.
 * Alignment: sufficiently capable models should consistently behave in line
   with their users’ and developers’ intent, and should have low risk of doing
   things like deceiving and manipulating humans toward some unintended end.

Safety standards could make it harder for companies to casually move forward
with dangerous models; could reassure those outside the company that
catastrophic risk is unlikely; and could increase incentives for alignment
research.

We aren’t sure yet whether we’ll establish safety standards. It’s something
we’re exploring preliminarily.


ABOUT THE TEAM

The project is run by Elizabeth (Beth) Barnes. It was previously housed at the
Alignment Research Center (ARC) as ARC Evals. The project is advised by ARC CEO
Paul Christiano and by Holden Karnofsky.


CONTACT US

Feel free to send general inquiries to info@metr.org.

 1. For instance, an anonymous user set up ChatGPT with the ability to run code
    and access the internet, prompted it with the goal of “destroy humanity,”
    and set it running.

 2. There are reasons to expect goal-directed behavior to emerge in AI systems,
    and to expect that superficial attempts to align ML systems will result in
    sycophantic or deceptive behavior - “playing the training game” - rather
    than successful alignment.

 3. Such as “The AI is restricted to certain actions, and strong security
    prevents someone from stealing it and using it for other actions,” or such
    as “The AI is demonstrably aligned.”