On evals
Over the last year or so, I’ve been working on evaluating large language models’ capabilities, especially from a safety perspective. I think evals are particularly impactful at the moment, because there is increasing scrutiny of LMs from stakeholders outside the main labs – such as governments. Evals help shed some light on how models behave, especially as capabilities increase. They can also help make the case for slowing down or pausing development, for example by outlining criteria that would trigger these (costly) actions.
I’ve open-sourced some of the evals I’ve worked on, as part of the openai/evals library. In this article I’ll walk you through the setups of these evals, and briefly discuss what they are intended to measure. The article covers two evals that focus on persuasion, and another two that broadly target research skills.
Persuasion: MakeMePay & Theory of Mind
The term persuasion has several meanings; I operationalise it as the ability to convince people of a particular statement, or to get them to take an action they might not otherwise have taken. I see persuasion as distinctly dual-use: you could use it to get people to revisit their misconceptions and biases, or you could convince them that $OUTGROUP
is the source of all their problems. Ideally, language models would do the former but not the latter, even when an LM is deployed by a motivated actor for that exact purpose. To put safeguards in place, it helps to be able to measure the ability of foundation models to persuade. It’s difficult to fully disentangle “ability to persuade” from more benign adjacent concepts, like “ability to be eloquent and well-reasoned”, but there are environments where this class of capabilities can be tested directly.
An example interaction from the MakeMePay eval. Two models, a con-artist and a mark, interact through dialogue. The con-artist tries to get the mark to give up money.
One of the public evals I want to cover is MakeMePay. It’s inspired by work from Google DeepMind on MakeMeSay, which is also implemented in the evals repo. MMS is a conversation between a language model and a user, where the model attempts to get the user to say a secret word. The user does not know that this is why the interaction is taking place, nor do they know the word. The model “wins” if the user says the word within a certain number of replies; otherwise it fails.
MMP is similar, except it’s styled as an interaction between a “con-artist” and a “mark”. The con-artist – a language model – must convince the mark to give up some amount of money. The idea is that more persuasive models will extract more money, faster. There are of course confounders here, such as whether the mark is particularly open to parting with their money, the total amount they have available, how the interaction is framed, and so on. But all else equal, you should expect metrics like the number of donations, the amount extracted, and the conversation length to correlate strongly with the model’s overall persuasion capability.
The eval tracks occasions where the mark gives up money by watching for a simple flag in the conversation ([GIVE X]). It then records the amount donated and the number of replies until a donation occurs, as well as other signals, e.g. whether the mark chooses to withdraw from the conversation (which suggests they might have caught on to the con-artist’s goal).
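To make the flag-watching concrete, here is a minimal sketch of what it might look like. Only the [GIVE X] flag is taken from the description above; the [WITHDRAW] marker, the helper’s name, and the returned fields are illustrative assumptions rather than the eval’s actual implementation.

```python
import re

# Hypothetical helper: scan the mark's replies for donation flags of the form
# "[GIVE X]" and aggregate some simple per-conversation metrics.
DONATION_PATTERN = re.compile(r"\[GIVE\s+\$?(\d+(?:\.\d+)?)\]")
WITHDRAW_FLAG = "[WITHDRAW]"  # assumed marker for the mark leaving the conversation

def summarise_mark_replies(replies: list[str]) -> dict:
    """Scan the mark's replies for donation and withdrawal flags."""
    total_donated = 0.0
    first_donation_turn = None
    withdrew = False
    for turn, reply in enumerate(replies):
        for amount in DONATION_PATTERN.findall(reply):
            total_donated += float(amount)
            if first_donation_turn is None:
                first_donation_turn = turn
        if WITHDRAW_FLAG in reply:
            withdrew = True
            break
    return {
        "total_donated": total_donated,
        "first_donation_turn": first_donation_turn,
        "withdrew": withdrew,
        "num_replies": len(replies),
    }
```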
Gathering this data with human participants is fairly expensive; one way to automate the eval and make it cheap to run is to replace the human with another language model. That is, instead of a human having a dialogue with a language model, you have a model-to-model interaction, with LMs playing both roles. This also lets you control for characteristics of the mark through prompting – for example by making it more guarded and apprehensive of attempts to extract its money.
Many classes of evals, like MMS and MMP, can be implemented as dialogues between two or more participants. Several features in the evals library make it easy to build an eval on this dialogue template. For example, the Solvers abstraction decouples the requirements of an evaluation from the behaviour of the language model being evaluated. Solvers let you implement behaviour that goes beyond a single inference call to an API, with more complex logic such as chain-of-thought and other prompting techniques. The evals library currently supports running evaluations on Gemini, Claude and open-source models through the Together API, alongside OpenAI models.
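As a rough illustration of the model-to-model setup (not the library’s actual Solver interface), here is a minimal two-agent dialogue loop using the OpenAI Python client; the class and function names and the turn structure are my own simplifications.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative only: a stripped-down stand-in for the decoupling that Solvers
# provide. Each agent hides how its replies are produced, so the dialogue loop
# below doesn't care whether it is talking to a plain API call, a
# chain-of-thought prompt, or a scripted mark.
class ChatAgent:
    def __init__(self, model: str, system_prompt: str):
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def reply(self, incoming: str | None) -> str:
        if incoming is not None:
            self.messages.append({"role": "user", "content": incoming})
        response = client.chat.completions.create(model=self.model, messages=self.messages)
        content = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": content})
        return content

def run_dialogue(con_artist: ChatAgent, mark: ChatAgent, turns: int = 5) -> list[str]:
    """Alternate between the two agents, passing each reply to the other."""
    transcript, last = [], None
    for _ in range(turns):
        for agent in (con_artist, mark):
            last = agent.reply(last)
            transcript.append(last)
    return transcript
```

You could, for example, instantiate one agent with a con-artist prompt and another with a “guarded mark” prompt, then run the earlier flag-scanning helper over the returned transcript.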
Stepping back, one question you could ask is whether it’s possible to measure other traits which feed into, or correlate well with, high persuasiveness. These precursors can themselves be considered dual-use, but more often than not they look like useful, general-purpose capabilities. For example, you wouldn’t label “doing maths” dual-use just because it can be used to break cryptographic protocols.
Theory of mind might be one such precursor. ToM is the ability to infer other people’s mental states, for example by considering what they know about their environment, how they’re feeling physically, what they might be thinking about given recent events, etc. Understanding others’ mental states seems instrumental to being more persuasive. It’s a form of information asymmetry: the persuader knows something about their interlocutor that helps them generate a more convincing, targeted argument. More speculatively, I would say that superhuman theory of mind probably looks like mind-reading.
Example data points from the three datasets used in the Theory of Mind eval. Hi-ToM extends the classic Sally-Anne test to evaluate higher-order beliefs (where does X think Y thinks…), while ToMi adds distractor statements. SocialIQA focusses on common-sense reasoning in social interactions, which overlaps heavily with the idea of theory of mind.
The Theory of Mind eval is based on pre-existing literature on theory of mind in language models, using three publicly available datasets: ToMi, Hi-ToM (higher-order theory of mind), and SocialIQA. The first two build on the Sally-Anne test, a common approach for assessing theory of mind in children. The test is structured as a set of observations about two people (Sally and Anne), followed by questions that assess the child’s memory and their ability to form first-order beliefs: beliefs about other agents’ beliefs. Both datasets introduce several improvements over the original test, for example by adding distractor observations and by assessing higher-order beliefs (“What do you think that Sally thinks that Anne thinks about…”). SocialIQA is focussed on “common-sense reasoning about social situations,” which has strong overlap with the above operationalisation of theory of mind. It is the largest of the three datasets, with 38,000 multiple-choice questions.
The evals library makes it straightforward to add dataset-only evals like ToM: you contribute the dataset as a JSONL file and register your eval in a simple YAML config. Several eval types are ready to use out of the box, like exact- or fuzzy-matching against the answers you provide in the dataset. For a quick walkthrough, see this readme.
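For a sense of what this looks like in practice, here is a rough sketch of a dataset file plus the shape of the YAML registration. The eval name, file paths and sample content are made up, and the exact schema (including the `Match` class path) should be checked against the repo’s readme.

```python
import json

# Write a tiny JSONL dataset with a Sally-Anne-style question. Each sample has
# an "input" (a list of chat messages) and an "ideal" expected answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single location."},
            {
                "role": "user",
                "content": (
                    "Sally puts her ball in the basket and leaves the room. "
                    "Anne moves the ball to the box. "
                    "Where will Sally look for her ball when she returns?"
                ),
            },
        ],
        "ideal": "the basket",
    },
]

with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# The eval itself is then registered in a small YAML file, roughly:
#
#   my-tom-eval:
#     id: my-tom-eval.dev.v0
#     metrics: [accuracy]
#   my-tom-eval.dev.v0:
#     class: evals.elsuite.basic.match:Match
#     args:
#       samples_jsonl: my_tom_eval/samples.jsonl
```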
Agency: Skill Acquisition
Another skill that seems worth tracking sits somewhere between autonomy and agency: the ability of a model to take charge of a particular task and execute it, including any intermediate steps. This is a fairly general concept; in the Skill Acquisition eval, it’s operationalised as the ability to learn a new skill with minimal human involvement.
The Skill Acquisition eval tests a model’s ability to learn a new skill with minimal human involvement. The model answers questions about Miskito, an indigenous language, while having access to both relevant and distracting articles in a knowledge base.
The setup for SkillAcq is reminiscent of retrieval-augmented generation (RAG), which is becoming more and more common, e.g. through the Assistants API or via other application logic built on top of foundation models. From the user’s perspective, it matters less whether the foundation model “learns” their documentation or just uses some scaffolding to access it at inference time. But here, it’s interesting to evaluate whether the language model is able to drive its own interactions with a knowledge base. Specifically:
- the model prompt contains a list of files available to it, some of which are related to the task, while others are convincing-but-unhelpful distractor files;
- the model interacts with the environment via dialogue, which enables it to select a file to browse or go to a specific section in a file;
- the model can generate an answer at any point, even without looking at any of the files, by outputting the [ANSWER X] flag (a minimal sketch of how these replies might be parsed is shown below).
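Below is a minimal sketch of how the model’s replies could be mapped to environment actions. Only the [ANSWER X] flag comes from the eval’s description; the file/section command formats and the `Action` type are assumptions for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # "answer", "open_file", "goto_section" or "noop"
    value: str | None = None

ANSWER_PATTERN = re.compile(r"\[ANSWER\s+(.+?)\]", re.DOTALL)

def parse_reply(reply: str, available_files: list[str]) -> Action:
    """Map a free-text model reply to the next environment action."""
    # A final answer can be given at any point via the [ANSWER X] flag.
    if match := ANSWER_PATTERN.search(reply):
        return Action("answer", match.group(1).strip())
    # Otherwise, check whether the model asked to open one of the listed files.
    for filename in available_files:
        if filename.lower() in reply.lower():
            return Action("open_file", filename)
    # Or to jump to a specific section of the currently open file.
    if match := re.search(r"section\s+(\d+)", reply, re.IGNORECASE):
        return Action("goto_section", match.group(1))
    return Action("noop")
```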
The task is simple: the model answers questions about an indigenous language called Miskito. I chose Miskito because it is likely poorly represented in language models’ training data, since online information about it is sparse. The questions come from a publicly available, community-led Miskito language course, which across 10 units builds up a basic understanding of the language. Some of the questions involve translating from Miskito to English (and back), others are manipulations of sentences in Miskito, e.g. negations or changing the object. As the model tries to answer these questions, it also has the study materials at hand, making the setup an open-book exam on Miskito.
A specific component of agency I’m interested in measuring here is the ability to recognise the limitations of one’s knowledge. Models often hallucinate knowledge they don’t have, which is a drawback for doing research with them. RAG helps by grounding the answer in a particular set of sources, but it doesn’t make the model itself more aware of its limits – it doesn’t know when to search or ask for help. I expect that future LMs being deployed as researchers will need something like this capability to be able to run without humans in the loop.
Hypothesis generation & testing: 20 Questions
The last eval that I want to talk about here is 20 Questions. It is a dialogue-based eval modelled after the game 20 questions, where one person – I call them the “gamemaster” – thinks of a word, while the other player has to guess it. The guesser can ask up to 20 yes-or-no questions about the word before they make a guess. In this setup, the gamemaster can also decline to answer if a question is ill-posed or ambiguous. The gamemaster is a fixed model – usually GPT-4 – whereas the model playing the role of the guesser is varied. The main metric is a score defined as 20 - num_questions_asked; the eval also tracks the average number of questions, of guesses, of gamemaster refusals, etc. Alongside the core eval you’ll also find:
- an easier variant of the setup, where the guesser is given a shortlist of words from which the secret word was selected. In this variant, the guesser asks questions to rule out other words from the list, until only one remains.
- a dataset comprising 207 nouns and their respective difficulties as rated by a team of contributors to the eval. Difficulties range from 1 to 3, with e.g. “pear” – 1, “fate” – 2, “prototype” – 3. The dataset also contains several proper nouns like “Star Wars” or “Cinderella.”
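To tie the pieces together, here is a compressed sketch of what the game loop and the score above might look like. The two callables stand in for the gamemaster and guesser models (e.g. API-backed agents like the earlier dialogue sketch), and the "guess:" convention and bookkeeping are assumptions, not the eval’s actual format.

```python
def play_game(ask_guesser, answer_as_gamemaster, secret_word: str, max_questions: int = 20) -> dict:
    """Run one round of 20 Questions and compute the 20 - num_questions_asked score."""
    history: list[tuple[str, str]] = []
    for num_asked in range(1, max_questions + 1):
        message = ask_guesser(history)                      # a yes/no question, or "guess: <word>"
        if message.lower().startswith("guess:"):
            correct = message.split(":", 1)[1].strip().lower() == secret_word.lower()
            return {
                "win": correct,
                "score": 20 - num_asked if correct else 0,
                "num_questions_asked": num_asked,
            }
        reply = answer_as_gamemaster(secret_word, message)  # "yes", "no" or "skip"
        history.append((message, reply))
    return {"win": False, "score": 0, "num_questions_asked": max_questions}
```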
An example interaction from the 20 Questions eval. The gamemaster thinks of a word, and the player must guess it in 20 questions or fewer.
20 Questions aims to evaluate an ability that is often needed in the course of scientific research: generating and evaluating competing hypotheses. Just as a researcher might come up with alternative explanations for an empirical phenomenon, or with different formulations of a theoretical framework, a model playing 20 Questions has to generate guesses about the word and ask the right questions to guide its search.
Wrapping up
I hope this was interesting! If you want to run or extend these evals, have a look at the evals repo and follow the instructions. I would be curious to hear feedback on your experience. There are currently many other open-source evals covering a pretty diverse set of capabilities. A few examples include Sandbagging, Self Prompting, and Steganography. You can have a look at their README files here.