One thing that makes large language models (LLMs) so powerful is the diversity of tasks to which they can be applied: the same machine-learning model that helps a graduate student draft an email could also help a clinician diagnose cancer.
However, the broad applicability of these models makes them difficult to evaluate in a systematic way: it would be impossible to build a benchmark dataset that tests a model on every type of question it could be asked.
In a new paper, MIT researchers take a different approach. They argue that, because humans decide when to deploy large language models, evaluating the models requires an understanding of how people form beliefs about their capabilities.
For example, a graduate student must decide whether the model would be helpful in drafting a particular email, and a clinician must determine which cases would be best to consult the model on.
Building on this idea, the researchers created a framework to evaluate an LLM based on how well it aligns with human beliefs about how it will perform on a given task.
They introduce the idea of a human generalization function, a model of how people update their beliefs about an LLM’s capabilities after interacting with it. They then evaluate how well LLMs align with this human generalization function.
Their results suggest that when models are misaligned with the human generalization function, users may be over- or underconfident about where to deploy them, which can lead to unexpected failures. The same misalignment can also cause more capable models to perform worse than smaller ones in high-stakes situations.
“These tools are interesting because they are general-purpose, but precisely because they are general-purpose they will be collaborating with people, so we have to keep the human in the loop,” says study co-author Ashesh Rambachan, an assistant professor of economics and a senior research scientist in the Laboratory for Information and Decision Systems (LIDS).
Rambachan’s co-authors on the paper include Keyon Vafa, a postdoc at Harvard University, and Sendhil Mullainathan, a professor of electrical engineering and computer science and of economics at MIT and a LIDS fellow. The research will be presented at the International Conference on Machine Learning.
Human generalization
As we interact with other people, we form beliefs about what they know and what they don’t know. For example, if a friend is quick to correct people’s grammar, you might generalize and assume they would also excel at sentence construction, even though you’ve never asked them about it.
“Language models often look very human, and we wanted to show that this power of human generalization is also present in the way people form beliefs about language models,” says Rambachan.
As a starting point, the researchers formally defined the human generalization function: it involves asking a question, observing how a person or an LLM responds, and then inferring how that person or model would respond to related questions.
If someone sees that an LLM can correctly answer questions about matrix inversion, they might assume it can also answer questions about simple arithmetic. A model that is misaligned with this function, meaning it performs poorly on questions a human expects it to answer correctly, could fail when deployed.
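To make the idea concrete, here is a minimal sketch of how such a generalization function and a misalignment check might look in code. It is an illustration only, assuming a toy rule of thumb: the question strings, belief values, and function names are hypothetical and are not taken from the paper or its survey.

```python
# Illustrative sketch only: the questions, belief values, and function names
# are hypothetical and are not taken from the paper or its survey data.

def human_generalization(observed_q: str, observed_correct: bool,
                         related_q: str) -> float:
    """Belief in [0, 1] that the model will answer related_q correctly,
    after seeing it answer observed_q correctly or incorrectly."""
    # Toy rule: seeing a harder question answered correctly leads a person
    # to expect success on an easier, related question.
    if observed_correct and "matrix inverse" in observed_q and "arithmetic" in related_q:
        return 0.95
    return 0.5  # otherwise, no strong belief either way

def misalignment(belief: float, model_actually_correct: bool) -> float:
    """Gap between the human's predicted success and the model's actual result."""
    return abs(belief - (1.0 if model_actually_correct else 0.0))

belief = human_generalization(
    observed_q="matrix inverse of [[1, 2], [3, 4]]",
    observed_correct=True,
    related_q="simple arithmetic: 17 + 25",
)
print(misalignment(belief, model_actually_correct=False))  # 0.95: an overconfident deployment
```

In this toy setup, a large gap means the human generalization would lead someone to trust the model on a question it actually gets wrong, which is the kind of failure the paper is concerned with.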
Based on this formal definition, the researchers designed a survey to measure how people generalize when they interact with LLMs and with other people.
They showed participants questions that a person or an LLM had answered correctly or incorrectly, and then asked whether they thought that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of about 19,000 examples of how humans generalize about LLM performance across 79 diverse tasks.
Measuring misalignment
The researchers found that participants did quite well at predicting whether a person who answered one question correctly would also answer a related question correctly, but they were much worse at generalizing about the performance of LLMs.
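One simple way to picture how such a comparison could be quantified is to average the gap between participants’ stated beliefs and the targets’ actual answers, separately for human and LLM targets. The sketch below is a hypothetical scoring scheme for exposition, not the paper’s actual analysis, and the records shown are invented:

```python
from collections import defaultdict

# Hypothetical records, not the paper's dataset: each pairs a participant's
# belief that a target (a person or an LLM) will answer a related question
# correctly with what that target actually did.
records = [
    {"target": "human", "belief_correct": 0.9, "actually_correct": True},
    {"target": "human", "belief_correct": 0.7, "actually_correct": True},
    {"target": "llm",   "belief_correct": 0.9, "actually_correct": False},
    {"target": "llm",   "belief_correct": 0.8, "actually_correct": True},
]

def average_gap(records):
    """Mean absolute gap between stated beliefs and actual outcomes, by target."""
    gaps = defaultdict(list)
    for r in records:
        actual = 1.0 if r["actually_correct"] else 0.0
        gaps[r["target"]].append(abs(r["belief_correct"] - actual))
    return {target: round(sum(g) / len(g), 2) for target, g in gaps.items()}

print(average_gap(records))
# {'human': 0.2, 'llm': 0.55}: beliefs track people better than they track the LLM
```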
“Human generalization gets applied to language models, but that breaks down, because these language models don’t actually show the same patterns of expertise that people do,” says Rambachan.
Participants were also more likely to update their beliefs about an LLM when it answered questions incorrectly than when it answered correctly. They also tended to believe that an LLM’s performance on simple questions would have little bearing on its performance on more complex ones.
In situations where people put more weight on incorrect responses, simpler models outperformed very large models like GPT-4.
“Better language models can almost fool people into thinking they will perform well on related questions when, in reality, they don’t,” he says.
One explanation for why people are worse at generalizing about LLMs may stem from the models’ novelty: people have far less experience interacting with LLMs than with other people.
“It’s possible that this will get better in the future simply as we interact more with language models,” he says.
To this end, the researchers want to conduct further studies of how people’s beliefs about LLMs evolve over time as they interact with a model, and to explore how human generalization could be incorporated into the development of LLMs.
“When we train these algorithms in the first place, or try to update them with human feedback, we need to account for the human generalization function in how we think about measuring performance,” he says.
In the meantime, the researchers hope their dataset can be used as a benchmark for comparing how LLMs perform relative to the human generalization function, which could help improve the performance of models deployed in real-world situations.
“For me, the contribution of this paper is twofold. The first is practical: It illuminates a critical problem in deploying LLMs for general consumer use. If people don’t have a good understanding of when an LLM will be correct and when it will fail, they will be more likely to see mistakes and perhaps be discouraged from using it further. This highlights the problem of aligning models with people’s understanding of generalization,” says Alex Imas, a professor of behavioral science and economics at the University of Chicago Booth School of Business, who was not involved in this work. “The second contribution is more fundamental: The lack of generalization to expected problems and domains helps us get a better picture of what a model is doing when it gets a problem ‘correct.’ It provides a test of whether LLMs ‘understand’ the problem they are solving.”
This research was funded by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.