People use large language models for a huge array of tasks, from translating articles to detecting financial fraud. But despite the power and versatility of these models, they sometimes generate inaccurate responses.
On top of that, models can be overconfident about wrong answers or underconfident about correct ones, making it hard for users to know when a model can be trusted.
Researchers typically calibrate machine learning models so that their confidence matches their accuracy: a well-calibrated model should have low confidence in predictions that turn out to be wrong and high confidence in predictions that turn out to be right. But because large language models (LLMs) can be applied to a seemingly endless collection of diverse tasks, traditional calibration methods fall short.
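To make the idea concrete, calibration is commonly measured by grouping predictions into confidence buckets and checking how often each bucket is actually correct. The following is a minimal, illustrative sketch of that check (a simple expected calibration error), not code from the paper; the array names are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # how often the model was right
            conf = confidences[mask].mean()   # how sure it said it was
            ece += mask.mean() * abs(acc - conf)
    return ece

# In a well-calibrated model, predictions made with ~0.7 confidence
# should be correct about 70 percent of the time.
print(expected_calibration_error([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
```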
Now, researchers at MIT and the MIT-IBM Watson AI Lab have introduced a calibration method tailored to large language models. Their method, called Thermometer, involves building a smaller auxiliary model that runs on top of the large language model to calibrate it.
Thermometer is more efficient than other approaches: it requires less computation, preserves the model's accuracy, and produces better-calibrated responses on previously unseen tasks.
Thermometer lets users efficiently calibrate an LLM for a variety of tasks, flagging situations where the model is overly confident in incorrect predictions and ultimately helping them avoid deploying it where it is likely to fail.
“With Thermometer, we want to provide the user with a clear signal of whether a model's response is accurate or inaccurate, in a way that reflects the model's uncertainty, so they know whether the model is trustworthy,” says Maohao Shen, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on Thermometer.
Shen wrote the paper with Gregory Wornell, the Sumitomo Professor of Engineering who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics; senior author Soumya Ghosh, a research scientist at the MIT-IBM Watson AI Lab; and others at MIT and the MIT-IBM Watson AI Lab. The research was recently presented at the International Conference on Machine Learning.
Universal calibration
Traditional machine learning models are typically designed to perform a single task, so calibrating them usually involves one task-specific method. LLMs, by contrast, are flexible enough to perform many tasks, so calibrating the model with a traditional method for one task can hurt its performance on others.
Calibrating an LLM often involves sampling the model multiple times to obtain different predictions and then aggregating those predictions to get a better-calibrated confidence estimate. But because these models have billions of parameters, the computational cost of this approach adds up quickly.
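To see why that gets expensive, a common recipe is to sample the model several times per prompt and treat agreement among the samples as a confidence score. The sketch below is a generic, hypothetical version of this idea; generate_answer stands in for a full LLM generation, so the cost grows with every extra sample.

```python
from collections import Counter

def sampled_confidence(prompt, generate_answer, n_samples=20):
    """Estimate confidence as the fraction of sampled answers that agree.
    Each call to generate_answer is a full (expensive) LLM generation."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # the majority answer and its empirical confidence
```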
“In a sense, large language models are universal because they can handle a variety of tasks. So we need a universal calibration method that can also handle many different tasks,” Shen says.
In developing Thermometer, the researchers built a versatile technique that efficiently calibrates an LLM for new tasks by leveraging a classical calibration method called temperature scaling.
In this context, the “temperature” is a scaling parameter used to adjust the model's confidence so that it aligns with its prediction accuracy. Traditionally, the right temperature is determined using a labeled validation dataset of task-specific examples.
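Concretely, temperature scaling divides a model's output logits by a scalar temperature before the softmax, softening overconfident probabilities when the temperature is above 1; the temperature is usually fit on a labeled validation set by minimizing cross-entropy. The sketch below is a minimal PyTorch-style illustration of that classical procedure, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Fit a single scalar temperature on held-out labeled data by
    minimizing cross-entropy of the temperature-scaled logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Calibrated probabilities are softmax(logits / T); dividing by T never changes
# the argmax, so the model's accuracy is unaffected.
```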
Because LLMs are often applied to new tasks, labeled datasets can be nearly impossible to obtain. For example, a user who wants to deploy an LLM to answer customer questions about a new product is unlikely to have a dataset of such questions and answers.
Instead of using a labeled dataset, the researchers trained an auxiliary model that runs on top of the LLM and automatically predicts the temperature needed to calibrate it for a new task.
They trained the Thermometer model using labeled datasets for a few representative tasks, but once trained, it can generalize to new tasks in a similar category without additional labeled data.
For example, a Thermometer model trained on datasets of multiple-choice questions, perhaps including algebra and medical problems, could be used to calibrate an LLM that answers questions about geometry or biology.
“The goal is to be able to do any task, but we’re not there yet,” says Ghosh.
The Thermometer model only needs access to a small part of the LLM's inner workings to predict the right temperature for calibrating its predictions on data points from a particular task.
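One rough way to picture the auxiliary model: a small network reads features derived from the LLM, such as hidden activations for each data point, and outputs a temperature for the task, which is then used exactly as in ordinary temperature scaling. The sketch below is a simplified guess at that structure rather than the released Thermometer code; the feature extraction and architecture details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperaturePredictor(nn.Module):
    """Toy auxiliary network: maps LLM-derived features to a positive temperature."""
    def __init__(self, feature_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features):
        # softplus keeps the predicted temperature strictly positive
        return F.softplus(self.net(features)) + 1e-3

# Usage sketch: average per-example temperatures over a task's unlabeled prompts,
# then rescale the LLM's logits with that single task-level temperature.
# task_temp = predictor(llm_features).mean()
# calibrated_probs = torch.softmax(llm_logits / task_temp, dim=-1)
```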
An efficient approach
Importantly, the technique does not require multiple training runs and only slightly slows down the LLM. Moreover, because temperature scaling does not alter a model's predictions, Thermometer preserves its accuracy.
When the researchers compared Thermometer to several baselines across a variety of tasks, it consistently produced better-calibrated uncertainty measures while requiring far less computation.
“As long as we train a Thermometer model on a sufficiently large number of tasks, it should be able to generalize to any new task, just like a large language model. It is also a universal model,” Shen adds.
The researchers also found that a Thermometer model trained for a smaller LLM can be applied directly to calibrate larger LLMs within the same family.
In the future, they want to apply Thermometer to more complex text-generation tasks and to much larger LLMs. The researchers also hope to quantify the diversity and number of labeled datasets needed to train a Thermometer model so that it can generalize to new tasks.
This research was partially funded by the MIT-IBM Watson AI Lab.