Large language models (LLMs) struggle to effectively use additional computation at test time to improve the accuracy of their responses, especially on complex tasks. Researchers are exploring ways to let LLMs think longer about difficult problems, much as humans do. This capability could open new avenues in agentic and reasoning tasks, allow small on-device models to replace datacenter-scale LLMs, and provide a path toward general self-improvement algorithms that require less human supervision. However, current approaches have shown mixed results: some studies demonstrate that test-time computation improves LLM outputs, while others show limited effectiveness on complex tasks such as mathematical reasoning. These conflicting results highlight the need for a systematic analysis of different approaches to scaling test-time computation in LLMs.
Researchers have made significant progress in improving language model performance on mathematical reasoning tasks through a variety of approaches. These include continued pretraining on math-heavy data, improving the LLM's proposal distribution through targeted optimization and iterative answer revision, and enabling LLMs to benefit from additional test-time computation with fine-tuned verifiers. Several methods have been proposed to augment LLMs with test-time computation, such as hierarchical hypothesis search for inductive reasoning, tool augmentation, and learned thought tokens for more efficient use of extra test-time compute. However, the effectiveness of these methods varies depending on the specific problem and the underlying LLM. For easy problems, where the underlying LLM already produces reasonable responses, iteratively refining the initial answer through a sequence of revisions may be more effective. For more difficult problems that require exploring a variety of high-level approaches, sampling independent responses in parallel or running tree search against a process-based reward model may be more beneficial. In particular, for mathematical reasoning problems where the correct answer is not known at test time, how test-time compute scales in language models remains an important open question.
Researchers at UC Berkeley and Google DeepMind propose an adaptive "compute-optimal" strategy for scaling test-time computation in LLMs. This approach selects the most effective way to spend additional computation based on the difficulty of the specific prompt. By measuring question difficulty from the perspective of the underlying LLM, the researchers can predict the efficacy of test-time computation and implement this compute-optimal strategy in practice. This adaptive allocation of test-time computation significantly improves scaling performance, outperforming best-of-N baselines while using approximately 4x less computation for both revision and search methods. The researchers then compare the effectiveness of this improved test-time scaling strategy against the alternative of pretraining a larger model.
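As a rough illustration, the selection logic might look like the sketch below. The helper callables (estimate_difficulty, run_revisions, run_parallel_search) and the budget thresholds are hypothetical placeholders for this example, not the paper's actual implementation:

```python
# Hypothetical sketch of a "compute-optimal" test-time strategy selector.
# The helper callables and difficulty thresholds are illustrative assumptions.

def compute_optimal_answer(prompt, budget, estimate_difficulty,
                           run_revisions, run_parallel_search):
    """Allocate a fixed generation budget based on estimated prompt difficulty.

    budget: total number of model generations we are willing to spend.
    """
    difficulty = estimate_difficulty(prompt)  # e.g., a bin in {1, ..., 5}

    if difficulty <= 2:
        # Easy prompts: the base model's first attempt is usually close,
        # so spend most of the budget on sequential self-revisions.
        return run_revisions(prompt, num_revisions=budget)
    elif difficulty <= 4:
        # Medium prompts: split the budget between independent samples
        # (exploring different high-level approaches) and local revisions.
        return run_parallel_search(prompt, num_samples=budget // 2,
                                   revisions_per_sample=2)
    else:
        # Hard prompts: favor broad parallel exploration (e.g., best-of-N
        # or verifier-guided search) over local refinement.
        return run_parallel_search(prompt, num_samples=budget,
                                   revisions_per_sample=0)
```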
The use of additional test-time computation in LLMs can be viewed through a unified lens: adaptively modifying the model's predicted distribution at test time. This modification can be achieved through two main approaches: changing the proposal distribution and optimizing against a verifier. To improve the proposal distribution, researchers have explored RL-inspired fine-tuning methods (e.g., STaR, ReSTEM) and self-critique techniques. These approaches let the model iteratively critique and revise its initial response, improving its own output at test time. Fine-tuning the model on on-policy data with best-of-N guided improvements has shown promise on complex reasoning tasks.
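A minimal sketch of sequential self-revision is shown below, assuming a generic generate(prompt) text-generation callable; the revision prompt format and the number of revisions are assumptions for illustration, not the paper's exact recipe:

```python
# Illustrative sketch of sequential self-revision at test time.
# `generate` is an assumed callable that maps a prompt string to a completion.

def revise_sequentially(generate, question, num_revisions=4):
    """Produce a chain of answers, each conditioned on the previous attempts."""
    attempts = []
    answer = generate(f"Question: {question}\nAnswer step by step:")
    attempts.append(answer)
    for _ in range(num_revisions):
        revision_prompt = (
            f"Question: {question}\n"
            "Previous attempts:\n" + "\n---\n".join(attempts) + "\n"
            "The previous attempts may contain mistakes. "
            "Write an improved, corrected solution:"
        )
        answer = generate(revision_prompt)
        attempts.append(answer)
    return attempts  # a verifier can then pick the best attempt in the chain
```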
For verifier optimization, the traditional best-of-N sampling method can be improved by learning a process-based verifier, or process reward model (PRM). A PRM predicts the correctness of each intermediate step of a solution, not just the final answer. Leveraging these per-step predictions makes it possible to run a more efficient and effective tree search over the solution space, which can outperform naive best-of-N sampling. Modifying the proposal distribution and optimizing the verifier thus form two independent axes for improving test-time computation in language models, and the effectiveness of each approach may vary depending on the task and model.
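The snippet below sketches one simple way a PRM can sharpen best-of-N selection by scoring intermediate steps. Here generate_solution and prm_score_step are assumed stand-ins for the sampler and the learned per-step verifier, and taking the minimum of the step scores is just one common aggregation choice, not necessarily the one used in the paper:

```python
# Minimal sketch of verifier-weighted best-of-N selection with a process
# reward model (PRM). The two callables are assumed interfaces, not a real API.

def best_of_n_with_prm(generate_solution, prm_score_step, question, n=16):
    """Sample N solutions and pick the one whose steps the PRM trusts most."""
    best_solution, best_score = None, float("-inf")
    for _ in range(n):
        solution_steps = generate_solution(question)  # list of reasoning steps
        # Score each prefix of the solution with the PRM, then aggregate.
        step_scores = [prm_score_step(question, solution_steps[: i + 1])
                       for i in range(len(solution_steps))]
        score = min(step_scores) if step_scores else float("-inf")
        if score > best_score:
            best_solution, best_score = solution_steps, score
    return best_solution, best_score
```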
This approach involves selecting the best-performing hyperparameters of a given test-time strategy to maximize performance gains. To implement it, the researchers introduce a way to estimate question difficulty, which serves as the key factor in deciding how to allocate compute. Question difficulty is defined using the performance of the base LLM: questions are divided into five difficulty levels based on the model's pass@1 rate. This model-specific difficulty measure proves to be a better predictor of test-time compute efficacy than hand-labeled difficulty bins. To make the strategy practical without relying on ground-truth answers, the researchers approximate question difficulty with a model-predicted notion based on learned verifier scores, which allows difficulty to be assessed and a strategy to be selected without knowing the correct answer in advance. A validation set is then used to determine the compute-optimal strategy for each difficulty bin, and that choice is applied to the test set. This allows test-time compute to be allocated adaptively, which can significantly improve performance over uniform or ad hoc allocation strategies.
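A small sketch of the binning step is shown below; the quantile-based split into five equal-sized bins is an assumption about how the levels could be drawn in practice, and in the answer-free variant the pass@1 rates would be replaced by verifier scores averaged over sampled solutions:

```python
import numpy as np

# Sketch of a model-specific difficulty measure: bin questions into five
# levels by the base model's pass@1 rate (quantile split is an assumption).

def assign_difficulty_bins(pass_at_1_rates, num_bins=5):
    """Map each question's pass@1 rate to a difficulty bin (1 = easiest)."""
    rates = np.asarray(pass_at_1_rates, dtype=float)
    # Quantile edges so each bin holds roughly the same number of questions.
    edges = np.quantile(rates, np.linspace(0, 1, num_bins + 1))[1:-1]
    # Higher pass@1 means easier, so invert the bin index.
    bins = num_bins - np.searchsorted(edges, rates, side="right")
    return bins  # values in {1, ..., num_bins}
```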
The study analyzes various approaches to scaling test-time compute in LLMs, including search algorithms guided by process reward models (PRMs) and improving the proposal distribution through sequential revisions. Beam search outperforms best-of-N when the generation budget is small, but this advantage diminishes as the budget grows. Sequential revision generally outperforms parallel sampling, and the optimal ratio between the two varies with problem difficulty: easy problems benefit more from sequential revision, while hard problems require a balance between sequential and parallel compute. The effectiveness of search methods also varies with difficulty, with beam search showing gains on medium-difficulty problems but signs of over-optimization on easy ones. By selecting strategies based on problem difficulty and compute budget, compute-optimal scaling can outperform best-of-N baselines using up to 4x less test-time compute. The study also shows that test-time compute is more beneficial for easy to medium-difficulty problems or in settings with low inference load, while additional pretraining is more effective for hard problems or heavy inference demands.
This study demonstrates the importance of an adaptive "compute-optimal" strategy for scaling test-time computation in LLMs. By estimating how the efficacy of test-time computation depends on problem difficulty, the researchers implement a practical strategy that outperforms best-of-N baselines with 4x less computation. Comparing additional test-time computation against pretraining larger models, they find that for easy to moderately difficult problems, test-time computation often beats additional pretraining; for the most difficult problems, additional pretraining remains more effective. These results suggest a likely future shift toward allocating fewer FLOPs to pretraining and more FLOPs to inference, highlighting the changing landscape of LLM optimization and deployment.
Check out the paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Asjad is an intern consultant at Marktechpost. He is pursuing his Bachelor of Science in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.