During development and fine-tuning, regular evaluation is essential to assess model improvements. Once a model is deployed, benchmarking also helps detect performance degradation, for example after model updates or changes to the inference framework.
Large language models (LLMs) are typically evaluated using public benchmarks such as MMLU, GPQA, BIG-Bench, and IFEval. These benchmarks provide insights into an LLM’s capabilities in reasoning, natural language understanding, world knowledge, and instruction following.
However, running these evaluations can be costly, especially when using default benchmark parameters and inference frameworks.
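To make this concrete, here is a minimal sketch of what a typical benchmark run looks like. The article does not name a specific tool, so I assume the widely used lm-evaluation-harness (`pip install lm-eval`); the model ID is a placeholder for a 1.5B-parameter model, and the configuration deliberately keeps the defaults (standard few-shot settings, plain Hugging Face Transformers backend), which is exactly the kind of setup that makes evaluation slow and expensive.

```python
# Minimal sketch, assuming the lm-evaluation-harness library.
# Model ID and task list are placeholders, not the article's exact setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # default Hugging Face Transformers backend (no optimized inference)
    model_args="pretrained=Qwen/Qwen2.5-1.5B-Instruct,dtype=bfloat16",
    tasks=["mmlu", "ifeval"],  # public benchmarks like those mentioned above
    num_fewshot=None,          # keep each task's default few-shot configuration
    batch_size=8,
)

# Per-task scores are stored under the "results" key of the returned dictionary.
for task, metrics in results["results"].items():
    print(task, metrics)
```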
In this article, we will analyze the evaluation costs of LLMs. As examples, I will evaluate a 1.5B-parameter model and a 32B-parameter model on RunPod (referral link) and report the total cost in USD. We will then explore strategies that significantly reduce evaluation costs while keeping the results credible. Finally, because even tiny changes in the evaluation hyperparameters can noticeably shift the scores, I will provide short guidelines for reporting results in scientific papers and technical reports.
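As a preview of the cost accounting, here is a minimal sketch of how the total USD figure can be derived from the measured wall-clock runtime and the GPU's hourly rate. The runtimes and rates below are placeholders for illustration only, not actual RunPod pricing or measured results.

```python
# Sketch: total evaluation cost = wall-clock runtime (hours) x GPU hourly rate (USD) x number of GPUs.
# All numbers below are hypothetical placeholders.

def evaluation_cost_usd(runtime_hours: float, hourly_rate_usd: float, num_gpus: int = 1) -> float:
    """Return the total cost in USD of a benchmark run."""
    return runtime_hours * hourly_rate_usd * num_gpus

# Hypothetical examples: a small model on a cheap GPU vs. a large model on a pricier one.
print(evaluation_cost_usd(runtime_hours=2.0, hourly_rate_usd=0.5))   # e.g., a 1.5B model on one 24 GB GPU
print(evaluation_cost_usd(runtime_hours=10.0, hourly_rate_usd=2.0))  # e.g., a 32B model on one 80 GB GPU
```

The point of this back-of-the-envelope formula is that both factors matter: shortening the runtime (better inference frameworks, smaller benchmark subsets) and picking an appropriately sized GPU both reduce the final bill.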
Note: This article is not yet linked to an AI notebook. I might add one later. I’ll let you know in the Weekly Kaitchup.