Behind the OpenLLM Leaderboard: The Evaluation Harness
Evaluate quantized LLMs and LoRA adapters on your computer
When a new large language model (LLM) is released, it is now common to see it submitted to and ranked on the OpenLLM leaderboard. An LLM ranked first on this leaderboard has a good chance of “trending” on the Hugging Face Hub for the following weeks.
Behind the scenes, the OpenLLM leaderboard simply runs the Evaluation Harness, an open-source framework by EleutherAI that can evaluate LLMs on many public benchmarks: MMLU, HellaSwag, TruthfulQA, LAMBADA, and many more are all available in the Evaluation Harness.
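As a quick illustration (not the leaderboard's exact setup), here is a minimal sketch of running one of these benchmarks locally, assuming a recent version of the harness (the lm-eval package, v0.4 or later); the model, task, and batch size below are just examples:

```python
# A minimal sketch of running one benchmark with the Evaluation Harness.
# Assumes a recent version of the package (pip install lm-eval) and a GPU.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # the Hugging Face Transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",  # example model
    tasks=["hellaswag"],  # any task shipped with the harness
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# The returned dictionary maps each task to its metrics (accuracy, etc.).
print(results["results"]["hellaswag"])
```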
In this article, I present the Evaluation Harness. In particular, we will see how to use it to evaluate, on your own computer, the performance of quantized LLMs and LoRA adapters. More generally, if an LLM runs on your computer, you can easily evaluate it with the Evaluation Harness.
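To preview what this looks like, here is a hedged sketch, assuming a recent version of the harness with its Hugging Face backend, where 4-bit quantization and a LoRA adapter are requested through model_args; the adapter path is only a placeholder:

```python
# A sketch (not the article's notebook) of evaluating a 4-bit quantized
# base model with a LoRA adapter loaded on top of it.
# Assumes lm-eval v0.4+, bitsandbytes, and peft are installed;
# "path/to/your-lora-adapter" is a placeholder (local path or Hub ID).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Llama-2-7b-hf,"
        "load_in_4bit=True,"              # on-the-fly 4-bit quantization (bitsandbytes)
        "peft=path/to/your-lora-adapter"  # LoRA adapter applied to the base model
    ),
    tasks=["truthfulqa_mc2"],  # example task
    batch_size=4,
    device="cuda:0",
)
print(results["results"])
```

With load_in_4bit=True, the base model is quantized on the fly by bitsandbytes, and the peft argument loads the adapter on top of it before evaluation starts.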
Here is a notebook running the Evaluation Harness for quantized LLMs and LoRA adapters, using Llama 2 7B for the examples: