Behind the OpenLLM Leaderboard: The Evaluation Harness

Evaluate quantized LLMs and LoRA adapters on your computer

Benjamin Marie
Dec 21, 2023

When a new large language model (LLM) is released, it is now common to see it submitted and ranked on the OpenLLM leaderboard. An LLM ranked first on this leaderboard has a good chance of trending on the Hugging Face Hub for the following weeks.

Behind the scenes, the OpenLLM leaderboard simply runs the Evaluation Harness, an open-source framework by EleutherAI that can evaluate LLMs on many public benchmarks. MMLU, HellaSwag, TruthfulQA, LAMBADA, and many others are all available in the Evaluation Harness.
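
To make this concrete, here is a minimal sketch of how a single benchmark run could look through the harness's Python API. It assumes a recent release installed with "pip install lm-eval"; the model type string ("hf") and the argument names vary slightly across versions, so treat them as assumptions to check against your install.

# Minimal sketch: zero-shot HellaSwag evaluation with the Evaluation Harness.
# Assumes a recent lm-evaluation-harness release (pip install lm-eval);
# the model type string and argument names may differ across versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # Hugging Face Transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# Each task reports its own metrics, e.g., accuracy variants for HellaSwag.
print(results["results"]["hellaswag"])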

In this article, I present the Evaluation Harness. In particular, we will see how to use it to check the performance of quantized LLMs and LoRA adapters locally. If the LLM you want to evaluate runs on your computer, you can easily evaluate it with the Evaluation Harness.
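
As a rough preview (not the notebook's exact code), evaluating a 4-bit quantized Llama 2 7B together with a LoRA adapter could look like the sketch below. The load_in_4bit and peft model arguments, and the placeholder adapter repository name, are assumptions to verify against the harness version you install.

# Sketch: evaluate a 4-bit quantized Llama 2 7B with a LoRA adapter on top.
# Assumptions: the installed harness supports the load_in_4bit and peft
# arguments of its Hugging Face backend, and bitsandbytes/peft are installed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Llama-2-7b-hf,"
        "load_in_4bit=True,"  # quantize at load time with bitsandbytes
        "peft=my-username/llama-2-7b-lora"  # placeholder LoRA adapter repo
    ),
    tasks=["hellaswag", "arc_challenge"],
    batch_size=4,
    device="cuda:0",
)

print(results["results"])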

Here is a notebook that runs the Evaluation Harness on quantized LLMs and LoRA adapters, using Llama 2 7B for the examples:

Get the notebook (#33)
