The Kaitchup – AI on a Budget

MMLU: Do LLMs Really Know?

When 1. = A. = a. = A) disrupts LLMs' world knowledge and language understanding capabilities

Benjamin Marie
Jan 16, 2025

[Image generated with ChatGPT]

When released, large language models (LLMs) are evaluated on various benchmarks that test different skills, such as commonsense reasoning, coding skills, language understanding, problem-solving, and world knowledge.

Language understanding and world knowledge are frequently assessed using the MMLU benchmark (or one of its various adaptations), which consists of thousands of multiple-choice questions (MCQs), each with four possible answers, one of which is correct. Unlike generative benchmarks, MMLU does not evaluate the model's ability to generate text. Instead, the LLM assigns a score to each of the four answers, and if the correct answer receives the highest score, the LLM is deemed to "know" the answer.
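To make this scoring concrete, here is a minimal sketch of log-likelihood-based MCQ scoring with Hugging Face transformers. It is not the exact MMLU or lm-eval implementation; the model name and the example question are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = (
    "What is the capital of France?\n"
    "A. Berlin\nB. Paris\nC. Madrid\nD. Rome\n"
    "Answer:"
)
choices = [" A", " B", " C", " D"]

def choice_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens, given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    # Simplification: assumes the tokenization splits cleanly at the prompt boundary.
    cont_len = full_ids.shape[1] - prompt_len
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i predicts token i+1, hence the shift by one.
    log_probs = torch.log_softmax(logits[0, -cont_len - 1:-1], dim=-1)
    cont_ids = full_ids[0, -cont_len:]
    return log_probs[torch.arange(cont_len), cont_ids].sum().item()

scores = {c.strip(): choice_logprob(question, c) for c in choices}
print(scores)  # the answer with the highest score counts as the model's prediction
```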

Do LLMs really know that this answer is correct?

State-of-the-art LLMs with 7B+ parameters can now achieve an accuracy exceeding 70% on MMLU. Considering the inherent noise in MMLU, including ambiguous questions and occasional errors in the gold answers, this performance is not so far from the benchmark's practical limits. If an LLM genuinely "knows" the correct answer, minor changes to the prompt format or the task description should not significantly affect the results. Furthermore, a truly knowledgeable LLM should not only identify the correct answer but also recognize that the other options are incorrect.


In this article, I present a new evaluation benchmark, MULULU (Multi-task Universal Language Understanding with Lower Uncertainty; name suggested by ChatGPT), derived from MMLU to test how consistent an LLM is in its answers and how robust it is to small changes in MCQ benchmarks like MMLU. We'll observe that altering the format of answer IDs, such as switching from uppercase to lowercase letters or from letters to numbers, can significantly affect model accuracy, even though such changes wouldn't pose any issue for humans (see the illustration below). Additionally, we'll see that most LLMs still struggle to identify an incorrect answer when explicitly asked to do so.
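To give a concrete picture of these perturbations, the snippet below (illustrative only, not the actual MULULU generation code) renders the same question with a few different answer-ID formats; the question itself is a placeholder.

```python
# The same MCQ rendered with different answer-ID styles. A human reads all
# variants identically; the article shows that LLM accuracy can shift noticeably.
question = "Which planet is known as the Red Planet?"
answers = ["Venus", "Mars", "Jupiter", "Saturn"]

id_styles = {
    "uppercase": ["A.", "B.", "C.", "D."],
    "lowercase": ["a.", "b.", "c.", "d."],
    "numbers":   ["1.", "2.", "3.", "4."],
    "parens":    ["A)", "B)", "C)", "D)"],
}

for style, ids in id_styles.items():
    lines = [question] + [f"{i} {a}" for i, a in zip(ids, answers)]
    print(f"--- {style} ---\n" + "\n".join(lines) + "\nAnswer:\n")
```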

For this evaluation, I used the LM Evaluation Harness (lm-eval). In the notebook below, I show how to integrate a custom benchmark into this framework and use it to evaluate LLMs.

Get the notebook (#136)
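For readers who haven't used lm-eval's custom tasks before, here is a rough sketch of what running such an evaluation from Python can look like with recent versions of the harness. The task name mululu, the model, and the ./my_tasks path are placeholders; the notebook contains the actual integration.

```python
# Assumes a recent version of lm-evaluation-harness ("lm_eval") and that
# ./my_tasks/ contains a YAML task definition named "mululu" (both placeholders).
import lm_eval
from lm_eval.tasks import TaskManager

# Make lm-eval aware of task configs stored outside the library.
task_manager = TaskManager(include_path="./my_tasks")

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16",
    tasks=["mululu"],                             # custom task name (assumed)
    num_fewshot=0,
    batch_size=8,
    task_manager=task_manager,
)

print(results["results"])
```

If I remember correctly, the CLI offers an equivalent route via an `--include_path` argument pointing at the directory of custom task configs.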
