MMLU: Do LLMs Really Know?
When 1. = A. = a. = A) disrupts LLMs' world knowledge and language understanding capabilities
When released, large language models (LLMs) are evaluated on various benchmarks that test different skills, such as commonsense reasoning, coding, language understanding, problem-solving, and world knowledge.
Language understanding and world knowledge are frequently assessed using the MMLU benchmark (or its various adaptations), which consists of thousands of multiple-choice questions (MCQs), each with four possible answers, one of which is correct. Unlike generative benchmarks, MMLU does not evaluate the model's ability to generate text. Instead, the LLM assigns a score (typically a log-likelihood) to each of the four candidate answers, and if the correct answer receives the highest score, the LLM is deemed to "know" the answer.
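To make this scoring procedure concrete, here is a minimal sketch of log-likelihood-based MCQ scoring with Hugging Face Transformers. The model name, prompt template, and answer format are illustrative placeholders, not the exact MMLU setup used by evaluation frameworks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "What is the capital of France?\n"
    "A. Berlin\nB. Paris\nC. Rome\nD. Madrid\n"
    "Answer:"
)
choices = [" A", " B", " C", " D"]

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The token at position i is predicted by the logits at position i - 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

scores = [answer_logprob(prompt, c) for c in choices]
prediction = choices[scores.index(max(scores))]
print(scores, "->", prediction)  # the model "knows" the answer if " B" scores highest
```

In practice, lm-eval also reports a length-normalized variant (acc_norm) and handles tokenization boundaries more carefully, but the principle is the same: the model is scored on the answers it is given, not on what it generates.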
Do LLMs really know that this answer is correct?
State-of-the-art LLMs with 7B+ parameters can now achieve an accuracy exceeding 70% on MMLU. Considering the inherent noise in MMLU, including ambiguous questions and occasional errors in the gold answers, this performance is not so far from the benchmark's practical limits. If an LLM genuinely "knows" the correct answer, minor changes to the prompt format or the task description should not significantly affect the results. Furthermore, a truly knowledgeable LLM should not only identify the correct answer but also recognize that the other options are incorrect.
In this article, I present a new evaluation benchmark, MULULU (Multi-task Universal Language Understanding with Lower Uncertainty; a name suggested by ChatGPT), which I derived from MMLU to test how consistent an LLM is in its answers and how robust it is to small changes in MCQ benchmarks like MMLU. We’ll observe that altering the format of the answer IDs, such as switching from uppercase to lowercase letters, or from letters to numbers, can significantly affect the accuracy of the models, even though such changes wouldn’t pose any problem for humans. Additionally, we’ll see that most LLMs still struggle to identify an incorrect answer when explicitly asked to do so.
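To illustrate what these variations look like, here is a purely illustrative snippet that renders the same question under different answer ID formats; the question and options are made up for the example.

```python
# Illustrative only: the same MCQ rendered with different answer ID styles.
# A human reads all of these variants identically; the experiments in this
# article test whether LLMs do too.
question = "What is the capital of France?"
options = ["Berlin", "Paris", "Rome", "Madrid"]

id_styles = {
    "uppercase letters": ["A", "B", "C", "D"],
    "lowercase letters": ["a", "b", "c", "d"],
    "numbers": ["1", "2", "3", "4"],
}

for style, ids in id_styles.items():
    print(f"--- {style} ---")
    print(question)
    for answer_id, option in zip(ids, options):
        print(f"{answer_id}. {option}")
    print()
```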
For this evaluation, I used EleutherAI's LM Evaluation Harness (lm-eval). In the notebook below, I show how to integrate a custom benchmark into this framework and use it to evaluate LLMs.
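As a preview, here is a hedged sketch of how such an evaluation can be launched through lm-eval's Python API (lm-evaluation-harness v0.4+). The task name "mululu", the include path, and the model are placeholders, not the exact configuration used in the notebook; the built-in "mmlu" task works the same way.

```python
# A sketch of running an evaluation with lm-eval's Python API.
# "mululu" and "./mululu_tasks" are placeholders for a custom task defined
# in the harness's YAML task format; replace them with your own names/paths.
import lm_eval
from lm_eval.tasks import TaskManager

# Register custom task definitions located in a local directory
task_manager = TaskManager(include_path="./mululu_tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16",  # example model
    tasks=["mululu"],  # or ["mmlu"] for the original benchmark
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    task_manager=task_manager,
)

print(results["results"])
```

The same options are also available from lm-eval's command-line interface, including an --include_path flag for pointing the harness at a directory of custom task files.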