GPT-4 was announced by OpenAI in March 2023 with impressive demonstrations and bold claims about its performance.
Most of these claims come from their own evaluation of GPT-4.
OpenAI used many existing professional and academic exams for this evaluation.
But evaluating large language models on public benchmarks is extremely challenging.
Models such as GPT-4 are exposed to “data contamination”, i.e., they may have been trained on their evaluation data.
Why is this a problem?
Let’s take an example.
GPT-4 was evaluated on the LSAT exam. To perform a scientifically credible evaluation, OpenAI had to check that the LSAT questions used for the evaluation were not in GPT-4's training data. If they were, GPT-4 could have memorized the questions and would obviously perform better on these specific questions at evaluation time.
It’s like a human who had access to the exam questions before taking the exam.
You could say it’s cheating.
In the GPT-4 technical report, one of the few things OpenAI disclosed about GPT-4 is the data contamination of their evaluation. They described their strategy to quantify and assess this contamination and drew several conclusions from their observations.
In this article, I review and discuss how OpenAI dealt with the data contamination of GPT-4. I expose several pitfalls in their method.
I can’t agree with several of their conclusions.
Decontamination of the evaluation data
To check whether there is an intersection between the training and evaluation data, OpenAI used a very simple technique relying on a substring matching algorithm (described on page 28 of the technical report).
First, they removed all spaces and symbols in the training and evaluation data (the exams). They kept the numbers.
Then, they randomly picked 3 substrings of 50 characters for each question (or equivalent) in the exams used for evaluation. If one of these substrings appeared in the training data of GPT-4, the question was removed from the evaluation data.
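To make the method concrete, here is a minimal Python sketch of what such a check could look like. It is an illustration based on the description in the report, not OpenAI's actual code: the function names, and the assumption that the training data fits into a single normalized string, are mine.

```python
import random
import re

def normalize(text: str) -> str:
    # As described in the report: remove spaces and symbols, keep letters and digits.
    return re.sub(r"[^A-Za-z0-9]", "", text)

def is_contaminated(question: str, normalized_training_data: str,
                    n_samples: int = 3, length: int = 50) -> bool:
    # Sample 3 random 50-character substrings of the normalized question and
    # flag the question if any of them appears in the training data.
    q = normalize(question)
    if len(q) <= length:
        return q in normalized_training_data
    for _ in range(n_samples):
        start = random.randrange(len(q) - length + 1)
        if q[start:start + length] in normalized_training_data:
            return True
    return False

# Hypothetical usage: keep only the questions that are not flagged.
# corpus = normalize(open("training_data.txt").read())
# clean_exam = [q for q in exam_questions if not is_contaminated(q, corpus)]
```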
With this method, they made two critical choices.
The first is that the method relies on random sampling.
Choosing 3 random substrings is particularly problematic for exams with very long questions.
For instance, one question in the Uniform Bar Exam may contain 1,500 sequences of 50 characters. Note: these are very long questions.
Randomly choosing 3 substrings among 1,500 means that a large part of each question is completely ignored by this decontamination strategy.
This strategy can’t reliably detect whether a large part of a question is in the training data.
We can imagine that some of these exam questions have been studied or discussed in GPT-4's training data, but only partially, not in their entirety, since the questions are very long. A partial but significant match would not be detected in that case.
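A rough back-of-the-envelope calculation illustrates the issue (the contamination fractions below are purely illustrative, and I ignore substrings straddling the boundary of the contaminated span): if only 20% of a very long question appears in the training data, sampling 3 substrings uniformly at random misses the overlap about half of the time.

```python
# Probability that all n randomly sampled substrings fall outside the
# contaminated portion of a long question (rough approximation).
def miss_probability(contaminated_fraction: float, n_samples: int = 3) -> float:
    return (1.0 - contaminated_fraction) ** n_samples

print(miss_probability(0.2))  # 0.512 -> ~51% chance the contamination goes undetected
print(miss_probability(0.1))  # 0.729 -> ~73% chance the contamination goes undetected
```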
The Uniform Bar Exam has 400 questions, yet by randomly checking 3 substrings for each question, OpenAI did not find any of them in the training data.
The second critical choice is that they decontaminated the evaluation data and not the training data.
Removing questions from the training data, retraining GPT-4, and then evaluating it on the exams again would have been too costly, obviously.
However, if they had assessed this contamination earlier in their development process, i.e., before training, they could have removed all the exam examples from the training data.
It is also important to note that they didn't include the RLHF data in their decontamination process. If an exam question is in the RLHF data, it will remain in the evaluation data.
Definition
RLHF stands for Reinforcement Learning from Human Feedback. Once pre-trained, GPT-4 is further fine-tuned with reinforcement learning from human feedback to improve its performance. This dataset of “feedback” was not checked during decontamination.
The main reason given for not including the RLHF training data is that the fine-tuning with RLHF did not significantly improve GPT-4's performance: they observed only a +0.3% change in the average score after RLHF post-training.
It’s contaminated
The details of the contamination for each exam are given on page 30 of the report.
Among the 49 exams used for evaluation, 12 were found to be completely absent from the training data: all the Leetcode datasets, the Uniform Bar Exam, the SAT EBRW exam, and some AP exams.
In total, the exams used for evaluation contain 4,123 questions. 545.5 of these questions were found in the training data. Note: Why is there a “.5”? As far as I understand, OpenAI removed a question entirely if there was a match. But for the exam “USA Biolympiad Semifinal Exam 2020”, which contains 150 questions, they note that they removed 3.00% of the questions (see Table 10 of the paper). 3% of 150 is 4.5. One of these numbers is probably wrong.
That means 13.2% of the evaluation data is contaminated.
Interestingly, for several exams, the decontamination seems to improve the results obtained by GPT-4.
This is counter-intuitive.
We may think that if the removed questions were in the training data, GPT-4 should be good at answering them since it had the opportunity to memorize them.
But we know nothing of these excluded questions.
They may be the most difficult ones for some exams, hence the higher percentage of correct answers after excluding them from the evaluation.
OpenAI claims that the contamination didn’t have a significant impact. They note:
Overall across most exams, both contamination and vision have relatively little effect. (Caption of Table 9)
The degradation is generally small and as often positive as negative […] (Caption of Table 10)
This is the “overall” conclusion. If we look closer at the results, that’s not so obvious. Let’s see some of the details.
In Table 10 of the technical report, OpenAI has also evaluated GPT-4 on two separate sets of questions for each exam:
“contaminated”: This set contains only the questions found in the training data.
“non-contaminated”: This set contains all the remaining questions.
This is an interesting experiment. The performance of GPT-4 on these two sets (5th and 6th columns) varies dramatically for some exams, for instance from 41.67% to 0% for AMC 12.
For some other exams, GPT-4 performed better on the evaluation data it didn’t use during training (non-contaminated).
Does it mean that GPT-4 is better for questions it did not see during training?
No, “contaminated” and “non-contaminated” are simply two different evaluation sets.
GPT-4 may perform better on one of the two datasets for many different reasons, for instance, given the topic of the questions, their length, their difficulty, etc.
Is GPT-4 good at these exams?
Let’s take a closer look at the LSAT exam. And let’s say that a score above 160 is a good score on this exam.
GPT-4 achieved a score of 163. After decontamination, removing 39% of the questions, GPT-4 achieved an even better score of 167.
Can we conclude that GPT-4 can achieve a good score on the LSAT exam?
Yes, we can. But only if cheating is allowed.
On one hand, we have the full exam, on which GPT-4 scores 163. It’s a good score, but GPT-4 saw some of the questions before taking the exam.
On the other hand, if we remove 39% of the questions for decontamination, this is no longer an LSAT exam. No human has ever taken a 61% LSAT. This exam doesn’t exist.
Moreover, the 39% of questions removed may contain the most difficult questions. We don’t know if a score of 167 is good or bad on this 61% LSAT.
We can reason similarly for all the other “contaminated” exams used for evaluation.
Some exams were not “contaminated”, such as the Uniform Bar Exam and the Leetcode questions, but there are additional issues.
I won’t write about these issues here. Arvind Narayanan and Sayash Kapoor already discussed the results for these questions in their excellent article:
GPT-4 and professional benchmarks: the wrong answer to the wrong question (aisnakeoil.substack.com)
Conclusion
As I wrote in the introduction, assessing the data contamination of large language models is an extremely difficult task.
When collecting and preprocessing the training data, ideally we should already have identified a list of relevant public exams and benchmarks to exclude from the training data.
Nonetheless, my opinion is that it actually makes a lot of sense for OpenAI to train GPT-4 on all these exams.
The goal is also to make GPT-4 as good as possible at the questions posed by these exams. I can see many potential use cases for GPT-4 in this area, such as helping students and teachers prepare for exams.
Yet, this choice has a cost: We cannot use these exams to evaluate GPT-4 with scientific credibility.