Hi Everyone,
In this edition of The Weekly Kaitchup:
Are LLMs Trained on the Benchmarks?
Llama 3: Not Good Quantized?
Predict Multiple Tokens at Once for Better and Faster LLMs
The Kaitchup now has 3,313 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (60+) and more than 100 articles.
The yearly subscription is now 35% off. This promotion is available until May 11th.
Are LLMs Trained on the Benchmarks?
LLMs are now pre-trained on trillions of tokens. It’s so much data that we are not far from using all the “good” data that the Internet has to offer.
Unfortunately, this often includes the data of the benchmarks used to evaluate LLMs. If an LLM has seen all or part of a benchmark's data during pre-training, it will obviously perform much better on that benchmark, and its results become useless. It’s like seeing the questions and/or answers of an exam before taking it. In an article for The Salt, I studied the impact of data contamination and showed that it is extremely easy to make an LLM perform very well on some benchmarks without degrading its performance on others:
Removing benchmark data from the training data is very difficult. Finding out whether an LLM has seen a benchmark during pre-training is also difficult. One of the most effective methods is to create an entirely new benchmark by hand and compare the results with those on a similar existing benchmark.
This is what this work has done:
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
For mathematical reasoning, almost all the recent LLMs were evaluated on the same benchmark: GSM8k. Some of them are extremely good on this benchmark. Did they see GSM8k during pre-training?
To answer this question, the authors made an alternative benchmark similar to GSM8k but with new mathematical problems not in GSM8k. They called this new dataset GSM1k.
They evaluated many LLMs on this new benchmark and compared the scores with the scores obtained on GSM8k:
Most LLMs have significantly lower scores on GSM1k. Phi-3 and Mixtral-8x22b show signs of data contamination. However, note that this is not enough evidence to conclude that they have been trained on GSM8k.
But it is further evidence that we can’t trust public benchmarks to decide whether one LLM is better than another. If you need to choose which LLM to use, ignore public benchmarks and run your own evaluation on your own data.
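To make this concrete, here is a minimal sketch of that kind of check. `ask_model`, `public_set`, and `private_set` are hypothetical placeholders for your own inference function and evaluation data; they are not part of the paper or any library.

```python
# Minimal sketch: exact-match accuracy on a public benchmark vs. a private,
# hand-made set of similar problems. A large gap hints at contamination,
# but it is not proof on its own.

def exact_match_accuracy(examples, ask_model):
    # examples: list of (question, expected_answer) pairs
    correct = sum(
        ask_model(question).strip() == expected.strip()
        for question, expected in examples
    )
    return correct / len(examples)

def contamination_gap(public_set, private_set, ask_model):
    # Positive gap = the model does better on the public benchmark
    # than on comparable problems it cannot have seen before.
    return exact_match_accuracy(public_set, ask_model) - exact_match_accuracy(
        private_set, ask_model
    )
```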
Llama 3: Not Good Quantized?
I saw this plot discussed and misinterpreted a lot on social networks this week:
It comes from the LocalLLaMA subreddit.
The plot shows the perplexity difference between the non-quantized and quantized versions of Llama 2 and Llama 3, on wikitext, for different quantization precisions.
How can we interpret the results?
The perplexity difference is larger for Llama 3 than for Llama 2. This means Llama 3 is more sensitive to quantization error, on this particular dataset. It does not mean that the quantized Llama 3 is worse than the quantized Llama 2.
The plot doesn’t show the perplexity scores of the models on wikitext. That would have been interesting because Llama 2 is actually significantly better (lower perplexity) than Llama 3 on this particular dataset.
We don’t know the pre-training data of either model, but we can be sure that Wikipedia is part of it, and wikitext was made from Wikipedia dumps. However, since Llama 3 was trained on many more tokens than Llama 2, Wikipedia represents a smaller share of its training data, so we can assume that it didn’t memorize Wikipedia as well as Llama 2 did.
To check whether quantized Llama 3 is really worse than quantized Llama 2, we should run the evaluation on several datasets, preferably from different domains and unseen during pre-training.
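If you want to run this kind of check yourself, here is a minimal sketch with Hugging Face Transformers. The checkpoint names and the `texts` list are placeholders you would swap in, and note that the Reddit plot was produced with llama.cpp quantizations, so this only illustrates the general recipe, not how that plot was made.

```python
# Minimal sketch: average perplexity of a model on a list of texts.
# Run it once for the non-quantized checkpoint and once for the quantized one.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def average_perplexity(model_name, texts, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    ).to(device)
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text, return_tensors="pt", truncation=True, max_length=2048
            ).to(device)
            # With labels, the model returns the causal LM loss (mean over tokens)
            loss = model(**inputs, labels=inputs["input_ids"]).loss
            losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))

# texts = ...  # ideally several datasets, different domains, unseen during pre-training
# gap = average_perplexity("path/to/quantized", texts) - average_perplexity("path/to/base", texts)
```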
Predict Multiple Tokens at Once for Better and Faster LLMs
A new paper proposes a method that simultaneously predicts multiple future tokens from each position in the training datasets:
Better & Faster Large Language Models via Multi-token Prediction
Multi-token prediction is not a new concept, but this work introduces a streamlined architecture that does not increase training time or memory consumption.
The research demonstrates that this training approach is particularly effective for larger models (up to 13 billion parameters), showing significant improvements on coding problems.
Additionally, the technique enables self-speculative decoding, i.e., speculative decoding without a separate draft model, which speeds up inference by up to three times across various batch sizes.
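For intuition, here is a minimal PyTorch sketch of the general idea: a shared trunk produces one hidden state per position, and several independent output heads each predict the token a fixed number of steps ahead. The module and function names are illustrative; this is not the paper's implementation.

```python
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    def __init__(self, hidden_size, vocab_size, n_future=4):
        super().__init__()
        # One unembedding-style head per future offset (1, 2, ..., n_future)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(n_future)]
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the shared trunk
        return [head(hidden_states) for head in self.heads]

def multi_token_loss(logits_per_head, input_ids):
    # Head i is trained with cross-entropy to predict the token i steps
    # ahead of each position.
    loss = 0.0
    for i, logits in enumerate(logits_per_head, start=1):
        targets = input_ids[:, i:]                 # tokens i steps ahead
        preds = logits[:, : targets.shape[1], :]   # drop positions with no target
        loss = loss + nn.functional.cross_entropy(
            preds.reshape(-1, preds.shape[-1]), targets.reshape(-1)
        )
    return loss / len(logits_per_head)
```

At inference time, only the next-token head is needed for standard decoding, while the extra heads can serve as the "draft" for self-speculative decoding.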
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the Kaitchup’s AI notebook(s) I have checked and updated, with a brief description of what I have done.
This week, I revised the notebook estimating the memory consumption of LLMs:
#64 Estimate the Memory Consumption for Fine-tuning and Running LLMs
I removed the memory consumption of dropout for inference, since dropout is only active during fine-tuning. The difference is not significant, but the estimation is now slightly more accurate. I also modified the equation estimating the memory consumption of the optimizer: in most implementations, Adam(W) keeps one float32 copy of the model’s parameters, which the notebook now takes into account. I also added the memory consumption of the gradients, which is equal to the memory consumption of the model.
Remember that the notebook only provides an estimation without considering any optimizations.
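As a rough, back-of-the-envelope version of the estimate described above (the function name and defaults below are mine, not the notebook's):

```python
# Back-of-the-envelope estimate for full fine-tuning with Adam(W).
# Ignores activations and all optimizations (gradient checkpointing, LoRA, paged optimizers, ...).

def estimate_finetuning_memory_gb(n_params, param_bytes=2):
    weights = n_params * param_bytes      # e.g., 2 bytes/parameter for bfloat16
    gradients = n_params * param_bytes    # same size as the model weights
    # Adam(W): two float32 states (momentum, variance) plus, in most
    # implementations, one float32 copy of the parameters
    optimizer = n_params * 4 * 3
    return (weights + gradients + optimizer) / 1024**3

# Example: a 7B-parameter model in bfloat16 -> roughly 104 GB before activations
# print(round(estimate_finetuning_memory_gb(7e9)))
```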
I have also updated the article:
Note: These modifications were suggested by . That’s a good occasion for me to recommend his YouTube channel, which provides deep dives into recent advances in AI.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Salt, I briefly reviewed:
Multi-Head Mixture-of-Experts
⭐SnapKV: LLM Knows What You are Looking for Before Generation
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Make Your LLM Fully Utilize the Context
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!