Behind the Hype: Models based on T5 (2019) Still Better than Vicuna, Alpaca, MPT, and Dolly
A new study shows that there hasn’t been much progress behind the recent surge of chat models.
A research team from Alibaba and the Singapore University of Technology and Design has recently released a new leaderboard for instruction-tuned large language models (LLMs):
Scientific paper: INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Chia et al., 2023)
All the recently released chat models belong to this class of models: Vicuna, Alpaca, Dolly, and ChatGPT.
The results on benchmarks for “problem solving” are very interesting:
ChatGPT is the best on average. But if you look at the third rank, you’ll see Flan-T5: a base model (T5) released in 2019 and fine-tuned with instructions to become Flan-T5.
Flan-T5 outperforms all the LLaMA- and OPT-based models, which are billions of parameters bigger.
This is the first time we see such a comparison, because recently published chat models are usually only compared with other recent ones, e.g., Vicuna versus Alpaca.
Thanks to Chia et al.’s work, we now have a much more complete overview of the state of the art.
Why Does Flan-T5 Completely Fail at Coding?
Flan-T5 is great, on average. But it completely fails the HumanEval coding benchmark with a 0.0 score.
Why?
Because the base model, T5, doesn’t have in its vocabulary all the tokens needed to write code. It is missing, for instance, <, {, and }. Without them, T5 can’t generate code that compiles in most programming languages.
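You can check this yourself with the Hugging Face tokenizer. Below is a minimal sketch (the checkpoint and the test snippet are my own choices, not taken from the paper):

```python
from transformers import AutoTokenizer

# The Flan-T5 checkpoints reuse T5's original 32k SentencePiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

snippet = "if (x < 0) { return 1; }"
ids = tokenizer(snippet, add_special_tokens=False).input_ids

# Characters absent from the vocabulary are mapped to the <unk> token,
# so they don't survive an encode/decode round trip.
print(tokenizer.decode(ids))
print(tokenizer.unk_token_id in ids)  # True: at least the braces are lost
```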
Why these characters are missing is beyond my understanding. When Google preprocessed the datasets to train the SentencePiece model used for pre-training T5, it didn’t preserve these characters for some reason.
Conclusion
If you are looking for an open LLM that is good at problem-solving (but not at coding), choose a Flan-T5 model! It’s smaller and better than more recent ones.
T5-based models are also a good choice for commercial applications. They don’t fall into the grey area of models released under an Apache 2.0 license (Alpaca) but based on a model that can’t be used commercially (LLaMA).
You can find some Flan-T5 models on the Hugging Face Hub. Here is Flan-T5 XXL.
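As a quick start, here is a minimal inference sketch with the transformers library (the prompt is my own toy example; a smaller checkpoint such as google/flan-t5-base behaves the same way if XXL doesn’t fit on your hardware):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xxl"  # ~11B parameters; try flan-t5-base on modest hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = "Answer the following question. Which country is Mount Everest in?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```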