Hi Everyone,
In this edition of The Weekly Kaitchup:
Yi-1.5: Better than Llama 3 8B?
Phi-3 “small” and Phi-3 “medium”
Mistral 7B v0.3 with Function Calling
The Kaitchup now has 3,675 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (70+) and more than 100 articles.
I’m offering a 35% discount for the yearly subscription to The Salt, my other AI newsletter. Available until May 28th.
Yi-1.5: Better than Llama 3 8B?
01.AI released a new version of its Yi LLMs:
Hugging Face collection: Yi-1.5 (2024/05)
Yi-1.5 is Yi but further pre-trained on 500B additional tokens, according to the model card:
it is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples.
I don’t know what is in these 500B tokens, but they significantly boosted Yi’s performance, to the point that Yi-1.5 9B outperforms Llama 3 8B. The 6B version also outperforms Llama 3 8B on some benchmarks:
I can confirm that the models perform very well. However, I’m unsure whether they are better than Llama 3. As usual, it depends on the target tasks.
01.AI plans to publish an updated technical report documenting how Yi-1.5 was trained. Meanwhile, you can check my review of Yi here:
The notebook showing how to quantize, fine-tune, and run Yi LLMs (also works for Yi-1.5) is here:
Phi-3 “small” and Phi-3 “medium”
Almost one month ago, Microsoft released Phi-3 “mini”, a 3.8B parameter LLM. At that time, they also mentioned that larger versions would be released later.
These larger versions, Phi-3 “small” and Phi-3 “medium”, are now available on the Hugging Face Hub (MIT license). They have 7B and 14B parameters, respectively:
According to Microsoft’s evaluation of the models, they perform extremely well.
Phi-3 “medium” outperforms GPT-3.5-Turbo and Command-R+, a 104B parameter model. Moreover, it performs close to Llama 3 70B Instruct.
Assuming that Microsoft conducted all the necessary checks to ensure that Phi-3’s training data were not contaminated by the benchmarks, these results are quite surprising.
In an upcoming article, I’ll review Phi-3 medium, including its memory consumption and the performance of its quantized version. Meanwhile, you can read my review of Phi-3 “mini” here:
Mistral 7B v0.3 with Function Calling
Mistral AI released v0.3 of Mistral 7B:
This new version adds 768 tokens to the vocabulary of the original Mistral 7B (32,000 tokens, extended to 32,768).
Here are some of the added tokens:
The remaining added tokens are “[control_X]” special tokens that are not trained.
All these new tokens were added to support function calling with Mistral 7B. It should now be much easier to use Mistral 7B for calling external tools/APIs.
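As a rough illustration of what this enables, here is a minimal sketch of a tool-dispatch loop. Note the assumptions: the exact prompt format is defined by Mistral’s tokenizer, and the `get_weather` tool and the parsing below are simplified, hypothetical examples, not Mistral AI’s reference implementation.

```python
import json

# Hypothetical tool the model may call; the name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch_tool_calls(model_output: str):
    """Parse a response that starts with Mistral's [TOOL_CALLS] special token
    followed by a JSON list of calls, and run each requested tool."""
    prefix = "[TOOL_CALLS]"
    assert model_output.startswith(prefix), "not a tool-call response"
    calls = json.loads(model_output[len(prefix):])
    results = []
    for call in calls:
        fn = TOOLS[call["name"]]                # look up the requested tool
        results.append(fn(**call["arguments"]))  # run it with the model's arguments
    return results

# A response in the (simplified) shape the model is trained to emit:
response = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"city": "Paris"}}]'
print(dispatch_tool_calls(response))  # ['Sunny in Paris']
```

In practice, the tool schemas are passed to the model inside the `[AVAILABLE_TOOLS]` section of the prompt, and the tool results are fed back after a `[TOOL_RESULTS]` token so the model can compose its final answer.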
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of The Kaitchup’s notebooks and articles I have checked and updated, with a brief description of what I have done.
This week, I updated the article and notebook comparing 8-bit, 4-bit, 3-bit, and 2-bit quantizations for Llama 3 8B.
All the plots have been updated. They show that while GPTQ 4-bit doesn’t perform well with Llama 3 8B, bitsandbytes NormalFloat4 (NF4) is much more accurate and keeps the accuracy of 4-bit Llama 3 8B above that of 4-bit Llama 2 7B.
That’s good news for QLoRA users.
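To give an intuition of how NF4 works, here is a minimal NumPy sketch of the scheme described in the QLoRA paper: absmax-scale a block of weights into [-1, 1], snap each value to the nearest of the 16 NF4 levels (a 4-bit code), then dequantize. This is a simplified illustration of the idea, not bitsandbytes’ actual implementation (which also uses block-wise scaling and double quantization).

```python
import numpy as np

# The 16 NF4 levels from the QLoRA paper, chosen to be information-theoretically
# optimal for normally distributed weights.
NF4_LEVELS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_quantize(weights: np.ndarray):
    """Absmax-scale the block to [-1, 1], then map each value to the
    index of the nearest NF4 level (a 4-bit code)."""
    absmax = np.abs(weights).max()
    scaled = weights / absmax
    codes = np.abs(scaled[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), absmax

def nf4_dequantize(codes: np.ndarray, absmax: float) -> np.ndarray:
    """Recover an approximation of the weights from the codes and the scale."""
    return NF4_LEVELS[codes] * absmax

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64)   # a block of roughly normal LLM weights
codes, absmax = nf4_quantize(w)      # 4-bit codes + one fp scale per block
w_hat = nf4_dequantize(codes, absmax)
print("max abs error:", np.abs(w - w_hat).max())
```

Because the levels are matched to the roughly normal distribution of LLM weights, the reconstruction error stays small without any calibration data, which is part of why NF4 (and hence QLoRA) holds up so well here.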
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
I reviewed DeepSeek-V2, a huge MoE model with numerous experts. I also showed how to fine-tune DeepSeek-V2 Lite:
This week in The Salt, I briefly reviewed:
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
⭐Understanding the performance gap between online and offline alignment algorithms
LoRA Learns Less and Forgets Less
⭐RLHF Workflow: From Reward Modeling to Online RLHF
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!