Hi Everyone,
In this edition of The Weekly Kaitchup:
Llama 3.1: Longer Context and Multilingual Llama 3
Mistral Large 2: As Good as Llama 3 405B, Really?
Free Fine-tuning for GPT-4o mini
If you are a free subscriber, consider upgrading to paid to access all the notebooks (80+) and more than 100 articles.
If you are looking for custom AI notebooks on request, priority support, or professional LLM services, have a look at The Kaitchup Pro:
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #89 Function Calling: Fine-tuning LLMs on xLAM -- Examples with Llama 3 and Qwen2
Llama 3.1: Longer Context and Multilingual Llama 3
Along with Llama 3 405B, Meta also released new versions of Llama 3 8B and 70B. They are named Llama 3.1, and you can find them here:
Meta published a new report describing the models:
The main differences from Llama 3 are:
Longer context: These new versions have been post-trained on very long sequences of 128k tokens. As a result, they handle contexts of up to 128k tokens without an accuracy drop. If you are considering using very long contexts, I suggest quantizing your KV cache; otherwise, handling 128k tokens would consume a large amount of memory:
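To see why the KV cache matters at 128k tokens, here is a back-of-the-envelope estimate using the published configuration of Llama 3.1 8B (32 layers, 8 KV heads thanks to grouped-query attention, head dimension 128). This is a rough sketch of the cache size alone, ignoring weights and activations:

```python
# Rough KV-cache memory estimate for Llama 3.1 8B at a 128k-token context.
# Config values from the model's config.json: 32 layers, 8 KV heads (GQA),
# head dimension 128. The cache stores one key and one value vector per
# token, per layer, per KV head.

def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2.0):
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return elems * bytes_per_elem / 2**30

fp16 = kv_cache_gib(128 * 1024, bytes_per_elem=2.0)   # fp16: 2 bytes/element
int4 = kv_cache_gib(128 * 1024, bytes_per_elem=0.5)   # 4-bit: 0.5 bytes/element
print(f"fp16 cache: {fp16:.1f} GiB, 4-bit cache: {int4:.1f} GiB")
# → fp16 cache: 16.0 GiB, 4-bit cache: 4.0 GiB
```

So a single 128k-token sequence adds roughly 16 GiB on top of the model weights in fp16, which 4-bit cache quantization cuts to about 4 GiB.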
Multilingual: Llama 3.1 officially supports German, French, Italian, Portuguese, Hindi, Spanish, and Thai. I don’t know why they chose these languages, especially Thai, which is a very difficult language to model and is considered a low-resource language.
Function calling: Meta also trained the models for function calling. For this purpose, they modified the tokenizer, replacing some of the existing special tokens. You will find examples in this blog post by Hugging Face.
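The mechanics of function calling are the same regardless of the model: the LLM emits a structured call (a function name plus arguments, typically as JSON), and your code executes it and feeds the result back. Here is a minimal sketch of that dispatch step with a hypothetical get_weather tool; the exact prompt and special-token format Llama 3.1 uses is what the Hugging Face post documents, so the model output below is only a stand-in:

```python
import json

# Hypothetical tool the model is allowed to call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Pretend the model produced this structured call. (Llama 3.1's real
# output is wrapped in special tokens; see the Hugging Face blog post.)
model_output = '{"name": "get_weather", "parameters": {"city": "Paris"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["parameters"])
print(result)  # → Sunny in Paris
```

In a full loop, `result` would be appended to the conversation as a tool message and the model queried again to produce the final answer.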
More good news: the new Llama 3.1 license allows us to use Llama 3.1 to improve other LLMs. For instance, you can now distill Llama 3.1 to train smaller models. The only constraint is that the derived model’s name must start with “Llama”.
According to Meta, these new models are better on public benchmarks:
As usual, take these results with a pinch of salt; we can’t reproduce them. For the other models, such as Gemma and Nemotron, Meta simply copied the results published in other papers. Most of these numbers are not directly comparable since they were obtained with different hyperparameters, prompt templates, few-shot examples, etc.
Meta also plans to release multimodal versions of Llama 3 later. These versions will have vision and audio adapters. However, according to Yann LeCun, they may not be available in Europe.
Mistral Large 2: As Good As Llama 3 405B, Really?
While we knew that Llama 3 405B would be released this week, Mistral AI was quietly preparing its own release: Mistral Large 2.
It is a 123 billion parameter model, i.e., 3.3x smaller than Llama 3 405B. We find more or less the same capabilities in both models: function calling, support for other languages, context length of 128k, etc.
In its blog post, Mistral AI claims that Mistral Large 2 performs on par with Llama 3 405B.
They conducted the first “third-party” evaluation of Llama 3 405B and, interestingly but not surprisingly, obtained different results from those published by Meta.
On average, for code generation tasks, Mistral AI observed a drop of 1 point in accuracy compared to the results published by Meta (“measured” vs. “paper” in the table below). For some languages, the difference reaches 5 points…
Mistral Large 2 seems to perform very well on multilingual benchmarks. In the following results, we can also observe that Llama 3 405B performs well for languages it doesn’t officially support (Dutch, Russian, Japanese, and Chinese).
Given that Mistral Large 2 performs close to Llama 3 405B while being much smaller, I would recommend using it instead of Llama 3 405B, especially if you want to fine-tune it.
However, you can only do so for research purposes: commercial use is forbidden by Mistral AI, and the license is particularly unclear about what we are allowed to do with Mistral Large 2. Since I’m not sure whether I can write tutorials based on it, I won’t publish articles about it.
Free Fine-tuning for GPT-4o mini
GPT-4o mini is fast, capable, and cheap, and you can make it even better by fine-tuning it on your own data.
Fine-tuning OpenAI models is possible but usually very expensive, especially the preliminary experiments required to find good hyperparameters.
OpenAI announced that fine-tuning GPT-4o mini is free until September 23, 2024, for up to 2M training tokens per day:
“Fine-tuning for GPT-4o mini is free up to a daily token limit through September 23, 2024. Each qualifying org gets up to 2M complimentary training tokens daily and any overage will be charged at the normal rate of $3.00/1M tokens.” (source)
You won’t go far with only 2M tokens/day, but in my opinion it’s a very good opportunity to test OpenAI’s fine-tuning API, and its hyperparameters, for free while you can.
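If you want to try it, OpenAI’s fine-tuning API expects training data as a JSONL file with one chat-formatted example per line. Here is a minimal sketch of preparing such a file; the example messages are placeholders, and the commented-out upload/launch calls at the end are the standard `openai` package calls, which require an API key:

```python
import json

# Each training example is a short chat: system/user/assistant messages.
# Replace these placeholder messages with your own data.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then upload the file and launch the job (requires an OpenAI API key):
#   from openai import OpenAI
#   client = OpenAI()
#   file = client.files.create(file=open("train.jsonl", "rb"),
#                              purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=file.id,
#                                  model="gpt-4o-mini-2024-07-18")
```

Token usage is counted over your training file times the number of epochs, so a small, clean dataset goes a long way within the 2M/day budget.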
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I discussed why we can’t use perplexity to compare different LLMs.
I also reviewed:
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
⭐Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!