Hi Everyone,
In this edition of The Weekly Kaitchup:
Qwen2.5: New Small and Large State-of-the-art Language Models
A Cheaper Way to Make Ternary LLMs
Moshi: Accurate and Fast Speech Models for Real-time Dialogues
My book “LLMs on a Budget” is available for pre-order. Get all the details and a 50% discount here:
I’m planning to publish the first chapter around the 15th of October.
The Kaitchup now has more than 5,000 subscribers. If you are a free subscriber, consider upgrading to paid to access all the notebooks (100+) and more than 150 articles.
Note: Consider switching to a yearly subscription if you are a monthly subscriber. It is 38% cheaper.
Qwen2.5: New Small and Large State-of-the-art Language Models
Alibaba released Qwen2.5. The models are available in many sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, as base and instruct models:
Hugging Face collection: Qwen2.5
They also released Coder and Math versions.
With all these sizes, Qwen2.5 scales from low-end consumer GPUs to high-end professional GPUs, offering flexibility across various hardware. The 7B and 14B models are ideal for 24 GB GPUs, and they support efficient fine-tuning with LoRA and QLoRA. For full fine-tuning, smaller models are a better fit. The 32B model may also be fine-tuned on a 24 GB GPU using QLoRA, though this depends on its activation size—I’m still confirming this.
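For reference, here is a minimal QLoRA setup sketch for loading Qwen2.5 7B in 4-bit on a 24 GB GPU. The LoRA rank, target modules, and other hyperparameters below are illustrative assumptions, not a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-7B"

# 4-bit NF4 quantization so the 7B model fits comfortably in 24 GB for training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Illustrative LoRA configuration; tune r, alpha, and target_modules for your task
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```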
I’m currently training Minivoc versions for the 1.5B and 7B models, which will make fine-tuning Qwen2.5 4x to 8x cheaper. I’m aiming to release them by Monday!
Qwen2.5 1.5B Minivoc will be distributed with an Apache 2.0 license, while the 7B version will be available only to The Kaitchup Pro subscribers for at least a few weeks.
If you want to use the instruct versions, i.e., if you are not interested in fine-tuning Qwen2.5, the 0.5B, 1.5B, 3B, and 7B models will run on a 24 GB GPU, for instance with vLLM, as long as you don’t set max_model_len too high, as we discussed in this article:
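As a quick illustration, here is roughly what that looks like with vLLM for Qwen2.5-7B-Instruct. The max_model_len and sampling values are assumptions; adjust them for your GPU and workload:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Capping max_model_len limits the KV cache so the model fits in 24 GB
llm = LLM(model=model_id, max_model_len=8192, gpu_memory_utilization=0.9)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain LoRA in one paragraph."}],
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
```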
Note that the 3B and 72B models are the only ones not distributed with an Apache 2.0 license. The 3B version can only be used for research purposes while the 72B model can’t be used if “your product or service has more than 100 million monthly active users”.
How good are the models?
According to public benchmarks, they are the best open LLMs that you can find. They largely outperform Qwen2, Gemma 2, and Llama 3.1.
I trust these benchmarks to be accurate: the Qwen team has consistently delivered models whose real-world performance aligns closely with their reported results.
I’ll publish a tutorial next week, probably Monday, showing how to fine-tune Qwen2.5. Thursday’s article will be about Transformers.js!
source: Qwen2.5: A Party of Foundation Models!
A Cheaper Way to Make Ternary LLMs
BitNet is a specialized transformer architecture introduced by Microsoft Research that represents each parameter with just three values: -1, 0, and 1. This approach results in a model using only 1.58 bits per parameter (since log₂(3) ≈ 1.58), significantly cutting down on memory usage and computational costs.
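To make the idea concrete, here is a toy illustration of the absmean ternarization used by BitNet b1.58. This is my own sketch of the quantization step, not Microsoft’s training code:

```python
import torch

def ternarize(w: torch.Tensor):
    # Scale by the mean absolute value, then round each weight to -1, 0, or 1
    scale = w.abs().mean()
    w_ternary = (w / (scale + 1e-8)).round().clamp(-1, 1)
    return w_ternary, scale

w = torch.randn(4, 4)
w_q, scale = ternarize(w)
print(w_q)          # entries are only -1, 0, or 1
print(scale * w_q)  # dequantized approximation of w
# Each entry carries log2(3) ≈ 1.58 bits of information, hence the name
```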
While BitNet requires training a model from scratch, an article by Hugging Face explores techniques to fine-tune existing pre-trained models to achieve 1.58-bit quantization.
Fine-tuning LLMs to 1.58bit: extreme quantization made easy
It’s a long article, but I think Hugging Face did a great job of making it accessible.
By employing strategies like gradual quantization and dynamic adjustment of training parameters, the authors successfully fine-tuned a Llama 3 8B model using the BitNet architecture. The fine-tuned model demonstrated strong performance on downstream tasks, even surpassing LLaMA 7B on benchmarks such as MMLU. Not bad for a model whose parameters are only -1, 0, or 1!
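To give a feel for what gradual quantization can look like, one simple scheme blends full-precision and ternarized weights with a schedule that ramps up during training. The toy lambda_ schedule below is my own illustration; see the Hugging Face article for the exact recipe they used:

```python
import torch

def ternarize(w):
    scale = w.abs().mean()
    return scale * (w / (scale + 1e-8)).round().clamp(-1, 1)

def blended_weight(w, step, warmup_steps):
    # lambda_ = 0 -> pure full precision, lambda_ = 1 -> pure ternary
    lambda_ = min(1.0, step / warmup_steps)
    return (1 - lambda_) * w + lambda_ * ternarize(w)

w = torch.randn(8, 8)
for step in (0, 500, 1000):
    n = blended_weight(w, step, warmup_steps=1000).unique().numel()
    print(f"step {step}: {n} distinct weight values")
```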
The article also tackles the implementation of custom kernels and benchmarks to optimize inference speed.
I think the method is still a bit too difficult to reproduce, but it is definitely a step toward better and more accessible ternary LLMs.
Moshi: Accurate and Fast Speech Models for Real-time Dialogues
Speech modeling is also making steady progress, and I think Moshi is a significant achievement.
Moshi is a speech-text model and a full-duplex spoken dialogue framework that leverages Mimi, a state-of-the-art streaming neural audio codec. Mimi compresses 24 kHz audio into a 12.5 Hz representation with a bandwidth of just 1.1 kbps, outperforming existing non-streaming codecs. It operates fully in streaming mode with an 80 ms latency and seems to maintain high audio quality despite its low bitrate.
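Those numbers are easy to sanity-check: if we assume 8 codebooks of 2048 entries per 12.5 Hz frame, the bitrate works out to 1.1 kbps.

```python
import math

frame_rate_hz = 12.5
num_codebooks = 8          # assumption: 8 residual codebooks
codebook_size = 2048       # assumption: 11 bits per codebook entry

bits_per_frame = num_codebooks * math.log2(codebook_size)  # 8 * 11 = 88 bits
bitrate_bps = frame_rate_hz * bits_per_frame               # 1,100 bps
print(f"{bitrate_bps / 1000:.1f} kbps")

# Compared to raw 24 kHz, 16-bit mono PCM:
pcm_bps = 24_000 * 16
print(f"compression ratio: ~{pcm_bps / bitrate_bps:.0f}x")
```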
How does it work?
In Moshi, two audio streams are modeled simultaneously: one for the user and one for Moshi itself. During inference, the user's audio stream is captured from the input, while Moshi's audio is generated by sampling from the model's output.
Alongside these audio streams, Moshi predicts text tokens corresponding to its own speech—its "inner monologue"—which enhances the quality of its responses. The framework uses a small Depth Transformer to model inter-codebook dependencies at each time step and a larger 7-billion-parameter Temporal Transformer to handle temporal dependencies.
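To give a feel for this two-level decomposition, here is a small runnable toy. It is my own simplification, not the released Moshi code, and the real model also predicts the text token within each frame, which I skip here:

```python
import torch
import torch.nn as nn

D, NUM_CODEBOOKS, VOCAB = 64, 8, 2048  # toy sizes, not Moshi's real dimensions

# Large transformer over time steps (frames), small one over codebooks within a frame
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2
)
depth = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=1
)
code_embed = nn.Embedding(VOCAB, D)   # embeds a sampled code for the next depth step
heads = nn.ModuleList(nn.Linear(D, VOCAB) for _ in range(NUM_CODEBOOKS))

# 10 past frames (user + Moshi streams already fused into one vector per frame)
past_frames = torch.randn(1, 10, D)
context = temporal(past_frames)[:, -1:, :]    # summary of the dialogue so far

# Depth transformer: predict the current frame's codebooks one after another
codes, depth_input = [], context
for head in heads:
    h = depth(depth_input)[:, -1, :]          # hidden state for the next codebook
    code = head(h).argmax(-1)                 # greedy pick, just for illustration
    codes.append(code)
    depth_input = torch.cat([depth_input, code_embed(code).unsqueeze(1)], dim=1)

print(torch.stack(codes, dim=-1))             # 8 audio codes for the current frame
```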
The models are on Hugging Face and the code is on GitHub:
You can also try it thanks to this demo (but there is a queue…):
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range (e.g., RTX 4060) to high-end (e.g., RTX 4090).
While consumer GPUs have much less memory than GPUs dedicated to AI, they are more cost-effective, by far, for inference with small batches and fine-tuning LLMs with up to ~35B parameters using PEFT methods.
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
Bad timing for buying an RTX 4090: its price has increased again!
RTX 4090 (24 GB): ASUS TUF Gaming GeForce RTX™ 4090 OG ($1,819.99 (+$40.00); switched to a cheaper card than last week)
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2 ($999.99, same card as last week)
RTX 4070 Ti SUPER (16 GB): MSI Gaming RTX 4070 Ti Super 16G AERO ($789.99 (+$10.99); switched to a cheaper card than last week)
RTX 4060 Ti (16 GB): MSI Gaming GeForce RTX 4060 Ti 16GB GDRR6 Extreme Clock ($429.99, same card as last week)
Failures and Disappointments
I spend a lot of time experimenting with new frameworks, techniques, and models. The reality is that many times, things don't go as planned—whether it's a buggy framework, overly rigid implementations, or models that fall short of their claims. In this section, I share what I’ve tried this week that didn’t work out as expected.
FLUTE: I introduced FLUTE in last week’s Weekly Kaitchup and was excited to test it to speed up my QLoRA fine-tuning. Unfortunately, I haven’t been able to get it running. I tested it on Google Colab’s L4, but FLUTE mistakenly identified it as an RTX 4090 GPU, leading to failures during model preparation. It also seems that models need to be fully loaded onto some devices before quantization, adding more complexity.
X-LoRA: This was part of the latest update to PEFT. X-LoRA fine-tunes a combination of LoRA adapters, functioning like a mixture of experts, with each "expert" being a LoRA adapter. However, X-LoRA training isn’t functional yet in PEFT. I’m working directly with the author to troubleshoot, and once the training works, I’ll publish my article on X-LoRA, which is almost ready.
Online DPO: A powerful technique for aligning LLMs with human preferences, supported by TRL, but it's still quite challenging to use. It currently lacks support for QLoRA/LoRA and chat templates. Hugging Face is actively updating it, and I hope to share more positive results in the near future. Note: This morning, HF released a TRL update introducing LoRA fine-tuning support for online DPO.
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Salt, I reviewed:
⭐Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% (or 30% for Pro subscribers) discount for group subscriptions):
Have a nice weekend!