Hi Everyone,
In this edition of The Weekly Kaitchup:
Jamba 1.5: Two New Hybrid Transformer/SSM Models with 52B and 398B Parameters
End-to-end FP8 Pre-training
The Unexpected Impact of Code in Pre-training Data
If you are a free subscriber, consider upgrading to paid to access all the notebooks (90+) and more than 150 articles.
If you are looking for custom AI notebooks, priority support, or professional LLM services, have a look at The Kaitchup Pro:
AI Notebooks and Articles Published this Week by The Kaitchup
Notebook: #96 Fine-tuning Llama 3.1 Quantized with AQLM, HQQ, GPTQ, and AutoRound -- Code and Training Logs
Notebook: #97 Fine-tuning Phi-3.5 MoE and Mini -- With Code for AutoRound and Bitsandbytes Quantization
Jamba 1.5: Two New Hybrid Transformer/SSM Models with 52B and 398B Parameters
Jamba is a hybrid decoder architecture. It integrates:
Transformer layers with state-space Mamba layers
A mixture-of-experts (MoE)
This is a “Jamba block”:
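In code terms, a minimal schematic PyTorch sketch of a Jamba-style block might look like this. The layer count, dimensions, attention position, and the simple top-2 router are my own illustrative assumptions, and the SSM layer comes from the `mamba-ssm` package; this is not AI21's implementation:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (needs a CUDA GPU)


class Top2MoE(nn.Module):
    """Toy mixture-of-experts MLP with token-level top-2 routing."""
    def __init__(self, d_model, n_experts=8, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (batch, seq, d_model)
        weights, idx = self.router(x).softmax(-1).topk(2, dim=-1)
        out = torch.zeros_like(x)
        for k in range(2):                                 # send each token to its top-2 experts
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


class JambaBlockSketch(nn.Module):
    """Mostly Mamba layers, one attention layer, and MoE MLPs on every other layer."""
    def __init__(self, d_model=1024, n_layers=8, n_heads=16):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            mixer = (nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                     if i == n_layers // 2                 # illustrative position for attention
                     else Mamba(d_model=d_model))
            mlp = Top2MoE(d_model) if i % 2 else nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
            self.layers.append(nn.ModuleList([nn.LayerNorm(d_model), mixer,
                                              nn.LayerNorm(d_model), mlp]))

    def forward(self, x):
        for norm1, mixer, norm2, mlp in self.layers:
            h = norm1(x)
            if isinstance(mixer, nn.MultiheadAttention):   # causal mask omitted for brevity
                h = mixer(h, h, h, need_weights=False)[0]
            else:
                h = mixer(h)
            x = x + h                                      # residual around the mixer
            x = x + mlp(norm2(x))                          # residual around the (MoE) MLP
        return x
```

AI21's actual models scale this pattern up considerably; the sketch is only meant to show how the three ingredients fit together.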
AI21 Labs has just released the new Jamba 1.5. It includes two MoEs: Mini, with 12B active parameters out of a total of 52B, and Large, with 94B active parameters out of a total of 398B.
The license is a custom license. Commercial use is allowed unless your company generates more than $50 million in annual revenue.
Jamba 1.5 is optimized for long-context retrieval-augmented generation (RAG) tasks, with support for a context of up to 256K tokens. This makes it well suited to applications requiring extensive contextual understanding.
The models also support nine languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Jamba 1.5 is optimized for function calling and structured output, particularly in JSON format. Additionally, a new quantization technique, ExpertsInt8, has been introduced.
In terms of performance, Jamba 1.5 delivers up to 2.5x faster inference in long-context scenarios than standard decoder-only models.
For instance, the Mini version can handle sequences of up to 140k tokens on a single A100 80GB GPU. This would be impossible with a standard transformer LLM, even if you quantize the KV cache.
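As a rough illustration of why, here is a back-of-the-envelope KV-cache estimate for a hypothetical dense transformer of comparable size using full multi-head attention; the hyperparameters below are my own assumptions, not any particular model's:

```python
# KV-cache size for a hypothetical dense ~50B transformer at 140k tokens.
n_layers, n_kv_heads, head_dim = 64, 64, 128   # assumed hyperparameters (full MHA, no GQA)
seq_len, bytes_per_value = 140_000, 2          # 140k tokens, bf16 cache (2 bytes per value)

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value  # K and V
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")    # ~294 GB in bf16, ~73 GB even at 4-bit
```

Grouped-query attention shrinks the cache considerably, but the bf16 weights of a ~50B dense model alone (~100 GB) already exceed the 80 GB budget. Jamba's Mamba layers keep a constant-size state and avoid most of this cache entirely.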
As for accuracy, the Mini seems to perform similarly to Gemma 2 9B, while the Large is on par with Llama 3.1 70B. Given the large size of these models, I would use them only for applications requiring very long context processing.
I reviewed the first version of Jamba in this article for The Salt:
source: The Jamba 1.5 Open Model Family: The Most Powerful and Efficient Long Context Models
End-to-end FP8 Pre-training
When training LLMs with full precision, the weights, gradients, and optimizer states are typically stored as float32 parameters, each occupying 4 bytes of memory. For every model parameter, we have a weight, a gradient, and two optimizer parameters (in the case of AdamW). This adds up to 16 bytes per parameter. For a model with 10 billion parameters, pre-training would require at least 160 GB of GPU memory—excluding the memory needed for activations.
With mixed-precision training, the weights and gradients are stored as bfloat16, reducing their memory footprint to 2 bytes each, while the optimizer states remain as float32. This reduces the memory requirement for the same 10 billion parameter model to 120 GB. Further optimization can be achieved by quantizing the optimizer states to 8-bit (1 byte) using the FP8 data type, which brings the memory usage down to just 60 GB.
The weights and gradients remain in bfloat16, as quantizing them to 8-bit often leads to instability and divergence during pre-training. If this challenge could be overcome, the memory consumption could be further reduced to 40 GB.
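A quick sanity check of this arithmetic (a minimal sketch that, like the figures above, ignores activation memory and any fp32 master copy of the weights):

```python
def pretraining_memory_gb(n_params, weight_bytes, grad_bytes, optim_state_bytes, n_states=2):
    """Memory for weights + gradients + AdamW's two optimizer states, in GB."""
    return n_params * (weight_bytes + grad_bytes + n_states * optim_state_bytes) / 1e9

N = 10e9  # a 10B-parameter model
print(pretraining_memory_gb(N, 4, 4, 4))  # full float32            -> 160.0 GB
print(pretraining_memory_gb(N, 2, 2, 4))  # bf16 weights/gradients  -> 120.0 GB
print(pretraining_memory_gb(N, 2, 2, 1))  # + FP8 optimizer states  ->  60.0 GB
print(pretraining_memory_gb(N, 1, 1, 1))  # end-to-end FP8          ->  40.0 GB
```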
Hugging Face appears to have developed a stable method for end-to-end FP8 training, potentially offering a solution to these challenges. In a nutshell:
Slowing down training before instability can arise and minimizing outlier features in the activations
How this is done is quite involved and would take a long article to explain. You can find the details of HF's recipe on X.
End-to-end FP8 training will be implemented in Nanotron, another HF framework.
The Unexpected Impact of Code in Pre-training Data
This is a finding by Cohere: Removing code from the pre-training data significantly degrades the LLM’s performance in reasoning and language generation tasks.
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Including code in the pre-training data of LLMs has become common practice to enhance their code generation abilities, and this inclusion may well contribute to the overall improvements observed in recent LLMs. However, the effect of incorporating programming languages like Python, Java, and C++ on the general language generation capabilities of LLMs had not been thoroughly studied before.
Compared to text-only pre-training, the addition of code results in up to an 8.2% relative increase in natural language (NL) reasoning, a 4.2% increase in world knowledge, a 6.6% improvement in generative win rates, and a 12x boost in code performance.
I initially expected that removing code from the pre-training data would enhance an LLM's language generation capabilities by allowing it to focus more on natural language.
Moreover, given that adding code actually improves pre-training, it's plausible that incorporating high-quality code into the fine-tuning data could also further boost performance. This is an area that warrants further investigation.
I'll write a full review of Cohere’s paper for The Salt, hopefully by next week.
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range (e.g., the RTX 4060) to high-end (e.g., the RTX 4090).
While consumer GPUs have much less memory than GPUs dedicated to AI, they are far more cost-effective for inference with small batches and for fine-tuning LLMs of up to ~35B parameters with PEFT methods.
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
All prices decreased this week!
RTX 4090 (24 GB): GIGABYTE GV-N4090AERO OC-24GD GeForce RTX 4090 AERO (Changed for a cheaper model: $1,699.00 (-$30.00); last week: $1,729.00)
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2 (Changed for a cheaper model: $999.99 (-$27.20); last week: $1,027.19)
RTX 4070 Ti SUPER (16 GB): ZOTAC GAMING GeForce RTX 4070 Ti SUPER Trinity Black Edition ($789.99 (-$10.00); last week: $799.99)
RTX 4060 Ti (16 GB): ZOTAC Gaming GeForce RTX 4060 Ti 16GB AMP DLSS 3 (Changed for a cheaper model: $424.99 (-$35.00); last week: $459.99)
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
FuseChat: Knowledge Fusion of Chat Models
⭐Heavy Labels Out! Dataset Distillation with Label Space Lightening
Layerwise Recurrent Router for Mixture-of-Experts
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (there is a 20% discount for group subscriptions):
Have a nice weekend!