Hi Everyone,
In this edition of The Weekly Kaitchup:
Qwen2.5 Coder 32B: The Best LLM for Coding Tasks?
More Training Tokens => More Difficult to Quantize!
I've also updated the Qwen2.5 toolbox repository, adding comments to most notebooks and introducing two new ones for DPO full training and ORPO with LoRA.
Next week, I'll be releasing the Llama 3 toolbox.
Qwen2.5 Coder 32B: The Best LLM for Coding Tasks?
Qwen2.5-Coder is a series of code-specific LLMs available in six sizes, ranging from 0.5B to 32B parameters. The 32B version was released this week.
Hugging Face Collection: Qwen2.5-Coder
The model is trained on a large dataset of 5.5 trillion tokens that includes source code, text-code grounding, and synthetic data.
It supports context lengths of up to 128,000 tokens and, according to evaluations published by the Qwen team, it outperforms other open models and GPT-4o on coding tasks:
Some details of the architecture of the 32B model:
Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
Number of Parameters: 32.5B
Number of Parameters (Non-Embedding): 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Vocabulary size: 152,064
The model is distributed under the Apache 2.0 license.
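If you want to double-check these numbers, most of them are exposed in the model's configuration. Here is a minimal sketch with Hugging Face Transformers; the attribute names follow the Qwen2 configuration class, and the expected values in the comments are the ones listed above:

```python
from transformers import AutoConfig

# Load only the configuration (no model weights are downloaded)
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

print("Hidden layers:", config.num_hidden_layers)             # expected: 64
print("Query heads:", config.num_attention_heads)             # expected: 40
print("Key/value heads (GQA):", config.num_key_value_heads)   # expected: 8
print("Vocabulary size:", config.vocab_size)                  # expected: 152,064
```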
I recommend reading the technical report, especially the sections discussing the training data, post-training, and training policy:
Qwen2.5-Coder Technical Report
The model card shows how to run the model with Hugging Face Transformers.
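For reference, here is a minimal inference sketch in the spirit of the model card's example. It assumes you have enough GPU memory to host the 32B model in bfloat16, possibly split across several GPUs; the prompt is just an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint (bfloat16)
    device_map="auto",    # spread the model across the available GPUs
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```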
More Training Tokens => More Difficult to Quantize!
A few days after the release of Llama 3, in April 2024, we observed that the models were more difficult to quantize than Llama 2, e.g., with GPTQ.
One hypothesis to explain this was that Llama 3 was trained on such a large volume of tokens that each weight carried significantly more information. As a result, reducing their precision would have a greater impact.
This assumption appears to be correct: training a model on more data makes it more challenging to quantize accurately.
In a new paper, "Scaling Laws for Precision", researchers from CMU, Stanford, Databricks, Harvard, and MIT (a lot of big names!) examine the challenges of quantizing LLMs that have been extensively trained on large datasets.
The authors formulate scaling laws that predict how model performance degrades when weights, activations, or attention mechanisms are quantized at various precisions during both pre-training and post-training phases. The study finds that overtraining models on large datasets can increase their sensitivity to quantization during inference, potentially leading to decreased performance.
This degradation follows a power-law relationship with the token-to-parameter ratio observed during pre-training, indicating a critical data size beyond which additional training may be detrimental if the model is intended for quantized deployment. The authors also introduce the concept of "effective parameter count" to explain how reducing precision affects a model's capacity, noting that while weights can be trained in low precision without significant issues, activations and key-value caches are more sensitive.
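To make the intuition concrete, here is a toy sketch of that kind of relationship. The functional form and the constants below are placeholders chosen for illustration, not the fitted scaling law from the paper; the only point is that, for a fixed model size, quantization-induced degradation grows as a power law of the tokens-per-parameter ratio D/N:

```python
# Toy illustration (NOT the paper's fitted law): post-training-quantization
# degradation modeled as a power law of the tokens-per-parameter ratio D/N.
def ptq_degradation(tokens: float, params: float, exponent: float = 0.5, scale: float = 1e-3) -> float:
    """Hypothetical loss increase caused by quantizing a model trained on `tokens` tokens."""
    return scale * (tokens / params) ** exponent

params = 8e9  # an 8B-parameter model
for tokens in (2e12, 5e12, 15e12):  # 2T, 5T, and 15T training tokens
    print(f"{tokens / 1e12:.0f}T tokens -> degradation ~ {ptq_degradation(tokens, params):.4f}")
```

With placeholder constants like these, the model trained on 15T tokens loses noticeably more accuracy after quantization than the one trained on 2T tokens, which is the trend the paper formalizes.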
Note that this study is limited to small models (up to 1.7 billion parameters) and to GPTQ and AWQ quantization. While Llama 3 is indeed more difficult to quantize with GPTQ, bitsandbytes performed much better. It is possible that only some specific quantization techniques are significantly affected by the number of training tokens.
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range, e.g., RTX 4060, to high-end, e.g., RTX 4090.
While consumer GPUs have much less memory than GPUs dedicated to AI, they are far more cost-effective for inference with small batch sizes and for fine-tuning LLMs with up to ~35B parameters using PEFT methods.
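As a rough illustration of why 16 GB to 24 GB of consumer VRAM is enough for PEFT fine-tuning, here is a minimal QLoRA-style setup with bitsandbytes 4-bit quantization and a LoRA adapter. The model ID and hyperparameters are just examples, not a recommended recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example model; any similarly sized LLM works

# Load the base model in 4-bit (NF4) so it fits in consumer VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach a small LoRA adapter; only these parameters are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```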
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
GPU Selection of the Week:
RTX 4090 (24 GB): GIGABYTE GV-N4090AERO OC-24GD GeForce RTX 4090 AERO
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2
RTX 4070 Ti SUPER (16 GB): GIGABYTE GeForce RTX 4070 Ti Super WINDFORCE
RTX 4060 Ti (16 GB): Asus Dual GeForce RTX™ 4060 Ti EVO OC Edition 16GB GDDR6
No change since last week!
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐BitNet a4.8: 4-bit Activations for 1-bit LLMs
Self-Consistency Preference Optimization
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (group subscriptions get a 20% discount, or 30% for Pro subscribers):
Have a nice weekend!