Hi Everyone,
In this edition of The Weekly Kaitchup:
New Qwen2.5 Models Quantized to 4-bit
SynthID: AI Content Detection and Watermarking
Aya Expanse: The New Multilingual LLMs by Cohere
The first chapter of The Kaitchup’s book on parameter-efficient fine-tuning is out. I’m now writing the last sections of the second chapter on datasets for fine-tuning. I plan to publish it by the end of November.
You can still purchase the book with a 30% discount here:
You can also get the book for free by subscribing to The Kaitchup Pro:
New Qwen2.5 Models Quantized to 4-bit
Qwen2.5 is arguably the best series of LLMs available right now. These models outperform Llama 3.1 and 3.2 on most tasks, and with sizes ranging from small to large, there is a Qwen2.5 variant suitable for most hardware configurations.
This week, I released new 4-bit versions of Qwen2.5 1.5B and 7B models, along with their Minivoc variants, which have even greater memory efficiency:
HF Collection: Qwen2.5 Quantized
I made them with AutoRound, which is currently one of the most accurate quantization algorithms. I used carefully tuned hyperparameters, with longer tuning, to get good quantized models:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit asymmetric quantization with a group size of 128
bits, group_size, sym = 4, 128, False

# More calibration samples and iterations than the defaults, for better accuracy
autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, bits=bits, group_size=group_size, sym=sym)
autoround.quantize()

# Serialize the quantized model in the GPTQ format
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
I evaluated them on several benchmarks (zero-shot):
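For context, zero-shot evaluations like these are typically run with EleutherAI's lm-evaluation-harness. Below is a minimal sketch using its Python API; the repository ID is hypothetical and the task selection is illustrative, not the exact configuration behind the results discussed next:

import lm_eval

# Zero-shot evaluation of a quantized checkpoint (hypothetical repo ID)
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=kaitchup/Qwen2.5-7B-AutoRound-4bit",
    tasks=["arc_challenge"],  # add the task names of the other benchmarks as needed
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])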
AutoRound significantly outperforms Bitsandbytes quantization in most cases, with the exception of the Qwen2.5 7B model on benchmarks like Arc Challenge, MUSR, and GPQA. As discussed in a previous article, it’s also possible to fine-tune adapters on top of AutoRound models using QLoRA:
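For a rough illustration of that workflow, here is a minimal sketch of attaching a LoRA adapter to one of these GPTQ-serialized checkpoints with PEFT; the repository ID, target modules, and hyperparameters are placeholders rather than the exact recipe from that article:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hypothetical repo ID for one of the 4-bit AutoRound/GPTQ checkpoints
model_id = "kaitchup/Qwen2.5-7B-AutoRound-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

# Freeze the quantized weights and train only a small LoRA adapter on top
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The resulting model can then be passed to a Trainer or TRL's SFTTrainer for fine-tuning.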
Note that these AutoRound models are serialized in the GPTQ format, so they can easily be used with popular inference frameworks like TGI and vLLM. However, they use asymmetric quantization and are therefore not compatible with Marlin kernels for faster inference. If there is high demand for models with symmetric quantization, I will make them.
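Since they load like any other GPTQ checkpoint, serving them with vLLM requires no special handling; a minimal sketch, again with a hypothetical repository ID:

from vllm import LLM, SamplingParams

# vLLM picks up the GPTQ quantization from the checkpoint's config (hypothetical repo ID)
llm = LLM(model="kaitchup/Qwen2.5-7B-AutoRound-4bit")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain weight quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)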
SynthID: AI Content Detection and Watermarking
Google DeepMind and Hugging Face have released SynthID Text, a tool in Transformers v4.46.0 that adds watermarks to AI-generated text and detects them using a trained classifier. This tool helps distinguish AI-generated text from human-written content without affecting the quality of the generated text.
SynthID is presented in this paper (Nature):
Scalable watermarking for identifying large language model outputs
SynthID Text, as implemented in Hugging Face Transformers, uses a pseudo-random function to encode an imperceptible watermark into text generated by an LLM. It can be applied to any LLM through the Hugging Face Transformers model.generate() API, and a trained classifier can detect these watermarks. For instance:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

# Standard model and tokenizer initialization
tokenizer = AutoTokenizer.from_pretrained('repo/id')
model = AutoModelForCausalLM.from_pretrained('repo/id')

# SynthID Text configuration (the full list of watermarking keys is elided here)
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, ...],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["your prompts here"], return_tensors="pt")
output_sequences = model.generate(
    **tokenized_prompts,
    watermarking_config=watermarking_config,
    do_sample=True,
)
watermarked_text = tokenizer.batch_decode(output_sequences)
Users define a watermark configuration and pass it to the generation API to produce watermarked text. To detect watermarks, a classifier is trained on examples of watermarked and non-watermarked text.
If the watermark really doesn't affect the quality of the generated text, watermarking may become common in the future. It may already be in use, notably in Google's proprietary models such as Gemini.
While effective, the watermark can be weakened by significant text edits or translations. It is not designed to completely prevent malicious misuse but can help identify AI-generated content when combined with other tools.
source: Introducing SynthID Text
Aya Expanse: The New Multilingual LLMs by Cohere
Cohere For AI has introduced Aya Expanse, a new family of highly capable multilingual models available in two sizes: 8 billion and 32 billion parameters.
Note: Unfortunately, the use of these models is limited by a CC-BY-NC license.
The larger Aya Expanse 32B delivers state-of-the-art multilingual capabilities, surpassing other leading models of comparable or larger sizes.
The Aya initiative has been underway for two years, involving collaboration with over 3,000 researchers from 119 countries according to Cohere. This work has led to the release of the Aya collection, which is the largest multilingual dataset collection, and Aya-101, a multilingual model covering 101 languages.
Aya Expanse builds on these efforts by improving language coverage and performance in LLMs, with the 32B version outperforming notable models like Gemma 2 27B, Mixtral 8x22B, and even the larger Llama 3.1 70B.
Aya Expanse exploits a new data sampling strategy called data arbitrage to improve the quality of synthetic data for low-resource languages. Instead of relying on a single teacher, this approach strategically selects different "teacher" models to generate synthetic data, avoiding low-quality outputs. The models then underwent preference training to align their outputs with global standards of quality and safety, accounting for diverse cultural and linguistic contexts.
They also applied model merging techniques to combine multiple candidate models, which increased the performance of Aya Expanse. These techniques form a training recipe that seems to scale effectively from the smaller 8B model to the larger 32B model.
source: A Deepdive into Aya Expanse: Advancing the Frontier of Multilinguality
GPU Cost Tracker
This section keeps track, week after week, of the cost of GPUs. It only covers consumer GPUs, from mid-range, e.g., the RTX 4060, to high-end, e.g., the RTX 4090.
While consumer GPUs have much less memory than GPUs dedicated to AI, they are more cost-effective, by far, for inference with small batches and fine-tuning LLMs with up to ~35B parameters using PEFT methods.
To get the prices of GPUs, I use Amazon.com. If the price of a GPU drops on Amazon, there is a high chance that it will also be lower at your favorite GPU provider. All the links in this section are Amazon affiliate links.
GPU Selection of the Week:
RTX 4090 (24 GB): PNY GeForce RTX™ 4090 24GB Verto™
RTX 4080 SUPER (16 GB): GIGABYTE GeForce RTX 4080 Super WINDFORCE V2
RTX 4070 Ti SUPER (16 GB): MSI Gaming RTX 4070 Ti Super 16G AERO
RTX 4060 Ti (16 GB): PNY GeForce RTX™ 4060 Ti 16GB Verto™
The Salt
The Salt is my other newsletter that takes a more scientific approach. In The Salt, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week, I reviewed:
⭐Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Rethinking Data Selection at Scale: Random Selection is Almost All You Need
MoH: Multi-Head Attention as Mixture-of-Head Attention
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers (group subscriptions come with a 20% discount, or 30% for Pro subscribers):
Have a nice weekend!