Hi Everyone,
In this edition of The Weekly Kaitchup:
ORPO: Preference Optimization without a Reference Model
EagleX 7B: A New Checkpoint for RWKV
Hugging Face’s Quanto to Quantize any LLMs
The Kaitchup now has 2,631 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (50+) and more than 100 articles.
The yearly subscription is currently 38% cheaper than the monthly subscription.
ORPO: Preference Optimization without a Reference Model
There are many techniques for preference optimization (PO). PO is used to align LLMs with human preferences and usually requires at least two steps:
Supervised fine-tuning (SFT): The model learns to answer users’ instructions during this step.
PO training: Then, the model trained with SFT is refined to generate better answers according to humans (or according to better LLMs that are already aligned). The SFT model also serves as a “reference model” that keeps the new model from drifting too far from it during PO training.
PO is costly since these two steps must be performed consecutively. It also consumes a lot of memory: the SFT (reference) model and the model being trained with PO must both be loaded into memory at the same time.
We have already seen various techniques for PO training:
Reinforcement learning with human feedback (RLHF)
Direct preference optimization (DPO)
Identity preference optimization (IPO)
A new technique has been proposed by KAIST AI that seems to be simpler and to perform better:
ORPO: Monolithic Preference Optimization without Reference Model
ORPO can be trained directly on the same datasets used by DPO and IPO, i.e., datasets containing a pair of “good” and “bad” outputs for a given prompt. This technique doesn’t require the SFT phase and thus doesn’t require a reference model either. It’s faster and more memory-efficient.
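For reference, such a preference dataset typically looks like this. A minimal sketch with the field names used by TRL-style trainers; the examples are made up:

```python
# Illustrative preference dataset in the format used by DPO/IPO-style trainers:
# each example pairs a prompt with a preferred ("chosen") and a dispreferred
# ("rejected") completion.
from datasets import Dataset

preference_data = Dataset.from_list([
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France does not have a capital.",
    },
    {
        "prompt": "Write a one-line Python hello world.",
        "chosen": 'print("Hello, world!")',
        "rejected": "echo Hello, world!",
    },
])
print(preference_data[0])
```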
According to the authors, the learning objective of ORPO is much more effective at penalizing bad outputs. In their experiments, models trained with ORPO are better than models trained with DPO and RLHF:
Hugging Face is already working on adding ORPO to TRL, their library for fine-tuning LLMs. I’ll try it.
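Until ORPO is available in TRL, here is a toy sketch of the ORPO objective as I understand it from the paper: the usual SFT loss on the preferred answer, plus a term that penalizes the odds of the rejected answer relative to the preferred one. The length-normalized log-probabilities and the λ weight follow the paper’s formulation; this is not a drop-in implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    # chosen_logps / rejected_logps: length-normalized (per-token average)
    # log-probabilities of the preferred and dispreferred answers, shape (batch,).
    # log(odds) = log(p) - log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: push the odds of the chosen answer above those of the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Standard SFT negative log-likelihood on the chosen answer.
    sft_loss = -chosen_logps.mean()
    return sft_loss + lam * or_loss

# Toy usage with made-up log-probabilities:
print(orpo_loss(torch.tensor([-0.4, -0.6]), torch.tensor([-1.2, -0.9])))
```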
EagleX 7B: A New Checkpoint for RWKV
RWKV is an attention-free neural architecture. Since computing attention is the main bottleneck of transformers, especially for long sequences, RWKV can be 10x to 100x faster than a transformer for inference.
It remains to be confirmed that RWKV can be as good as, if not better than, state-of-the-art transformer models of similar size. The team behind RWKV had already released Eagle 7B, a pre-trained model that surpassed most other 7B models on multilingual tasks. Eagle 7B was trained on 1.1T tokens and is only an intermediate checkpoint of an ongoing training run.
A new checkpoint trained on 1.7T tokens, EagleX 7B, has been released. It is available on the Hugging Face hub:
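For readers who want to try it, here is a minimal sketch of loading an RWKV v5 checkpoint published in Hugging Face format with transformers. The model id is a placeholder; check the Hub for the exact EagleX repository name, and note that RWKV v5 models require trust_remote_code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"  # placeholder; replace with the EagleX repository
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

prompt = "The RWKV architecture is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```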
It has been announced in this Substack article:
According to their own evaluation, this checkpoint now surpasses, on English tasks, all the other 7B models used in their comparison, except Mistral 7B.
If this trend holds, the final checkpoint of RWKV v5, trained on 2T tokens, should perform at least on par with Mistral 7B.
Hopefully, this new achievement by RWKV will motivate the AI community to better support this architecture. To the best of my knowledge, fine-tuning RWKV on consumer hardware with QLoRA, or quantizing it with GPTQ or AWQ, is still not possible.
In The Salt, I wrote an extensive review explaining RWKV and showing how to use it:
Hugging Face’s Quanto to Quantize any LLMs
Hugging Face has released a new library for model quantization with PyTorch:
GitHub: huggingface/quanto
The main features:
available in eager mode (works with non-traceable models)
quantized models can be placed on any device (including CUDA and MPS),
automatically inserts quantization and dequantization stubs,
automatically inserts quantized functional operations,
automatically inserts quantized modules (see below the list of supported modules),
provides a seamless workflow for a float model, going from a dynamic to a static quantized model,
supports quantized model serialization as a state_dict,
supports not only int8 weights, but also int2 and int4,
supports not only int8 activations, but also float8.
There are already numerous libraries for quantization, but most of them are only compatible with specific model architectures and devices. With Quanto, you can quantize any PyTorch model and run it quantized on any device.
It only implements simple quantization schemes (linear quantization, per-group quantization), similar to what llama.cpp does with GGUF models, but these might be enough for most use cases.
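Based on the workflow described in their announcement, quantizing a model with Quanto looks roughly like this. A sketch only: the API may have evolved since, and the model used here is just a small example:

```python
import torch
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint8

# Load a regular float PyTorch model (GPT-2 as a small example).
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float32)

# Insert quantization/dequantization stubs; here only the weights are quantized.
quantize(model, weights=qint8)

# Freeze replaces the float weights with their quantized counterparts.
freeze(model)

# The model can now be moved to any supported device (CUDA, MPS, CPU)
# and used like a regular PyTorch model.
```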
They published a blog post describing how to use Quanto:
Quanto: a pytorch quantization toolkit
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the Kaitchup’s AI notebook(s) I have checked and updated, with a brief description of what I have done.
This week, I have checked and updated the following notebook:
#28 QLoRA Fine-tuning with FlashAttention-2 — Examples with Llama 2 7b
The notebook now checks whether your hardware is compatible with FlashAttention. The comparisons between QLoRA fine-tuning with and without FlashAttention are now conducted over more training steps. For these comparisons, I use as baselines the PyTorch implementation of scaled dot-product attention (SDPA) and the original “eager” attention implementation. FlashAttention appears 1.1x faster than SDPA and 1.5x faster than “eager”. It looks like a modest speed-up, but it may save several hours, if not days, on long fine-tuning runs.
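For reference, here is a minimal sketch of the kind of check and setup involved (not the notebook’s exact code). FlashAttention-2 requires an Ampere or newer GPU and the flash-attn package; the Llama 2 repository is gated on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM

assert torch.cuda.is_available(), "FlashAttention-2 requires a CUDA GPU"
major, _ = torch.cuda.get_device_capability()
# Compute capability >= 8.0 (Ampere or newer) is required; otherwise fall back to SDPA.
attn = "flash_attention_2" if major >= 8 else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation=attn,   # "flash_attention_2", "sdpa", or "eager"
    device_map="auto",
)
```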
I have also updated the related article to reflect these changes:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In it, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Salt, I shared a new article that shows how easy it is to inflate an LLM’s benchmark scores by mixing the benchmark data into its training data. I believe the effect of this kind of data contamination, which can be accidental, is largely underestimated when evaluating LLMs.
I also reviewed:
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
⭐Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
Stealing Part of a Production Language Model
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!
There is something I don’t understand.
You keep the SFT model as the reference, but in the Argilla blog they keep the original base model.
What is the best way to use DPO, please?
Reference:
«
Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:
model, the fine-tuned version of your model (the result from SFT);
model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.
training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.
beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. »
https://argilla.io/blog/mantisnlp-rlhf-part-3/
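For illustration, here is a minimal sketch of the setup described in that quote, using TRL’s DPOTrainer. Argument names may differ across TRL versions (newer releases move beta into a DPOConfig), and the model and dataset names below are placeholders:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model_id = "your-org/your-sft-model"  # placeholder

model = AutoModelForCausalLM.from_pretrained(sft_model_id)        # policy to optimize
ref_model = AutoModelForCausalLM.from_pretrained(sft_model_id)    # frozen reference
# (the SFT model here, or the original base model as in the Argilla post)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

train_dataset = load_dataset("your-org/your-preference-dataset", split="train")  # placeholder

training_args = TrainingArguments(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-6,
    max_steps=500,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,                  # temperature of the DPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```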