Hi Everyone,
In this edition of The Weekly Kaitchup:
ORPO: Preference Optimization without a Reference Model
EagleX 7B: A New Checkpoint for RWKV
Hugging Face’s Quanto to Quantize any LLMs
The Kaitchup now has 2,631 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (50+) and more than 100 articles.
The yearly subscription is currently 38% cheaper than the monthly subscription.
ORPO: Preference Optimization without a Reference Model
There are many techniques for preference optimization (PO). PO is used to align LLMs with human preferences and usually requires at least two steps:
Supervised fine-tuning (SFT): The model learns to answer users’ instructions during this step.
PO training: Then, the model trained with SFT is refined to generate better answers according to humans (or according to better LLMs that are already aligned). The SFT model also serves as a “reference model” that keeps the new model from drifting too far from it during PO training.
PO is costly since these two steps must be performed consecutively. It also consumes a lot of memory: the SFT (reference) model and the model being trained with PO must both be loaded into memory at the same time.
We have already seen various techniques for PO training:
Reinforcement learning with human feedback (RLHF)
Direct preference optimization (DPO)
Identity preference optimization (IPO)
A new technique has been proposed by KAIST AI that seems to be simpler and to perform better:
ORPO: Monolithic Preference Optimization without Reference Model
ORPO can be trained directly on the same datasets used by DPO and IPO, i.e., datasets containing a pair of “good” and “bad” outputs for a given prompt. This technique doesn’t require the SFT phase and thus doesn’t require a reference model either. It’s faster and more memory-efficient.
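For reference, such a preference dataset typically looks like this. A minimal sketch with the field names used by TRL-style trainers; the examples are made up:

```python
# Illustrative preference dataset in the format used by DPO/IPO-style trainers:
# each example pairs a prompt with a preferred ("chosen") and a dispreferred
# ("rejected") completion.
from datasets import Dataset

preference_data = Dataset.from_list([
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France does not have a capital.",
    },
    {
        "prompt": "Write a one-line Python hello world.",
        "chosen": 'print("Hello, world!")',
        "rejected": "echo Hello, world!",
    },
])
print(preference_data[0])
```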
According to the authors, the learning objective of ORPO is much more effective at penalizing bad outputs. In their experiments, models trained with ORPO are better than models trained with DPO and RLHF:
Hugging Face is already working on adding ORPO to TRL, their library for fine-tuning LLMs. I’ll try it.
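Until ORPO is available in TRL, here is a toy sketch of the ORPO objective as I understand it from the paper: the usual SFT loss on the preferred answer, plus a term that penalizes the odds of the rejected answer relative to the preferred one. The length-normalized log-probabilities and the λ weight follow the paper’s formulation; this is not a drop-in implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    # chosen_logps / rejected_logps: length-normalized (per-token average)
    # log-probabilities of the preferred and dispreferred answers, shape (batch,).
    # log(odds) = log(p) - log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: push the odds of the chosen answer above those of the rejected one.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Standard SFT negative log-likelihood on the chosen answer.
    sft_loss = -chosen_logps.mean()
    return sft_loss + lam * or_loss

# Toy usage with made-up log-probabilities:
print(orpo_loss(torch.tensor([-0.4, -0.6]), torch.tensor([-1.2, -0.9])))
```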
EagleX 7B: A New Checkpoint for RWKV
RWKV is an attention-free neural architecture. Since computing attention is the main bottleneck of transformers, especially for long sequences, RWKV can be 10x to 100x faster than a transformer for inference.
It remains to be confirmed that RWKV can be as good as, if not better than, state-of-the-art transformer models of similar size. The team behind RWKV had already released Eagle 7B, a pre-trained model that surpassed most other 7B models on multilingual tasks. Eagle 7B was trained on 1.1T tokens and is only an intermediate checkpoint of an ongoing training run.
A new checkpoint trained on 1.7T tokens, EagleX 7B, has been released. It is available on the Hugging Face hub:
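For readers who want to try it, here is a minimal sketch of loading an RWKV v5 checkpoint published in Hugging Face format with transformers. The model id is a placeholder; check the Hub for the exact EagleX repository name, and note that RWKV v5 models require trust_remote_code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"  # placeholder; replace with the EagleX repository
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

prompt = "The RWKV architecture is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```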
It has been announced in this Substack article:
According to their own evaluation, this checkpoint now surpasses, on English tasks, all the other 7B models used in their comparison, except Mistral 7B.
If this trend holds, the final checkpoint of RWKV v5, trained on 2T tokens, should perform at least on par with Mistral 7B.
Hopefully, this new achievement by RWKV will motivate the AI community to better support this architecture. To the best of my knowledge, fine-tuning RWKV on consumer hardware with QLoRA, or quantizing it with GPTQ or AWQ, is still not possible.
In The Salt, I wrote an extensive review explaining RWKV and showing how to use it:
Hugging Face’s Quanto to Quantize any LLMs
Hugging Face has released a new library for model quantization with PyTorch:
GitHub: huggingface/quanto
The main features:
available in eager mode (works with non-traceable models)
quantized models can be placed on any device (including CUDA and MPS),
automatically inserts quantization and dequantization stubs,
automatically inserts quantized functional operations,
automatically inserts quantized modules (see below the list of supported modules),
provides a seamless workflow for a float model, going from a dynamic to a static quantized model,
supports quantized model serialization as a state_dict,
supports not only int8 weights, but also int2 and int4,
supports not only int8 activations, but also float8.
There are already numerous libraries for quantization, but most of them are only compatible with specific model architectures and devices. With Quanto, you can quantize any PyTorch model and run it quantized on any device.
It only implements simple quantization schemes (linear quantization, per-group quantization), similar to what llama.cpp does with GGUF models, but these might be enough for most use cases.
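Based on the workflow described in their announcement, quantizing a model with Quanto looks roughly like this. A sketch only: the API may have evolved since, and the model used here is just a small example:

```python
import torch
from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint8

# Load a regular float PyTorch model (GPT-2 as a small example).
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float32)

# Insert quantization/dequantization stubs; here only the weights are quantized.
quantize(model, weights=qint8)

# Freeze replaces the float weights with their quantized counterparts.
freeze(model)

# The model can now be moved to any supported device (CUDA, MPS, CPU)
# and used like a regular PyTorch model.
```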
They published a blog post describing how to use Quanto:
Quanto: a pytorch quantization toolkit
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I mention which of the Kaitchup’s AI notebook(s) I have checked and updated, with a brief description of what I have done.
This week, I have checked and updated the following notebook:
#28 QLoRA Fine-tuning with FlashAttention-2 — Examples with Llama 2 7b
The notebook now checks whether your hardware is compatible with FlashAttention. The comparisons between QLoRA fine-tuning with and without FlashAttention are now conducted over more training steps. For these comparisons, I use as baselines the PyTorch implementation of scaled dot-product attention (SDPA) and the original “eager” attention implementation. FlashAttention appears 1.1x faster than SDPA and 1.5x faster than “eager”. It looks like a modest speed-up, but it may save several hours, if not days, on long fine-tuning runs.
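For reference, here is a minimal sketch of the kind of check and setup involved (not the notebook’s exact code). FlashAttention-2 requires an Ampere or newer GPU and the flash-attn package; the Llama 2 repository is gated on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM

assert torch.cuda.is_available(), "FlashAttention-2 requires a CUDA GPU"
major, _ = torch.cuda.get_device_capability()
# Compute capability >= 8.0 (Ampere or newer) is required; otherwise fall back to SDPA.
attn = "flash_attention_2" if major >= 8 else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation=attn,   # "flash_attention_2", "sdpa", or "eager"
    device_map="auto",
)
```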
I have also updated the related article to reflect these changes:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In it, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in The Salt, I shared a new article that shows how easy it is to inflate an LLM’s benchmark scores by mixing the benchmark data into its training data. I believe the effect of this kind of data contamination, which can be accidental, is largely underestimated when evaluating LLMs.
I also reviewed:
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
⭐Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
Stealing Part of a Production Language Model
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!
There is something I don’t understand.
You keep the SFT model as the reference, but in the Argilla blog they keep the original base model.
What is the best way to use DPO, please?
Reference:
«
Finally, they describe with a few lines of code, how you can configure a DPOTrainer class and run the train. Here is what you will need:
model, the fine-tuned version of your model (the result from SFT);
model_ref, the non-fine-tuned version of the model that's being fine-tuned. Usually it’s the original checkpoint you used before SFT.
training_args, same TrainerArguments class object present in transformers library, containing a list of training parameters such as per_device_train_batch_size, max_steps, gradient_accumulation_steps, learning_rate, evaluation_strategy, output_dir, etc.
beta, temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. »
https://argilla.io/blog/mantisnlp-rlhf-part-3/
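For illustration, here is a minimal sketch of the setup described in that quote, using TRL’s DPOTrainer. Argument names may differ across TRL versions (newer releases move beta into a DPOConfig), and the model and dataset names below are placeholders:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model_id = "your-org/your-sft-model"  # placeholder

model = AutoModelForCausalLM.from_pretrained(sft_model_id)        # policy to optimize
ref_model = AutoModelForCausalLM.from_pretrained(sft_model_id)    # frozen reference
# (the SFT model here, or the original base model as in the Argilla post)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

train_dataset = load_dataset("your-org/your-preference-dataset", split="train")  # placeholder

training_args = TrainingArguments(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-6,
    max_steps=500,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,                  # temperature of the DPO loss
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```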