Hi Everyone,
In this edition of The Weekly Kaitchup:
Mistral 7B: A very fast model that outperforms Llama 2 13B and, on many benchmarks, Llama 1 34B
Extend your LLM’s context size easily with LongLoRA
KL divergence penalty in RLHF might be useless if used with LoRA
The automatic evaluation results of the WMT23 general machine translation task
The Kaitchup now has 651 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks and articles. There is a 7-day trial that you can cancel anytime.
If you are a monthly paid subscriber, switch to a yearly subscription to get a 17% discount (2 months free)!
Mistral 7B: The First LLM by Mistral
I was eagerly waiting for this one. Mistral AI is a young company founded by very talented researchers.
Mistral 7B is another LLM that outperforms the larger Llama 2 13B. To develop and train it, Mistral AI built on and improved several open-source projects.
In particular, they modified FlashAttention and xFormers to make the model twice as fast while handling sequences of up to 16k tokens.
The speed-up is achieved by using a sliding window when computing the attention: each token only attends to a fixed window of the most recent tokens (4,096 in Mistral 7B) rather than to the full sequence.
They have also implemented a rolling buffer and a pre-filled cache for efficiency. You will find more details on GitHub.
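To make these ideas concrete, here is a minimal PyTorch sketch, not Mistral's actual implementation, of a sliding-window causal mask and a rolling buffer for the key/value cache. The window size of 4,096 matches Mistral 7B's configuration; everything else (shapes, class names) is only for illustration.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # True where attention is allowed: token i attends to tokens j
    # with i - window < j <= i (causal and within the sliding window).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

class RollingKVCache:
    """Fixed-size key/value cache: position t is written to slot t % window,
    so memory stays constant no matter how long the generated sequence gets."""
    def __init__(self, window: int, num_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, num_heads, head_dim)
        self.v = torch.zeros(window, num_heads, head_dim)

    def update(self, t: int, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        self.k[t % self.window] = k_t
        self.v[t % self.window] = v_t

# Toy example: with a window of 4, each of the 8 tokens only sees the last 4 positions.
print(sliding_window_causal_mask(8, window=4).int())
```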
Two models are released by Mistral AI:
mistralai/Mistral-7B-v0.1 (a pre-trained model)
mistralai/Mistral-7B-Instruct-v0.1 (a chat model)
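Both checkpoints are on the Hugging Face Hub. Here is a minimal sketch of how you could run the pre-trained model with Transformers, assuming a version recent enough to include the Mistral architecture; the prompt and generation settings are only for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 15 GB of VRAM; quantize if you have less
    device_map="auto",
)

inputs = tokenizer("The best way to learn about LLMs is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```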
Mistral AI is “committing to open models” but, as far as I know, they didn’t disclose their training data.
LongLoRA: A Low-Cost Method to Extend Context Size
A new approach to extend the context sizes of pre-trained LLMs using LoRA and sparse local attention.
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (Chen et al., 2023)
Extending the context size of LLMs is challenging, particularly because of the attention computation, whose cost grows quadratically with the sequence length. For instance, naively doubling the context length of Llama 2 from 4,096 to 8,192 tokens would quadruple the computational cost of the attention.
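Put differently, since the cost of self-attention grows quadratically with the sequence length $n$, doubling the context multiplies that cost by four:

$$\text{cost}_{\text{attention}} \propto n^2 \quad\Longrightarrow\quad \frac{(2n)^2}{n^2} = 4$$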
Sparse local attention, which the authors call “shift short attention” (S2-Attn), is one way to reduce this cost: tokens attend only within local groups, and half of the attention heads are shifted by half a group so that information still flows across group boundaries.
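Here is a rough PyTorch sketch of that grouping trick, adapted from the idea described in the paper rather than taken from the authors’ code; the tensor layout, sizes, and function name are assumptions for illustration.

```python
import torch

def shift_short_attention_inputs(qkv: torch.Tensor, group: int) -> torch.Tensor:
    """Rearrange projected queries/keys/values for shift short attention.

    qkv has shape (batch, seq_len, 3, num_heads, head_dim), with seq_len a
    multiple of `group`. Standard attention applied to the returned tensor
    only runs inside each group (local attention); the shifted half of the
    heads lets information flow across group boundaries.
    """
    b, n, three, h, d = qkv.shape
    fixed, shifted = qkv.chunk(2, dim=3)           # split the attention heads in two
    shifted = shifted.roll(-(group // 2), dims=1)  # shift half the heads by half a group
    qkv = torch.cat((fixed, shifted), dim=3)
    # Fold each group into the batch dimension: attention on this tensor is
    # restricted to `group` tokens at a time.
    return qkv.reshape(b * n // group, group, three, h, d)

# Toy sizes (a real run would use, e.g., 8,192 tokens and groups of 2,048):
qkv = torch.randn(1, 64, 3, 4, 8)
print(shift_short_attention_inputs(qkv, group=16).shape)  # torch.Size([4, 16, 3, 4, 8])
```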
Shift short attention can be integrated into existing LLMs thanks to LoRA. This is a slightly more expensive LoRA than usual, since the parameters of the embedding and normalization layers must also be trained. For large LLMs, the resulting adapters can contain on the order of a billion parameters.
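With Hugging Face PEFT, the trainable-parameter setup could look roughly like this. Note that this is only a sketch of which parameters are trained: the module names assume a Llama-style architecture, the rank and alpha are placeholders rather than the paper’s settings, and PEFT does not implement shift short attention itself (the authors released their own code for that).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                      # placeholder rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LoRA on the attention projections
    modules_to_save=["embed_tokens", "norm"],  # embeddings and normalization layers trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```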
With LongLoRA, the performance of the model is similar to that of a model pre-trained with the same context size. LongLoRA is also memory-efficient and converges faster during training.
LoRA-based RLHF
An interesting study on using LoRA for RLHF to train instruct LLMs:
Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF (Sun et al., 2023)
They found that the KL divergence penalty used in RLHF doesn’t improve the results when the policy is fine-tuned with LoRA. The authors assume that this is because LoRA already acts as a “powerful regularizer”.
If we don’t need the KL penalty, we also don’t need to keep in memory the initial model trained with SFT (also called the reference model). This is a significant simplification of the RLHF framework.
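As a reminder, in standard RLHF the reward maximized by PPO combines the reward model’s score with a penalty that keeps the policy $\pi_\theta$ close to the reference (SFT) model $\pi_{\text{ref}}$:

$$R(x, y) = r_\phi(x, y) - \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

Setting $\beta = 0$ removes the second term, and with it the need to store and query $\pi_{\text{ref}}$ during training.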
In a previous article using DeepSpeed Chat, I also used LoRA for RLHF. You can find the article here (also a good read if you want to know more about RLHF and the role of this KL penalty):
WMT23: The Automatic Evaluation of the General Machine Translation Task
Every year, WMT organizes an international competition in which top research institutions submit their best translation systems.
As expected, at WMT23, many of the translation systems submitted involved large language models.
I’m a co-organizer of WMT23. I helped with the automatic evaluation. The results of this evaluation are here:
Unofficial Automatic Evaluation of WMT23
Note: We called it “unofficial” since WMT officially relies on human evaluation for ranking the systems.
For instance, for English→Japanese and Russian→English, we have the following rankings:
We are currently analyzing these results and finalizing the human evaluation. We will present the results at WMT23, co-located with EMNLP 2023, in Singapore on 6-7 December.
Note: Ignore the BLEU and chrF scores, which are legacy metrics for evaluating machine translation quality.
I’m currently working on cleaning and documenting the automatic evaluation code. We will publish this code to keep the evaluation reproducible.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!