Hi Everyone,
In this edition of The Weekly Kaitchup:
Embedding Quantization with Sentence-transformers
Evolutionary Optimization for Effective Model Merging
Intel’s Neural-Speed: Extremely Fast Inference on CPU with 4-bit LLMs
The Kaitchup now has 2,740 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (50+) and more than 100 articles. There is a 7-day trial.
The yearly subscription currently costs 38% less than paying month by month.
Embedding Quantization with Sentence-transformers
Sentence-transformers now supports embedding quantization and GISTEmbedLoss. These new features can significantly speed up RAG systems and reduce their memory consumption.
Two primary types of quantization are available: binary and scalar (int8). They reduce the size of embedding values from float32 to either binary or int8 formats.
Binary quantization has been shown to preserve up to 96% of the retrieval performance while making retrieval 25 times faster. Going from 32-bit floats to 1-bit (i.e., binary) values also reduces memory consumption by 32x.
This is very promising and will further alleviate one of the main weaknesses of RAG systems, i.e., their latency.
A blog post on Hugging Face shows how to use these new techniques with sentence-transformers:
Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
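For illustration, here is a minimal sketch of how embedding quantization can be used. The model and sentences are just examples, and I assume a sentence-transformers version recent enough to include quantize_embeddings (2.6.0 or later):

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Example model (384-dimensional embeddings); any sentence-transformers model works
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "Retrieval-augmented generation grounds LLM answers in retrieved documents.",
    "Binary embeddings take 32x less memory than float32 embeddings.",
]

# Regular float32 embeddings
embeddings = model.encode(sentences)

# Binary quantization: 1 bit per dimension, packed into int8 values
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

# Scalar quantization: one int8 value per dimension; for a real corpus,
# passing calibration embeddings is recommended to compute better ranges
int8_embeddings = quantize_embeddings(embeddings, precision="int8")

print(embeddings.shape, binary_embeddings.shape, int8_embeddings.shape)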
Evolutionary Optimization for Effective Model Merging
In previous articles, I have shown how to merge LLMs into a single, better LLM or combine them into a mixture of experts. These techniques work so well that most of the LLMs now ranking at the top of the leaderboards are merges of several other LLMs.
For even better merges, Sakana AI has introduced a new technique exploiting evolutionary algorithms:
Evolutionary Optimization of Model Merging Recipes
While previous techniques require a lot of trial and error to figure out which model combinations yield the best results, this new technique can effectively benefit from all the merged models, or simply ignore the ones that are not useful.
This method works in both the parameter space and the data flow space, so it does more than just adjust parameters. For the parameter space, the authors improved DARE TIES-merging, which was already one of the best methods for merging models, with a “more granular, layer-wise merging”. For the data flow space, the approach changes the path the data takes, such as moving from one layer in one model to a different layer in another model. The merging algorithm searches for the best sequence of layers for the data to pass through.
According to the experiments presented in the paper, this method yields better results than previous work. They also show that this method can be used to add new skills to a model by borrowing these skills from other models.
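To make the parameter-space part of the idea more concrete, here is a toy sketch of an evolutionary search over per-layer merging weights between two models. This is not Sakana AI’s actual algorithm (which builds on DARE TIES and also searches the data flow space); the evaluate function and the two state dicts are hypothetical placeholders you would replace with a real benchmark and real model weights.

import random

def merge_state_dicts(sd_a, sd_b, layer_weights):
    # Interpolate the two models tensor by tensor, with one weight per tensor
    merged = {}
    for i, name in enumerate(sd_a):
        w = layer_weights[i]
        merged[name] = w * sd_a[name] + (1.0 - w) * sd_b[name]
    return merged

def evolve_merge(sd_a, sd_b, evaluate, generations=20, population=8, sigma=0.1):
    # Simple (1+lambda) evolutionary loop over the per-layer merging weights
    n_tensors = len(sd_a)
    best = [0.5] * n_tensors  # start from a uniform 50/50 merge
    best_score = evaluate(merge_state_dicts(sd_a, sd_b, best))
    for _ in range(generations):
        for _ in range(population):
            # Mutate the current best recipe and clip the weights to [0, 1]
            candidate = [min(1.0, max(0.0, w + random.gauss(0.0, sigma))) for w in best]
            score = evaluate(merge_state_dicts(sd_a, sd_b, candidate))
            if score > best_score:  # keep only recipes that improve the score
                best, best_score = candidate, score
    return best, best_score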
Code and models are available here:
Intel’s Neural-Speed: Extremely Fast Inference on CPU with 4-bit LLMs
Intel’s extension for Transformers is a very powerful library that optimizes running LLMs on CPUs. In a previous article, I showed how to use it to fine-tune LLMs with QLoRA on a CPU:
Intel regularly updates its extension for Transformers. Based on this extension, they released a new library, neural-speed, which implements highly optimized INT4 kernels for CPU inference with 4-bit LLMs.
GitHub: intel/neural-speed
They claim it makes inference up to 40x faster than llama.cpp, which also exploits optimized kernels for 4-bit inference on CPUs.
It already supports the GPTQ and GGUF formats. It is also very simple to use since it is built on their extension for Transformers: we only need to import the Transformers classes from this extension, as follows:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

# Tokenize the prompt and set up a streamer to print tokens as they are generated
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit=True quantizes the weights to 4-bit for fast CPU inference
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I list which of the Kaitchup’s AI notebooks I have checked and updated, with a brief description of the changes.
This week, I have checked and updated the following notebook:
#2 Fine-tuning GPT-NeoX-20B with QLoRA
One of my oldest notebooks running QLoRA fine-tuning with vanilla Transformers, i.e., without TRL.
The notebook now checks whether your hardware is compatible with FlashAttention and bfloat16, and it runs fine-tuning for one full epoch. I also removed the call to “model.gradient_checkpointing_enable()” since gradient checkpointing is already enabled by “prepare_model_for_kbit_training”.
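For reference, here is a minimal sketch of this kind of compatibility check (not the notebook’s exact code); it assumes a CUDA GPU and the peft library:

import torch
from peft import prepare_model_for_kbit_training

# bfloat16 and FlashAttention-2 require an Ampere GPU (compute capability 8.0) or newer
supports_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
compute_dtype = torch.bfloat16 if supports_bf16 else torch.float16
attn_implementation = "flash_attention_2" if supports_bf16 else "eager"

# prepare_model_for_kbit_training enables gradient checkpointing by default,
# so a separate call to model.gradient_checkpointing_enable() is redundant
# model = prepare_model_for_kbit_training(model)  # 'model' would be the quantized GPT-NeoX-20B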
I have updated the related article to reflect these changes:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In it, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in the Salt, I reviewed:
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Reverse Training to Nurse the Reversal Curse
⭐LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!