Hi Everyone,
In this edition of The Weekly Kaitchup:
Embedding Quantization with Sentence-transformers
Evolutionary Optimization for Effective Model Merging
Intel’s Neural-Speed: Extremely Fast Inference on CPU with 4-bit LLMs
The Kaitchup now has 2,740 subscribers. Thanks a lot for your support!
If you are a free subscriber, consider upgrading to paid to access all the notebooks (50+) and more than 100 articles. There is a 7-day trial.
The yearly subscription currently costs 38% less than paying month by month.
Embedding Quantization with Sentence-transformers
Sentence-transformers now supports embedding quantization and GISTEmbedLoss. These new features can significantly speed up RAG systems and reduce their memory consumption.
Two primary types of quantization are available: binary and scalar (int8). They reduce the size of embedding values from float32 to either binary or int8 formats.
Binary quantization has been shown to preserve up to 96% of the retrieval performance while making retrieval 25 times faster. Going from 32-bit floats to 1-bit (i.e., binary) values also reduces memory consumption by 32x.
This is very promising and will further alleviate one of the main weaknesses of RAG systems, i.e., their latency.
A blog post on Hugging Face shows how to use these new techniques with sentence-transformers:
Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
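For illustration, here is a minimal sketch of how embedding quantization can be used. The model and sentences are just examples, and I assume a sentence-transformers version recent enough to include quantize_embeddings (2.6.0 or later):

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Example model (384-dimensional embeddings); any sentence-transformers model works
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "Retrieval-augmented generation grounds LLM answers in retrieved documents.",
    "Binary embeddings take 32x less memory than float32 embeddings.",
]

# Regular float32 embeddings
embeddings = model.encode(sentences)

# Binary quantization: 1 bit per dimension, packed into int8 values
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

# Scalar quantization: one int8 value per dimension; for a real corpus,
# passing calibration embeddings is recommended to compute better ranges
int8_embeddings = quantize_embeddings(embeddings, precision="int8")

print(embeddings.shape, binary_embeddings.shape, int8_embeddings.shape)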
Evolutionary Optimization for Effective Model Merging
In previous articles, I have shown how to merge LLMs into a single, better LLM or combine them into a mixture of experts. These techniques work so well that most of the LLMs now ranking at the top of the leaderboards are merges of several other LLMs.
For even better merges, Sakana AI has introduced a new technique exploiting evolutionary algorithms:
Evolutionary Optimization of Model Merging Recipes
While previous techniques require a lot of trial and error to figure out which model combinations yield the best results, this new technique can effectively benefit from all the merged models, or simply ignore the ones that are not useful.
This method works in both the parameter space and the data flow space, so it does more than just adjust parameters. For the parameter space, the authors improved DARE TIES-merging, which was already one of the best methods for merging models, with a “more granular, layer-wise merging”. For the data flow space, the approach changes the path the data takes, such as moving from one layer in one model to a different layer in another model. The merging algorithm searches for the best sequence of layers for the data to pass through.
According to the experiments presented in the paper, this method yields better results than previous work. They also show that this method can be used to add new skills to a model by borrowing these skills from other models.
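To make the parameter-space part of the idea more concrete, here is a toy sketch of an evolutionary search over per-layer merging weights between two models. This is not Sakana AI’s actual algorithm (which builds on DARE TIES and also searches the data flow space); the evaluate function and the two state dicts are hypothetical placeholders you would replace with a real benchmark and real model weights.

import random

def merge_state_dicts(sd_a, sd_b, layer_weights):
    # Interpolate the two models tensor by tensor, with one weight per tensor
    merged = {}
    for i, name in enumerate(sd_a):
        w = layer_weights[i]
        merged[name] = w * sd_a[name] + (1.0 - w) * sd_b[name]
    return merged

def evolve_merge(sd_a, sd_b, evaluate, generations=20, population=8, sigma=0.1):
    # Simple (1+lambda) evolutionary loop over the per-layer merging weights
    n_tensors = len(sd_a)
    best = [0.5] * n_tensors  # start from a uniform 50/50 merge
    best_score = evaluate(merge_state_dicts(sd_a, sd_b, best))
    for _ in range(generations):
        for _ in range(population):
            # Mutate the current best recipe and clip the weights to [0, 1]
            candidate = [min(1.0, max(0.0, w + random.gauss(0.0, sigma))) for w in best]
            score = evaluate(merge_state_dicts(sd_a, sd_b, candidate))
            if score > best_score:  # keep only recipes that improve the score
                best, best_score = candidate, score
    return best, best_score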
Code and models are available here:
Intel’s Neural-Speed: Extremely Fast Inference on CPU with 4-bit LLMs
Intel’s extension for Transformers is a very powerful library that optimizes running LLMs on CPUs. In a previous article, I showed how to use it to fine-tune LLMs with QLoRA on a CPU:
Intel regularly updates its extension for Transformers. Based on this extension, they released a new library, neural-speed, which implements highly optimized INT4 kernels for CPU inference with 4-bit LLMs.
GitHub: intel/neural-speed
They claim it makes inference up to 40x faster than llama.cpp, which also exploits optimized kernels for 4-bit inference on CPUs.
It already supports the GPTQ and GGUF formats. It is also very simple to use since it is built on their extension for Transformers: we only need to import the Transformers classes from this extension, as follows:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

# Tokenize the prompt and set up a streamer to print tokens as they are generated
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit=True quantizes the weights to 4-bit for fast CPU inference
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
Evergreen Kaitchup
In this section of The Weekly Kaitchup, I list which of the Kaitchup’s AI notebooks I have checked and updated, with a brief description of the changes.
This week, I have checked and updated the following notebook:
#2 Fine-tuning GPT-NeoX-20B with QLoRA
One of my oldest notebooks running QLoRA fine-tuning with vanilla Transformers, i.e., without TRL.
The notebook now checks whether your hardware is compatible with FlashAttention and bfloat16, and it runs fine-tuning for one full epoch. I also removed the call to “model.gradient_checkpointing_enable()” since gradient checkpointing is already enabled by “prepare_model_for_kbit_training”.
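For reference, here is a minimal sketch of this kind of compatibility check (not the notebook’s exact code); it assumes a CUDA GPU and the peft library:

import torch
from peft import prepare_model_for_kbit_training

# bfloat16 and FlashAttention-2 require an Ampere GPU (compute capability 8.0) or newer
supports_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
compute_dtype = torch.bfloat16 if supports_bf16 else torch.float16
attn_implementation = "flash_attention_2" if supports_bf16 else "eager"

# prepare_model_for_kbit_training enables gradient checkpointing by default,
# so a separate call to model.gradient_checkpointing_enable() is redundant
# model = prepare_model_for_kbit_training(model)  # 'model' would be the quantized GPT-NeoX-20B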
I have updated the related article to reflect these changes:
The Salt
The Salt is my other newsletter that takes a more scientific approach. In it, I primarily feature short reviews of recent papers (for free), detailed analyses of noteworthy publications, and articles centered on LLM evaluation.
This week in the Salt, I reviewed:
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Reverse Training to Nurse the Reversal Curse
⭐LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!