The Weekly Kaitchup #3
AutoGPTQ in HF transformers - More long-context LLMs - Survey on compression techniques
Hi Everyone,
This week has been quieter than usual, with only a few new models, announcements, and research papers catching my attention.
But I’m very excited by the integration of AutoGPTQ into Hugging Face’s transformers that was released this week. Now, you can directly load GPTQ quantized models from the HF Hub with transformers. I think this is a game-changer for affordable AI as it greatly improves the accessibility of quantized models.
I’m writing a tutorial on how to use GPTQ models directly with HF transformers for inference and fine-tuning (with PEFT). I’m also benchmarking it on affordable GPUs. Expect it next week in your mailbox!
Starting next week, I will publish a series of articles showing how to train an RLHF model on your computer with DeepSpeed-Chat. It’s complicated, but I’ll try to make it simple. The first article of this series will be free, while the following ones will be accessible only to paid subscribers. If you are a free subscriber, consider upgrading to paid. There is a 7-day free trial:
The Kaitchup has now 347 subscribers!
Thank you for your support!
In The Weekly Kaitchup, I briefly comment on recent scientific papers, new frameworks, tips, open models/datasets, etc. for affordable and computationally efficient AI.
AutoGPTQ with Hugging Face’s transformers
As I wrote in the introduction, I think this is a game-changer considering the huge number of models already quantized and freely available on the Hugging Face Hub.
I’ll publish a detailed tutorial on how to use it. Meanwhile, here is the piece of code that will load a GPTQ quantized model on your computer:
# Note: this also requires the auto-gptq and optimum packages alongside transformers
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("kaitchup/llama-2-7b-4bit-autogptq", device_map="auto")
That’s it! Quantized models load just as easily as standard models.
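For a quick sanity check after loading, generation works exactly like the usual transformers workflow. Here is a minimal sketch, assuming the Hub repository also ships the tokenizer files:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("kaitchup/llama-2-7b-4bit-autogptq")  # assumes the repo includes a tokenizer
inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))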
This GPTQ integration into transformers was announced one day after I finished writing my article comparing GPTQ and bitsandbytes. I updated that article yesterday to add a note about it.
Giraffe — Long context open-source LLMs
Abacus.ai released new models extending the context length of Llama 2 to 32k tokens, eight times the 4k context of the original model. Their approach is explained in this paper: “Giraffe: Adventures in Expanding Context Lengths in LLMs”.
If you want to read a shorter and more accessible version of their work, I recommend reading their blog post. I like how they simply explain the scalability problem of the Transformer with a long context:
Why can’t we just train the model on longer contexts though? The primary reason for this is that a key component of modern LLM architecture – called self-attention – scales quadratically in both memory and compute as context length increases, so there will quickly be a point where you don’t have sufficient GPUs, time or money to train for longer contexts. Hence having a method that can zero-shot extrapolate to context lengths never seen before is key.
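To put rough numbers on that quadratic term (my back-of-the-envelope estimate, not a figure from Abacus.ai): the naive attention score matrix for a single head, stored in fp16, already takes about 2 GiB per layer at a 32k context.

# Back-of-the-envelope illustration (my numbers, not Abacus.ai's):
# the naive attention score matrix for one head, in fp16, grows with seq_len ** 2.
for seq_len in (4_096, 32_768):
    bytes_per_head = seq_len ** 2 * 2  # 2 bytes per fp16 score
    print(seq_len, f"{bytes_per_head / 2**30:.2f} GiB per head, per layer")
# 4096  -> 0.03 GiB per head, per layer
# 32768 -> 2.00 GiB per head, per layer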
You can see this work as a survey comparing existing techniques and highlighting their advantages/disadvantages. They also propose a new “truncation” method that seems promising.
The Giraffe models are available on the Hugging Face Hub. The 32k Giraffe, based on Llama 2 13b, is here.
In the previous edition of The Weekly Kaitchup, I wrote about Unlimiformer, a wrapper that gives LLMs unlimited context length. I wonder how it compares to the approaches studied by Abacus.ai.
Survey on Compression Techniques
I have written several articles in The Kaitchup about LLM quantization, but quantization is not the only technique that can reduce model size. The main families are:
Quantization
Pruning
Low-rank factorization
Knowledge distillation
Zhu et al. wrote a survey explaining these methods.
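To make one of these families concrete, here is a minimal sketch of low-rank factorization (my illustration, not code from the survey): a dense weight matrix is approximated by the product of two much thinner matrices obtained from a truncated SVD, which cuts both the parameter count and the matmul cost.

import torch

d_out, d_in, rank = 4096, 4096, 64  # hypothetical layer shape and target rank
W = torch.randn(d_out, d_in)  # original dense weight

# Truncated SVD: keep only the top `rank` singular components
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]  # (d_out, rank)
B = Vh[:rank, :]            # (rank, d_in)
W_approx = A @ B            # low-rank approximation of W

print(W.numel(), A.numel() + B.numel())  # ~16.8M parameters vs. ~0.5M

In practice, the rank is chosen to trade accuracy against size, and the factorization is usually applied only to selected layers.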
If you haven’t read my articles applying quantization to LLMs for fine-tuning and inference, I recommend the following ones:
Further Reading: Flash Attention
This week I finally found the time to learn about FlashAttention. I recommend this article by Aleksa Gordić (DeepMind):
It’s a very accessible explanation of why FlashAttention is much faster and more memory-efficient than standard Transformer attention, even though it computes exactly the same result. Aleksa introduces the article this way:
[…] explain flash attention in such a way that hopefully anyone who already understands attention will ask themselves:
“Why didn’t I think of this before?” followed by “It’s so easy”.
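If you want to play with it, PyTorch 2.x exposes a fused attention primitive that can dispatch to a FlashAttention kernel on supported GPUs. A minimal sketch (my example, not from Aleksa’s article):

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 2048, 128  # hypothetical shapes
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Same result as softmax(q @ k.transpose(-2, -1) / sqrt(head_dim)) @ v, but computed
# block by block so the full seq_len x seq_len matrix is never materialized in memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)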
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!