Hi Everyone,
In this edition of The Weekly Kaitchup:
NVIDIA RTX 4070 Ti SUPER: A New 16 GB GPU
MEDUSA: Inference with Multiple Concurrent Heads
AirLLM: Layered Inference for Low-Memory Hardware
Low-memory Beam Search in Transformers
The Kaitchup now has 1,762 subscribers. Thanks a lot for your support!
I launched a new newsletter: The Salt - Curated AI Papers. The goal of this newsletter is to provide reviews and analyses of recently published AI papers, mainly around LLMs but without The Kaitchup’s focus on low-cost AI. There will be weekly reviews of AI papers and one or two monthly deep-dives into papers that are particularly popular and interesting.
I have already published one edition of The Weekly Salt, here:
The first deep dive will be published later next week. If you are interested in following closely what’s happening in AI from a scientific point of view, consider subscribing to The Salt:
NVIDIA RTX 4070 Ti SUPER: A New 16 GB GPU
The NVIDIA RTX 4060 Ti* is currently my favorite GPU for low-cost AI. With 16 GB of VRAM, this GPU can run 7 billion parameter LLMs without quantization, and 13 billion parameter LLMs with quantization, for less than $500. However, the RTX 4060 Ti remains a bit too slow for fine-tuning LLMs on large datasets in my opinion.
NVIDIA just released another 16 GB GPU: The RTX 4070 Ti SUPER. It’s unclear how much faster it is than the RTX 4060 Ti, but its specifications are promising:
It can be purchased for $799, which is significantly more expensive than the RTX 4060 Ti, but I expect the price to drop quickly over the coming months.
I’ll update my “hardware for LLMs” guide once we know more about the performance of this GPU for deep learning.
*: Amazon affiliate link
MEDUSA: Inference with Multiple Concurrent Heads
MEDUSA introduces a method to accelerate LLM inference by integrating decoding heads that concurrently predict multiple tokens. These heads are fine-tuned in a parameter-efficient manner and can be easily added to existing LLMs.
The paper proposes two fine-tuning procedures for these new heads. MEDUSA-1, suitable for resource-limited scenarios, minimizes memory requirements and can be combined with quantization techniques. In contrast, MEDUSA-2 is recommended when ample computational resources are available: it uses a training protocol that jointly trains the MEDUSA heads and the base model without compromising next-token prediction capability.
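To make the idea more concrete, here is a minimal PyTorch sketch of what MEDUSA-style heads could look like. This is not the authors' implementation: the MedusaStyleHead and MedusaStyleModel names are made up for illustration, and the base model is assumed to be a Hugging Face-style causal LM that accepts output_hidden_states=True. The frozen base reflects a MEDUSA-1-style setup where only the heads are trained.

```python
# Conceptual sketch of MEDUSA-style decoding heads (not the authors' code).
import torch
import torch.nn as nn

class MedusaStyleHead(nn.Module):
    """A small residual block followed by a vocabulary projection."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = hidden_states + self.act(self.proj(hidden_states))
        return self.lm_head(h)  # logits for a token several steps ahead

class MedusaStyleModel(nn.Module):
    def __init__(self, base_model: nn.Module, hidden_size: int,
                 vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.base_model = base_model
        # MEDUSA-1-style: freeze the base model, train only the extra heads.
        for p in self.base_model.parameters():
            p.requires_grad = False
        # Head k (0-indexed) predicts the token k + 2 positions ahead;
        # the base LM head still predicts the next token.
        self.medusa_heads = nn.ModuleList(
            [MedusaStyleHead(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, input_ids: torch.Tensor):
        # One forward pass through the base model, then every head reads the
        # same last hidden state to predict its own future token.
        out = self.base_model(input_ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1]
        return [head(last_hidden) for head in self.medusa_heads]
```

At inference time, the candidate tokens proposed by the heads are verified against the base model in a single pass, which is where the speedup comes from.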
The experiments focus on scenarios with a batch size of one. In tests on models of varying sizes and training settings, including Vicuna-7B, Vicuna-13B, Vicuna-33B, and Zephyr-7B, MEDUSA achieves a speedup of 2.3 to 3.6 times across different prompt types without compromising generation quality.
The paper describing this approach is published on arXiv:
MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
There is also an implementation released by the authors:
AirLLM: Layered Inference for Low-Memory Hardware
With AirLLM, it is possible to run a 70B parameter model, such as Llama 2 70B, on consumer hardware without using quantization.
How does it work?
A 70B model, for instance, may consist of as many as 80 layers. During inference, the layers operate independently, each relying solely on the output of the preceding layer. AirLLM exploits this by implementing layered inference: it releases memory after processing each layer and keeps only that layer's output.
Layers are executed sequentially, each using the output of the previous one. Consequently, there is no need to keep all the layers in GPU memory at the same time, which saves a lot of memory.
Instead, AirLLM loads the required layer from disk during execution, performs its computations, and then entirely frees the memory afterward. This approach significantly reduces the per-layer GPU memory requirement to approximately 1.6 GB for a 70B parameter model, instead of the 120+ GB the full model would usually require.
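Here is a simplified sketch of the layered-inference idea, not AirLLM's actual code: it assumes the model has already been sharded to disk with one weight file per layer, and the layered_forward function and file layout are hypothetical.

```python
# Toy sketch of layered inference: load one layer at a time, run it, free it.
import gc
import torch

def layered_forward(hidden_states: torch.Tensor, num_layers: int,
                    shard_path: str, device: str = "cuda") -> torch.Tensor:
    for i in range(num_layers):
        # Load only the current layer's weights from its shard
        # (hypothetical layout: one .pt file per layer).
        layer = torch.load(f"{shard_path}/layer_{i}.pt", map_location=device)
        with torch.no_grad():
            hidden_states = layer(hidden_states)
        # Free the layer before moving on, so only one layer ever sits in VRAM.
        del layer
        gc.collect()
        torch.cuda.empty_cache()
    return hidden_states
```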
Moreover, AirLLM incorporates additional optimizations to further reduce memory usage, such as FlashAttention, sharding model files by layer, and Accelerate’s device_map.
Of course, this layered inference severely impacts inference speed. I tried it, and I can say it makes LLMs unusable: it takes about 1 minute to generate one token with Llama 2 7B on Google Colab’s T4. It probably targets much smaller LLMs. Still, I think it is quite impressive that it consumes less than 1 GB of memory.
Low-memory Beam Search in Transformers
Beam search improves the quality of generated text but also consumes a lot of memory. Inference with a very large beam is often impractical.
A new beam search implementation has been added to Transformers that significantly reduces memory consumption.
You can use it by passing the argument “low_memory=True” to GenerationConfig.
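For example, something like the following should enable it (GPT-2 is just a stand-in model here; any causal LM supported by generate() would work):

```python
# Low-memory (sequential) beam search with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "gpt2"  # stand-in model for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generation_config = GenerationConfig(
    num_beams=8,          # a large beam is where the memory savings matter most
    max_new_tokens=50,
    low_memory=True,      # process the beam hypotheses sequentially
)

inputs = tokenizer("The best way to reduce GPU memory usage is", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```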
How does it work?
Rather than processing all the hypotheses in the beam in parallel, it splits the batch into smaller batches that are processed sequentially, releasing memory after each one. The outputs are then concatenated to produce exactly the same result as standard beam search.
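As a toy illustration of the idea (not the actual Transformers code), the chunked_forward helper below is hypothetical: it splits a batch of beam hypotheses into chunks, processes them one at a time, and concatenates the results.

```python
# Toy illustration of chunked (sequential) processing of beam hypotheses.
import torch

def chunked_forward(model_step, beam_inputs: torch.Tensor,
                    chunk_size: int = 2) -> torch.Tensor:
    outputs = []
    for chunk in torch.split(beam_inputs, chunk_size, dim=0):
        # Peak memory now scales with one chunk instead of the full beam.
        outputs.append(model_step(chunk))
    # Concatenating the chunk outputs gives the same result as a full batch.
    return torch.cat(outputs, dim=0)
```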
Note that this is trading memory for computational time. It makes inference much slower.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!