Hi Everyone,
In this edition of The Weekly Kaitchup:
NVIDIA RTX 4070 Ti SUPER: A New 16 GB GPU
MEDUSA: Inference with Multiple Concurrent Heads
AirLLM: Layered Inference for Low-Memory Hardware
Low-memory Beam Search in Transformers
The Kaitchup now has 1,762 subscribers. Thanks a lot for your support!
I launched a new newsletter: The Salt - Curated AI Papers. The goal of this newsletter is to provide reviews and analyses of recently published AI papers, mainly around LLMs but without The Kaitchup’s focus on low-cost AI. There will be weekly reviews of AI papers and one or two monthly deep-dives into papers that are particularly popular and interesting.
I have already published one edition of The Weekly Salt, here:
The first deep dive will be published later next week. If you are interested in following closely what’s happening in AI from a scientific point of view, consider subscribing to The Salt:
NVIDIA RTX 4070 Ti SUPER: A New 16 GB GPU
The NVIDIA RTX 4060 Ti* is currently my favorite GPU for low-cost AI. With 16 GB of VRAM, this GPU can run 7 billion parameter LLMs without quantization, and 13 billion parameter LLMs with quantization, for less than $500. However, the RTX 4060 Ti remains a bit too slow for fine-tuning LLMs on large datasets in my opinion.
NVIDIA just released another 16 GB GPU: The RTX 4070 Ti SUPER. It’s unclear how much faster it is than the RTX 4060 Ti, but its specifications are promising:
It can be purchased for $799, which is significantly more expensive than the RTX 4060 Ti, but I expect the price to drop quickly over the coming months.
I’ll update my “hardware for LLMs” guide once we know more about the performance of this GPU for deep learning.
*: Amazon affiliate link
MEDUSA: Inference with Multiple Concurrent Heads
MEDUSA introduces a method to accelerate LLM inference by integrating decoding heads that concurrently predict multiple tokens. These heads are fine-tuned in a parameter-efficient manner and can be easily added to existing LLMs.
The paper proposes two fine-tuning procedures for these new heads. MEDUSA-1, suitable for resource-limited scenarios, minimizes memory requirements and can be combined with quantization techniques. In contrast, MEDUSA-2 is recommended when ample computational resources are available: it uses a training protocol that jointly trains the MEDUSA heads and the base model without compromising next-token prediction capability.
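To make the idea more concrete, here is a minimal PyTorch sketch of what MEDUSA-style heads could look like. This is not the authors' implementation: the MedusaStyleHead and MedusaStyleModel names are made up for illustration, and the base model is assumed to be a Hugging Face-style causal LM that accepts output_hidden_states=True. The frozen base reflects a MEDUSA-1-style setup where only the heads are trained.

```python
# Conceptual sketch of MEDUSA-style decoding heads (not the authors' code).
import torch
import torch.nn as nn

class MedusaStyleHead(nn.Module):
    """A small residual block followed by a vocabulary projection."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        h = hidden_states + self.act(self.proj(hidden_states))
        return self.lm_head(h)  # logits for a token several steps ahead

class MedusaStyleModel(nn.Module):
    def __init__(self, base_model: nn.Module, hidden_size: int,
                 vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.base_model = base_model
        # MEDUSA-1-style: freeze the base model, train only the extra heads.
        for p in self.base_model.parameters():
            p.requires_grad = False
        # Head k (0-indexed) predicts the token k + 2 positions ahead;
        # the base LM head still predicts the next token.
        self.medusa_heads = nn.ModuleList(
            [MedusaStyleHead(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, input_ids: torch.Tensor):
        # One forward pass through the base model, then every head reads the
        # same last hidden state to predict its own future token.
        out = self.base_model(input_ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1]
        return [head(last_hidden) for head in self.medusa_heads]
```

At inference time, the candidate tokens proposed by the heads are verified against the base model in a single pass, which is where the speedup comes from.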
The experiments focus on scenarios with a batch size of one. In tests on models of varying sizes and training settings, including Vicuna-7B, Vicuna-13B, Vicuna-33B, and Zephyr-7B, MEDUSA achieves a speedup of 2.3 to 3.6 times across different prompt types without compromising generation quality.
The paper describing this approach is published on arXiv:
MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
There is also an implementation released by the authors:
AirLLM: Layered Inference for Low-Memory Hardware
With AirLLM, it is possible to run a 70B parameter model, such as Llama 2 70B, on consumer hardware without using quantization.
How does it work?
A 70B model, for instance, may consist of as many as 80 layers. During inference, the layers operate independently, each relying solely on the output of the preceding layer. AirLLM exploits this by implementing layered inference: it releases memory after processing each layer and keeps only that layer's output.
Layers are executed sequentially, each using the output of the previous one. Consequently, there is no need to keep all the layers in GPU memory at the same time, which saves a lot of memory.
Instead, AirLLM loads the required layer from disk during execution, performs its computations, and then entirely frees the memory afterward. This approach significantly reduces the per-layer GPU memory requirement to approximately 1.6 GB for a 70B parameter model, instead of the 120+ GB the full model would usually require.
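Here is a simplified sketch of the layered-inference idea, not AirLLM's actual code: it assumes the model has already been sharded to disk with one weight file per layer, and the layered_forward function and file layout are hypothetical.

```python
# Toy sketch of layered inference: load one layer at a time, run it, free it.
import gc
import torch

def layered_forward(hidden_states: torch.Tensor, num_layers: int,
                    shard_path: str, device: str = "cuda") -> torch.Tensor:
    for i in range(num_layers):
        # Load only the current layer's weights from its shard
        # (hypothetical layout: one .pt file per layer).
        layer = torch.load(f"{shard_path}/layer_{i}.pt", map_location=device)
        with torch.no_grad():
            hidden_states = layer(hidden_states)
        # Free the layer before moving on, so only one layer ever sits in VRAM.
        del layer
        gc.collect()
        torch.cuda.empty_cache()
    return hidden_states
```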
Moreover, AirLLM incorporates additional optimizations to further reduce memory usage, such as FlashAttention, sharding model files by layer, and Accelerate’s device_map.
Of course, this layered inference severely impacts inference speed. I tried it, and I can say it makes LLMs unusable: it takes about 1 minute to generate one token with Llama 2 7B on Google Colab’s T4. It probably targets much smaller LLMs. Still, I think it is quite impressive that it consumes less than 1 GB of memory.
Low-memory Beam Search in Transformers
Beam search improves the quality of generated text but also consumes a lot of memory. Inference with a very large beam is often impractical.
A new beam search implementation has been added to Transformers that significantly reduces memory consumption.
You can use it by passing the argument “low_memory=True” to GenerationConfig.
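For example, something like the following should enable it (GPT-2 is just a stand-in model here; any causal LM supported by generate() would work):

```python
# Low-memory (sequential) beam search with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "gpt2"  # stand-in model for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generation_config = GenerationConfig(
    num_beams=8,          # a large beam is where the memory savings matter most
    max_new_tokens=50,
    low_memory=True,      # process the beam hypotheses sequentially
)

inputs = tokenizer("The best way to reduce GPU memory usage is", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```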
How does it work?
Rather than processing all the hypotheses in the beam in parallel, it splits the batch into smaller batches that are processed sequentially, releasing memory after each one. The outputs are then concatenated to produce exactly the same result as standard beam search.
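As a toy illustration of the idea (not the actual Transformers code), the chunked_forward helper below is hypothetical: it splits a batch of beam hypotheses into chunks, processes them one at a time, and concatenates the results.

```python
# Toy illustration of chunked (sequential) processing of beam hypotheses.
import torch

def chunked_forward(model_step, beam_inputs: torch.Tensor,
                    chunk_size: int = 2) -> torch.Tensor:
    outputs = []
    for chunk in torch.split(beam_inputs, chunk_size, dim=0):
        # Peak memory now scales with one chunk instead of the full beam.
        outputs.append(model_step(chunk))
    # Concatenating the chunk outputs gives the same result as a full batch.
    return torch.cat(outputs, dim=0)
```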
Note that this is trading memory for computational time. It makes inference much slower.
That’s all for this week.
If you like reading The Kaitchup, consider sharing it with friends and coworkers:
Have a nice weekend!