Fast Speculative Decoding with Llama 3.2 and vLLM
Boost LLM inference speed with speculative decoding!
Speculative decoding (a.k.a. assisted decoding) is a method for speeding up inference in language models. It uses two models: a smaller draft model to generate token suggestions and a larger main model to validate them.
If most of the draft model's tokens are accepted, the process can greatly accelerate inference: the main model only needs to verify the proposed tokens, and it can score all of them in a single forward pass instead of generating them one at a time. However, if the main model has to reject and correct many tokens, speculative decoding can be slower than using the main model alone. The key to its success lies in selecting a good pair of models, ideally trained on the same data and using the same tokenizer, with the draft model significantly smaller than the main model.
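To make the draft-then-verify idea concrete, here is a heavily simplified sketch of one decoding step. The callables `draft_next_token` and `main_forward` are hypothetical stand-ins for the two models, and the greedy token-matching check is a simplification of the rejection-sampling verification used in practice to preserve the main model's output distribution.

```python
def speculative_decode_step(prompt_ids, draft_next_token, main_forward, k=5):
    """One simplified speculative decoding step (greedy matching variant)."""
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = list(prompt_ids)
    proposals = []
    for _ in range(k):
        tok = draft_next_token(draft_ids)
        proposals.append(tok)
        draft_ids.append(tok)

    # 2) The main model checks all k positions in a single forward pass,
    #    returning its own greedy prediction at each proposed position
    #    plus one extra prediction after the last proposal (k + 1 total).
    main_preds = main_forward(list(prompt_ids), proposals)

    # 3) Accept proposals left to right while they match the main model;
    #    on the first mismatch, keep the main model's token and stop.
    accepted = []
    for i, tok in enumerate(proposals):
        if tok == main_preds[i]:
            accepted.append(tok)
        else:
            accepted.append(main_preds[i])
            break
    else:
        # All k proposals accepted: the main model's extra token comes for free.
        accepted.append(main_preds[k])

    return list(prompt_ids) + accepted
```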
With the release of Llama 3.2 1B and 3B, both products of Llama 3.1 distillation, we may have good candidate draft models for speculative decoding targeting Llama 3.1 8B or 70B.
In this article, we will experiment with speculative decoding, using Llama 3.2 models as draft models and Llama 3.1 models as the main (target) models, to see whether we can significantly accelerate inference. We will use vLLM.
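As a preview, a draft/target pairing along these lines might be configured as in the sketch below. This is only an illustration: the exact argument names (`speculative_model`, `num_speculative_tokens`) have changed across vLLM releases, so check the documentation of the version you are running, and the sampling parameters here are arbitrary choices.

```python
from vllm import LLM, SamplingParams

# Illustrative pairing: Llama 3.1 8B as the main (target) model,
# Llama 3.2 1B as the draft model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,  # tokens proposed by the draft model per step
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."], sampling_params
)
print(outputs[0].outputs[0].text)
```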
The code for speculative decoding using vLLM and Llama 3.2 is implemented in this notebook: