Fast Speculative Decoding with Llama 3.2 and vLLM
Boost LLM inference speed with speculative decoding!
Speculative decoding (a.k.a. assisted decoding) is a method for speeding up inference in language models. It uses two models: a smaller draft model to generate token suggestions and a larger main model to validate them.
If most of the draft model's tokens are accepted, the process can greatly accelerate inference: the main model only needs to verify the proposed tokens, and it can score all of them in a single forward pass instead of generating them one at a time. However, if the main model has to reject and correct many tokens, speculative decoding can be slower than using the main model alone. The key to its success lies in selecting a good pair of models, ideally trained on the same data and using the same tokenizer, with the draft model significantly smaller than the main model.
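To make the draft-then-verify idea concrete, here is a heavily simplified sketch of one decoding step. The callables `draft_next_token` and `main_forward` are hypothetical stand-ins for the two models, and the greedy token-matching check is a simplification of the rejection-sampling verification used in practice to preserve the main model's output distribution.

```python
def speculative_decode_step(prompt_ids, draft_next_token, main_forward, k=5):
    """One simplified speculative decoding step (greedy matching variant)."""
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = list(prompt_ids)
    proposals = []
    for _ in range(k):
        tok = draft_next_token(draft_ids)
        proposals.append(tok)
        draft_ids.append(tok)

    # 2) The main model checks all k positions in a single forward pass,
    #    returning its own greedy prediction at each proposed position
    #    plus one extra prediction after the last proposal (k + 1 total).
    main_preds = main_forward(list(prompt_ids), proposals)

    # 3) Accept proposals left to right while they match the main model;
    #    on the first mismatch, keep the main model's token and stop.
    accepted = []
    for i, tok in enumerate(proposals):
        if tok == main_preds[i]:
            accepted.append(tok)
        else:
            accepted.append(main_preds[i])
            break
    else:
        # All k proposals accepted: the main model's extra token comes for free.
        accepted.append(main_preds[k])

    return list(prompt_ids) + accepted
```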
With the release of Llama 3.2 1B and 3B, both products of Llama 3.1 distillation, we may have good candidate draft models for speculative decoding targeting Llama 3.1 8B or 70B.
In this article, we will experiment with speculative decoding, using Llama 3.2 models as draft models and Llama 3.1 models as the main (target) models, to see whether we can significantly accelerate inference. We will use vLLM.
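As a preview, a draft/target pairing along these lines might be configured as in the sketch below. This is only an illustration: the exact argument names (`speculative_model`, `num_speculative_tokens`) have changed across vLLM releases, so check the documentation of the version you are running, and the sampling parameters here are arbitrary choices.

```python
from vllm import LLM, SamplingParams

# Illustrative pairing: Llama 3.1 8B as the main (target) model,
# Llama 3.2 1B as the draft model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,  # tokens proposed by the draft model per step
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."], sampling_params
)
print(outputs[0].outputs[0].text)
```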
The code for speculative decoding using vLLM and Llama 3.2 is implemented in this notebook: