The Kaitchup – AI on a Budget

Fast Speculative Decoding with Llama 3.2 and vLLM

Boost LLM inference speed with speculative decoding!

Benjamin Marie
Oct 14, 2024

A speculating Llama — Generated with Grok

Speculative decoding (a.k.a. assisted decoding) is a method for speeding up inference in language models. It uses two models: a smaller draft model to generate token suggestions and a larger main model to validate them.
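To make the draft-and-validate idea concrete, here is a minimal toy sketch of the greedy variant. It is not vLLM's implementation: `target_model` and `draft_model` are hypothetical stand-ins (plain functions that map a token context to the next token), and real systems verify all draft tokens in one batched forward pass and also support sampling-based acceptance.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding on toy models.

    Returns the generated sequence and the number of verification rounds.
    In a real system, each round costs roughly one target forward pass.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        # 1) The draft model proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) The target model checks the proposal position by position.
        rounds += 1
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)       # draft token validated
            else:
                accepted.append(expected)    # first mismatch: keep the target's token
                break
        else:
            # Every draft token was accepted: the target adds one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


# Hypothetical toy models: the draft agrees with the target except after token 7.
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] * 2 + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else target_model(ctx)


def greedy_decode(model: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the target model alone, one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens
```

With greedy verification, the output is identical to what the target model would produce on its own; only the number of target passes changes.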

If most of the draft model's tokens are accepted, the process can greatly accelerate inference: the main model validates several draft tokens in a single forward pass instead of generating them one by one. However, if the main model has to correct many tokens, speculative decoding can be slower than using the main model alone, since the time spent drafting is wasted. The key to its success lies in selecting a well-matched pair of models, ideally trained on the same data and using the same tokenizer, with the draft model significantly smaller than the main model.
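The trade-off above can be estimated with a back-of-the-envelope calculation. The sketch below uses the simplified analysis from the original speculative decoding paper (Leviathan et al., 2023), under strong assumptions: each draft token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per round, and `cost_ratio` is the time of one draft step divided by the time of one main-model step. The function names are illustrative, and real speedups also depend on batching, memory bandwidth, and implementation overhead.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per main-model verification pass.

    alpha: assumed per-token acceptance rate of the draft (0.0 to 1.0).
    gamma: number of draft tokens proposed per round.
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # all drafts accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Rough wall-time speedup over decoding with the main model alone.

    One round costs gamma draft steps plus one main-model pass,
    i.e. (gamma * cost_ratio + 1) main-model steps.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Compare a well-matched, cheap draft with a poorly matched, expensive one.
    for alpha, cost_ratio in [(0.8, 0.05), (0.2, 0.2)]:
        speedup = estimated_speedup(alpha, gamma=5, cost_ratio=cost_ratio)
        print(f"alpha={alpha}, cost_ratio={cost_ratio}: ~{speedup:.2f}x")
```

A value below 1 means the pair is expected to be slower than the main model alone, which is the failure case described above.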

With the release of Llama 3.2 1B and 3B, products of Llama 3.1 distillation, we may have good candidate draft models for speculative decoding targeting Llama 3.1 8B or 70B.

Related article: Fine-Tuning Meta's Llama 3.2 1B & 3B Models on Budget GPUs (Benjamin Marie, September 30, 2024)

In this article, we will experiment with speculative decoding using Llama 3.2 as draft models and Llama 3.1 as target models to see whether we can significantly accelerate inference. We will use vLLM.
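As a rough idea of what such a setup can look like, here is a minimal sketch. It is not the notebook's code. It assumes a vLLM release from around the time of this post (0.6.x), where the draft model is passed to `LLM` through the `speculative_model` and `num_speculative_tokens` arguments; newer releases moved these settings into a `speculative_config` dictionary, so check the documentation for your version. The model IDs, the helper names, and the choice of 5 speculative tokens are illustrative assumptions.

```python
from typing import Any, Dict


def build_llm_kwargs(
    target_model: str,
    draft_model: str,
    num_speculative_tokens: int = 5,
) -> Dict[str, Any]:
    """Collect engine arguments for draft-model speculative decoding.

    Assumes the vLLM 0.6.x-style arguments `speculative_model` and
    `num_speculative_tokens`; other versions may expect a different layout.
    """
    return {
        "model": target_model,
        "speculative_model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }


def generate(prompt: str, llm_kwargs: Dict[str, Any], max_tokens: int = 128) -> str:
    """Run one greedy generation with vLLM (needs a GPU and vLLM installed)."""
    # Imported here so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**llm_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


# Assumed Hugging Face model IDs: Llama 3.1 8B as target, Llama 3.2 1B as draft.
LLM_KWARGS = build_llm_kwargs(
    target_model="meta-llama/Llama-3.1-8B-Instruct",
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
)
```

Calling `generate("Explain speculative decoding.", LLM_KWARGS)` would load both models. To measure the effect, run the same prompts once with and once without the two speculative arguments and compare the timings.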

The code for speculative decoding using vLLM and Llama 3.2 is implemented in this notebook:

Get the notebook (#12)

Speculative Decoding with vLLM
