vLLM: Serve Fast Mistral 7B and Llama 2 Models from Your Computer

Offline inference and serving with quantized models

Benjamin Marie
Feb 15, 2024

vLLM is one of the fastest frameworks available for serving large language models (LLMs). It implements many inference optimizations, including custom CUDA kernels and PagedAttention, and supports various model architectures, such as Falcon, Llama 2, Mistral 7B, Qwen, and more. These models can be served quantized and with LoRA adapters.
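To make this concrete, here is a minimal sketch of offline inference with vLLM's Python API, assuming an AWQ-quantized Mistral 7B checkpoint from the Hugging Face Hub. The model name below is only an example, not necessarily the exact configuration used later in this article:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized Mistral 7B checkpoint (example name; any AWQ
# checkpoint from the Hugging Face Hub can be used the same way).
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["Explain what PagedAttention does in one short paragraph."]

# Offline (batch) inference: vLLM batches and schedules the requests internally.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```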


In this article, I present vLLM and demonstrate how to serve Mistral 7B and Llama 2, quantized with AWQ and SqueezeLLM, from your computer. I show how to do it offline and with a vLLM local server running in the background. While I use Mistral 7B and Llama 2 7B in this article, the process is the same for the other LLMs supported by vLLM.
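As a quick preview of the server mode, vLLM ships an OpenAI-compatible endpoint that can run in the background and be queried with a standard OpenAI client. The sketch below assumes the same example AWQ checkpoint as above and the default port (8000); it is not the exact setup detailed later in the article:

```python
# Start the OpenAI-compatible vLLM server in a separate terminal, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantization awq
#
# Then query it from any OpenAI-compatible client:
from openai import OpenAI

# The local vLLM server does not check the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    prompt="Summarize why quantization speeds up LLM serving.",
    max_tokens=128,
)
print(completion.choices[0].text)
```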

You can replicate my experiments by running this notebook.

Get the notebook (#44)

Note: I demonstrate how to use vLLM using an NVIDIA GPU, but vLLM also supports AMD GPUs with ROCm.

PagedAttention for Faster Inference with vLLM

Back in June 2023, I first wrote about vLLM in this article:

vLLM: PagedAttention for 24x Faster LLM Inference
Benjamin Marie · June 24, 2023
