The Kaitchup – AI on a Budget
Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM
Understanding how much memory you need to serve a VLM

Benjamin Marie
Sep 19, 2024
[Image: An image prompt encoded by Pixtral]

vLLM is currently one of the fastest inference engines for large language models (LLMs). It supports a wide range of model architectures and quantization methods. We have also seen that it can efficiently serve models equipped with multiple LoRA adapters:

Serve Multiple LoRA Adapters with vLLM (Benjamin Marie, August 1, 2024)

vLLM also supports vision-language models (VLMs) with multimodal inputs containing both images and text prompts. For instance, vLLM can now serve models like Phi-3.5 Vision and Pixtral, which excel at tasks such as image captioning, optical character recognition (OCR), and visual question answering (VQA).
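To make the multimodal input format concrete, here is a minimal sketch of how an image and a text prompt are packaged for vLLM's offline API. The helper name is my own, and the Phi-3.5 Vision chat template shown is an assumption based on vLLM's documented `prompt`/`multi_modal_data` request format; verify both against your vLLM version.

```python
# Sketch: packaging an image + text question into the request dict that
# vLLM's LLM.generate() accepts for multimodal models.
# The helper name and template are illustrative, not from this article.

def build_phi35_vision_request(image, question: str) -> dict:
    """Wrap an image (e.g., a PIL.Image) and a question into a vLLM request."""
    # Phi-3.5 Vision expects an <|image_1|> placeholder in its chat template.
    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
    return {"prompt": prompt, "multi_modal_data": {"image": image}}

# On a GPU machine, the request would then be served with something like:
# from vllm import LLM, SamplingParams
# llm = LLM(model="microsoft/Phi-3.5-vision-instruct", trust_remote_code=True)
# outputs = llm.generate(
#     build_phi35_vision_request(img, "Describe this image."),
#     SamplingParams(max_tokens=128),
# )
```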


In this article, I will show you how to use VLMs with vLLM, focusing on key parameters that impact memory consumption. We will see why VLMs consume much more memory than standard LLMs. We’ll use Phi-3.5 Vision and Pixtral as case studies for a multimodal application that processes prompts containing text and images.
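A key reason VLMs need more memory than text-only LLMs is that each image is encoded into hundreds or thousands of extra tokens, and the keys and values for all of those tokens must be held in the KV cache. A back-of-the-envelope estimate can be sketched as follows; the function and the example configuration are my own illustrative assumptions, not figures from this article.

```python
# Rough estimate of serving memory: model weights + KV cache.
# All numbers below are illustrative assumptions, not measured values.

def estimate_serving_memory_gb(
    n_params_b: float,        # parameter count, in billions
    n_layers: int,            # transformer layers
    n_kv_heads: int,          # key/value heads per layer
    head_dim: int,            # dimension per head
    max_seq_len: int,         # text tokens + image tokens per sequence
    max_num_seqs: int,        # concurrent sequences served
    bytes_per_param: int = 2, # FP16/BF16 weights
    bytes_per_kv: int = 2,    # FP16/BF16 KV cache
) -> float:
    weights = n_params_b * 1e9 * bytes_per_param
    # Two tensors (K and V) per layer must be cached for every token.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv
    kv_cache = kv_per_token * max_seq_len * max_num_seqs
    return (weights + kv_cache) / 1024**3

# Hypothetical Phi-3.5-Vision-like config: ~4.15B params, 32 layers,
# 32 KV heads, head_dim 96, one 8K-token sequence (image tokens included).
print(f"{estimate_serving_memory_gb(4.15, 32, 32, 96, 8192, 1):.1f} GB")
```

The point of the sketch: the weights term is fixed, but the KV-cache term scales with sequence length, so a prompt whose image alone contributes a few thousand tokens can add gigabytes of cache on top of the model itself.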

The code for running Phi-3.5 Vision and Pixtral with vLLM is provided in this notebook:

Get the notebook (#105)
