Run and Serve Faster VLMs Like Pixtral and Phi-3.5 Vision with vLLM
Understanding how much memory you need to serve a VLM
vLLM is currently one of the fastest inference engines for large language models (LLMs). It supports a wide range of model architectures and quantization methods. In addition, as we saw previously, it can efficiently serve models equipped with multiple LoRA adapters.
vLLM also supports vision-language models (VLMs) with multimodal inputs containing both images and text prompts. For instance, vLLM can now serve models like Phi-3.5 Vision and Pixtral, which excel at tasks such as image captioning, optical character recognition (OCR), and visual question answering (VQA).
In this article, I will show you how to use VLMs with vLLM, focusing on key parameters that impact memory consumption. We will see why VLMs consume much more memory than standard LLMs. We’ll use Phi-3.5 Vision and Pixtral as case studies for a multimodal application that processes prompts containing text and images.
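To give an idea of what this looks like in practice, here is a minimal sketch of offline multimodal inference with vLLM's Python API. It assumes Phi-3.5 Vision, a local image file named `example.jpg`, and illustrative values for the memory-related settings (`max_model_len`, `gpu_memory_utilization`, `limit_mm_per_prompt`) that we will come back to when discussing memory consumption.

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Load Phi-3.5 Vision. trust_remote_code is required for this model.
# max_model_len and gpu_memory_utilization are illustrative values:
# both directly affect how much GPU memory vLLM reserves.
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.9,
    limit_mm_per_prompt={"image": 1},  # allow at most one image per prompt
)

# Phi-3.5 Vision expects image placeholders like <|image_1|> in the prompt.
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
image = Image.open("example.jpg")  # hypothetical local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Serving Pixtral follows the same pattern; only the model name and the prompt format used for image placeholders change.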
The code for running Phi-3.5 Vision and Pixtral with vLLM is provided in this notebook: