Run Qwen2-VL on Your Computer with Text, Images, and Video, Step by Step
Your local multimodal chat model
Alibaba’s Qwen2-VL are vision language models now available as 2B and 7B parameter models. They are generative language models that support multimodal inputs. You can provide Qwen2-VL, along with text, a single image, multiple images, or a 20-minute video!
The models demonstrate impressive performance in visual understanding. Like Microsoft’s Florence-2, Qwen2-VL can perform many types of tasks such as OCR, image captioning, question answering, visual grounding, etc.
Quantized versions, using the AWQ and GPTQ formats, were also published by Alibaba to facilitate deployment on smaller GPUs.
In this article, I first review Qwen2-VL’s architecture and performance. Then, we will see how to use Qwen2-VL with multiple images and videos using small GPUs (8 GB and 12 GB). I explain how to set up the model and format the prompt step by step.
The examples detailed in this article are also implemented in this notebook:



