Run Qwen2-VL on Your Computer with Text, Images, and Video, Step by Step
Your local multimodal chat model
Alibaba’s Qwen2-VL is a family of vision-language models available in 2B and 7B parameter versions. They are generative language models that support multimodal inputs: along with text, you can feed Qwen2-VL a single image, multiple images, or even a 20-minute video!
The models demonstrate impressive visual understanding. Like Microsoft’s Florence-2, Qwen2-VL can perform a wide range of tasks, such as OCR, image captioning, question answering, and visual grounding.
Alibaba also published quantized versions, in the AWQ and GPTQ formats, to facilitate deployment on smaller GPUs.
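For instance, the AWQ version of the 7B model can be loaded directly with Hugging Face Transformers. Here is a minimal sketch, assuming a recent version of Transformers (Qwen2-VL support was added in v4.45) and the AutoAWQ package are installed; the checkpoint name is the one Alibaba published on the Hugging Face Hub:

```python
# Minimal sketch: load the AWQ-quantized Qwen2-VL 7B checkpoint.
# Assumes transformers >= 4.45 and autoawq are installed.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct-AWQ"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # place the model on the available GPU
)

# The processor handles the chat template and image/video preprocessing
processor = AutoProcessor.from_pretrained(model_id)
```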
In this article, I first review Qwen2-VL’s architecture and performance. Then, we will see how to run Qwen2-VL with multiple images and videos on small GPUs (8 GB and 12 GB). I explain, step by step, how to set up the model and format the prompt; a preview of the prompt format follows below.
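As a preview, here is a minimal sketch following the usage pattern from the Qwen2-VL model card: the user message is a list of content blocks (images, videos, and text) rendered with the processor’s chat template. It assumes the `model` and `processor` loaded in the previous snippet, plus the qwen_vl_utils helper package; the image URL is a placeholder.

```python
# Minimal sketch of Qwen2-VL's prompt format (assumes `model` and
# `processor` from the previous snippet, and qwen_vl_utils installed).
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder URL; an image can also be a local path or PIL image
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and extract the visual inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```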
The examples detailed in this article are also implemented in this notebook: