Florence-2: Run Multitask Vision-language Models on Your Computer
Object detection, OCR, image captioning, segmentation, and grounding
Florence-2, released by Microsoft, is a family of state-of-the-art vision-language models (VLMs) designed to tackle a wide array of vision and vision-language tasks through a prompt-based interface. The models are remarkably small: Microsoft reports that Florence-2 matches or outperforms VLMs up to 100x its size.
Two versions are available: a 230M-parameter base model and a 770M-parameter large model. Even the 770M version is small enough to run on a 4 GB GPU.
In this article, I review Florence-2: the neural architecture behind the models and how they were trained. Then, I show how to use them for various vision-language tasks such as image captioning, segmentation, OCR, and object detection.
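Concretely, each task is selected by a special task token placed at the start of the prompt. The sketch below illustrates this interface; the task-token strings follow the Hugging Face model card for Florence-2, while the `build_prompt` helper is my own illustration, not part of the library:

```python
# A minimal sketch of Florence-2's prompt-based interface.
# Task-token names taken from the Hugging Face Florence-2 model card;
# the helper function itself is illustrative.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

def build_prompt(task, text=None):
    """Compose a Florence-2 prompt for the given task.

    Grounding and segmentation tasks also take a free-text input,
    appended after the task token; purely visual tasks (captioning,
    detection, OCR) use the token alone.
    """
    token = TASK_TOKENS[task]
    return token + text if text else token
```

For example, `build_prompt("object_detection")` yields `"<OD>"`, while `build_prompt("segmentation", "a red car")` yields `"<REFERRING_EXPRESSION_SEGMENTATION>a red car"`. The same model weights handle every task; only the prompt changes.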
I wrote a notebook that you can use to try Florence-2 with your own images: