The Kaitchup – AI on a Budget

Florence-2: Run Multitask Vision-language Models on Your Computer

Object detection, OCR, image captioning, segmentation, and grounding

Benjamin Marie
Jul 01, 2024
Captioning and object detection with Florence-2

Florence-2 by Microsoft is a family of state-of-the-art vision-language models (VLMs) designed to handle a wide range of vision and vision-language tasks with a prompt-based approach. The models are remarkably small, yet they can outperform VLMs 100x their size.

Two versions are available, with 230M and 770M parameters. Even the 770M version is small enough to run on a 4 GB GPU.
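To see why a 4 GB GPU is enough, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is only a sketch: the helper `weight_memory_gb` is a name I made up for illustration, and the numbers ignore activations, the image inputs, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights only, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as stated above; 4 bytes per float32 value, 2 per float16 value.
models = {"Florence-2 230M": 230e6, "Florence-2 770M": 770e6}

for name, params in models.items():
    fp32 = weight_memory_gb(params, 4)
    fp16 = weight_memory_gb(params, 2)
    print(f"{name}: {fp32:.2f} GB in float32, {fp16:.2f} GB in float16")
```

With 770M parameters, the weights take about 3.08 GB in float32 and about 1.54 GB in float16, which leaves some headroom on a 4 GB card, especially in half precision.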


In this article, I review Florence-2. We will see what neural architecture the models use and how they were trained. Then, I show how to use them for various vision-language tasks such as image captioning, segmentation, OCR, and object detection.
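To give an idea of what the prompt-based approach looks like in practice, here is a minimal sketch based on the usage shown on the Florence-2 model cards on the Hugging Face Hub. It is not the code from the notebook. The task tokens (`<CAPTION>`, `<OD>`, `<OCR>`, ...) and the checkpoint name `microsoft/Florence-2-base` come from the model cards; the dictionary keys and the helpers `build_prompt` and `run_florence2` are names I chose for illustration. The heavy libraries are imported inside the function, so the prompt helper can be tried without them. Depending on your Transformers version, the remote model code may need adjustments.

```python
# Each task is selected by a special token placed at the start of the prompt.
# The short keys on the left are illustrative names, not part of the model.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Task token, followed by optional extra text (used by grounding/segmentation)."""
    return TASK_PROMPTS[task] + text_input


def run_florence2(image_path, task, text_input="", model_id="microsoft/Florence-2-base"):
    """Run one task on one image and return the parsed result (assumed setup)."""
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # trust_remote_code is needed because the model code is hosted with the checkpoint.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    # Keep special tokens: the post-processor parses them (e.g., location tokens).
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        generated_text, task=TASK_PROMPTS[task], image_size=(image.width, image.height)
    )
```

For example, `run_florence2("photo.jpg", "object_detection")` would run object detection on a hypothetical local file `photo.jpg`, and switching to captioning or OCR only requires changing the task name.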

I wrote a notebook that you can use to try Florence-2 with your own images:

Get the notebook (#83)
