Florence-2: Run Multitask Vision-language Models on Your Computer
Object detection, OCR, image captioning, segmentation, and grounding
Florence-2, released by Microsoft, is a family of state-of-the-art vision-language models (VLMs) designed to tackle a wide array of vision and vision-language tasks through a prompt-based interface. The models are remarkably small: Microsoft reports that Florence-2 matches or outperforms VLMs up to 100x its size.
Two versions are available: a 230M-parameter base model and a 770M-parameter large model. Even the 770M version is small enough to run on a 4 GB GPU.
In this article, I review Florence-2: the neural architecture behind the models and how they were trained. Then, I show how to use them for various vision-language tasks such as image captioning, segmentation, OCR, and object detection.
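Concretely, each task is selected by a special task token placed at the start of the prompt. The sketch below illustrates this interface; the task-token strings follow the Hugging Face model card for Florence-2, while the `build_prompt` helper is my own illustration, not part of the library:

```python
# A minimal sketch of Florence-2's prompt-based interface.
# Task-token names taken from the Hugging Face Florence-2 model card;
# the helper function itself is illustrative.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

def build_prompt(task, text=None):
    """Compose a Florence-2 prompt for the given task.

    Grounding and segmentation tasks also take a free-text input,
    appended after the task token; purely visual tasks (captioning,
    detection, OCR) use the token alone.
    """
    token = TASK_TOKENS[task]
    return token + text if text else token
```

For example, `build_prompt("object_detection")` yields `"<OD>"`, while `build_prompt("segmentation", "a red car")` yields `"<REFERRING_EXPRESSION_SEGMENTATION>a red car"`. The same model weights handle every task; only the prompt changes.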
I wrote a notebook that you can use to try Florence-2 with your own images: