Florence-2 is a family of small vision-language models (VLMs) with impressive capabilities. These models can accurately perform a wide range of tasks such as image captioning, object detection, OCR, and more. In a previous article, we saw how to run Florence-2:
Florence-2 can also be fine-tuned to tackle specific domains and tasks more accurately. For instance, we can fine-tune it into a multimodal chat model, i.e., a model we can use to discuss an image, similarly to what we can do with GPT-4.
In this article, we will see how to turn Florence-2 into a multimodal chat model. Since Florence-2 is a small model, fine-tuning is quick and doesn't consume a huge amount of memory. Even though we won't use PEFT methods, fine-tuning Florence-2 large can be done on consumer hardware, for instance with a 16 GB GPU.
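To see why a 16 GB GPU is plausible, here is a back-of-the-envelope memory estimate. It assumes Florence-2 large has roughly 0.77B parameters (per its model card) and uses the common rule of thumb for mixed-precision full fine-tuning with AdamW: about 16 bytes per parameter for weights, gradients, and optimizer states, before counting activations. The function name and exact byte budget below are illustrative assumptions, not part of the original article.

```python
# Rough GPU memory budget for mixed-precision full fine-tuning with AdamW:
#   fp16 weights (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + Adam moments m and v (4 B + 4 B)
#   = ~16 bytes per parameter (activations not included).

def finetune_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Approximate GPU memory (in GB) for model + gradients + optimizer states."""
    return n_params * bytes_per_param / 1024**3

if __name__ == "__main__":
    # Florence-2 large: ~0.77B parameters (per the model card)
    estimate = finetune_memory_gb(0.77e9)
    print(f"~{estimate:.1f} GB for weights, gradients, and optimizer states")
```

This lands around 11-12 GB, which leaves some headroom for activations on a 16 GB card, consistent with the claim above. PEFT methods or gradient checkpointing would lower this further.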
I explain all the code in the following sections. I also implemented a notebook that you can use to fine-tune Florence-2 on your own data: