Florence-2 is a family of small vision-language models (VLMs) with impressive capabilities. These models can accurately perform a wide range of tasks such as image captioning, object detection, OCR, and more. In a previous article, we saw how to run Florence-2:
Florence-2 can also be fine-tuned to tackle specific domains and tasks more accurately. For instance, we can fine-tune it into a multimodal chat model, i.e., a model we can use to discuss an image, similarly to what we can do with GPT-4.
In this article, we will see how to turn Florence-2 into a multimodal chat model. Since Florence-2 is a small model, fine-tuning is quick and doesn't consume a huge amount of memory. Even though we won't use PEFT methods, fine-tuning Florence-2 large can be done on consumer hardware, for instance with a 16 GB GPU.
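To see why a 16 GB GPU is plausible, here is a back-of-the-envelope memory estimate. It assumes Florence-2 large has roughly 0.77B parameters (per its model card) and uses the common rule of thumb for mixed-precision full fine-tuning with AdamW: about 16 bytes per parameter for weights, gradients, and optimizer states, before counting activations. The function name and exact byte budget below are illustrative assumptions, not part of the original article.

```python
# Rough GPU memory budget for mixed-precision full fine-tuning with AdamW:
#   fp16 weights (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + Adam moments m and v (4 B + 4 B)
#   = ~16 bytes per parameter (activations not included).

def finetune_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Approximate GPU memory (in GB) for model + gradients + optimizer states."""
    return n_params * bytes_per_param / 1024**3

if __name__ == "__main__":
    # Florence-2 large: ~0.77B parameters (per the model card)
    estimate = finetune_memory_gb(0.77e9)
    print(f"~{estimate:.1f} GB for weights, gradients, and optimizer states")
```

This lands around 11-12 GB, which leaves some headroom for activations on a 16 GB card, consistent with the claim above. PEFT methods or gradient checkpointing would lower this further.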
I explain all the code in the following sections. I also implemented a notebook that you can use to fine-tune Florence-2 on your own data: