The Kaitchup – AI on a Budget

Fine-tune a Multimodal Chat Model with Florence-2 on Your Computer

Your own local GPT-4

Benjamin Marie
Jul 08, 2024 ∙ Paid

Illustration by the author

Florence-2 models are small vision-language models (VLMs) with impressive capabilities. They can accurately perform a wide range of tasks, such as image captioning, object detection, OCR, and more. In a previous article, we saw how to run Florence-2:

Florence-2: Run Multitask Vision-language Models on Your Computer
Benjamin Marie · July 1, 2024

Florence-2 can also be fine-tuned to tackle specific domains and tasks more accurately. For instance, we can fine-tune it into a multimodal chat model, i.e., a model we can discuss an image with, similar to what we can do with GPT-4.
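To make the goal concrete, here is a minimal sketch of what chat-style inference with a fine-tuned Florence-2 checkpoint could look like with Hugging Face Transformers. The `<VQA>` task token and the checkpoint name are assumptions for illustration; Florence-2 checkpoints are loaded with `trust_remote_code=True`.

```python
# Sketch: chat-style inference with a (fine-tuned) Florence-2 checkpoint.
# The <VQA> task token and checkpoint path are assumptions; adapt them
# to whatever your fine-tuned model uses.

def build_prompt(question: str, task_token: str = "<VQA>") -> str:
    """Prepend the task token to the user question, Florence-2 style."""
    return f"{task_token}{question}"

def chat(model, processor, image, question: str) -> str:
    """Ask a question about an image and decode the model's answer."""
    import torch  # deferred import so the helper above stays dependency-free

    inputs = processor(text=build_prompt(question), images=image,
                       return_tensors="pt")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=256,
        )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Example usage (not run here; the first call downloads the weights):
# from transformers import AutoModelForCausalLM, AutoProcessor
# from PIL import Image
# model = AutoModelForCausalLM.from_pretrained(
#     "microsoft/Florence-2-large", trust_remote_code=True)
# processor = AutoProcessor.from_pretrained(
#     "microsoft/Florence-2-large", trust_remote_code=True)
# print(chat(model, processor, Image.open("photo.jpg"), "What is in this image?"))
```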

In this article, we will see how to turn Florence-2 into a multimodal chat model. Since Florence-2 is a small model, fine-tuning is quick and doesn’t consume much memory. Even though we won’t use PEFT methods, Florence-2 large can be fully fine-tuned on consumer hardware, for instance on a 16 GB GPU.
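As a rough sketch of what full fine-tuning (no PEFT adapters) can look like, the loop below loads the model in bfloat16, masks padding tokens out of the loss, and runs plain AdamW. The checkpoint name, the `<VQA>` task token, the dataset fields, and the hyperparameters are all assumptions; only the padding-mask helper is generic.

```python
# Sketch of full fine-tuning for Florence-2 (no PEFT adapters).
# Checkpoint name, <VQA> task token, dataset fields, and hyperparameters
# are assumptions; adapt them to your own data.

def mask_pad_tokens(token_ids, pad_id):
    """Replace padding ids with -100 so the cross-entropy loss ignores them
    (the standard Hugging Face labels convention)."""
    return [-100 if t == pad_id else t for t in token_ids]

def train(dataset, epochs=1, lr=1e-6):
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-large",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # bf16 weights help fit a 16 GB GPU
    ).to("cuda")
    processor = AutoProcessor.from_pretrained(
        "microsoft/Florence-2-large", trust_remote_code=True
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for example in dataset:  # assumed fields: "image", "question", "answer"
            inputs = processor(
                text="<VQA>" + example["question"],
                images=example["image"],
                return_tensors="pt",
            ).to("cuda")
            target_ids = processor.tokenizer(example["answer"]).input_ids
            labels = torch.tensor(
                [mask_pad_tokens(target_ids, processor.tokenizer.pad_token_id)]
            ).to("cuda")
            loss = model(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                labels=labels,
            ).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

A per-example step like this keeps peak memory low; batching with a collator and gradient accumulation would be the natural next refinement.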

I explain all the code in the following sections. I also implemented a notebook that you can use to fine-tune Florence-2 on your own data:

Get the notebook (#85)
