In December 2024, Microsoft released Phi-4, a powerful 14B parameter model. While impressive, models of this size are difficult to run or fine-tune on consumer hardware.
To address this, Microsoft introduced Phi-4 Mini, a smaller 3.8B parameter version. It can be fine-tuned on a 24GB GPU and runs smoothly on a 12GB GPU. Alongside Phi-4 Mini, Microsoft also released Phi-4 Multimodal, which can process audio, image, and text inputs, a rare capability among open models.
Microsoft achieved this by integrating and fine-tuning multimodal LoRA adapters on top of Phi-4 Mini, which add only 1.73B parameters.
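To make the mechanism concrete, here is a minimal sketch of how LoRA adapters are typically attached to a base model with the PEFT library. This is not Microsoft's actual multimodal adapter recipe; the model id `microsoft/Phi-4-mini-instruct`, the rank, and the target module names are assumptions for illustration only.

```python
# Minimal sketch: attaching LoRA adapters to Phi-4 Mini with PEFT.
# NOT Microsoft's multimodal adapter configuration, just an illustration of
# how LoRA adds a small number of trainable parameters on top of a frozen
# base model. Model id and target module names are assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",        # assumed model id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank (illustrative value)
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],  # assumed projection layer names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # shows how few parameters the adapters add
```

The key point is that the base model's weights stay frozen; only the low-rank adapter matrices are trained, which is why the multimodal capabilities come at a comparatively small parameter cost.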
In this article, we'll explore how Microsoft developed this multimodal LoRA model. We'll also evaluate Phi-4 Mini's performance and examine its efficiency after 8-bit, 4-bit, 3-bit, and 2-bit quantization. The model can be quantized to 4-bit with minimal accuracy loss, shrinking it to roughly 3 GB!
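As a preview of the quantization step, the sketch below loads the model in 4-bit with bitsandbytes through Transformers. The model id and the NF4 settings are assumptions for illustration; the benchmarks later in the article may use a different configuration.

```python
# Minimal sketch: loading Phi-4 Mini with 4-bit (NF4) quantization via
# bitsandbytes. Model id and exact settings are assumptions; the article's
# benchmarks may rely on a different quantization setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",        # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")

# Rough arithmetic: 3.8B parameters at ~0.5 byte each is about 1.9 GB of
# weights; layers kept in higher precision and framework overhead push the
# total toward the ~3 GB figure mentioned above.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```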
If you want to fine-tune the model, you can use the code shared in this article and in the accompanying notebook: