VLMs vs. LLMs: Which is Better for Text Generation?

Should you still use an LLM in 2025?

Benjamin Marie
Mar 20, 2025

(Image generated with ChatGPT)

Over the past six months, we've seen a steady stream of new vision-language models (VLMs). Models like Qwen2-VL, Llama 3.2 Vision, Molmo, Pixtral, Qwen2.5-VL, Phi-4 Multimodal, and Aya Expanse have all demonstrated strong performance across multimodal tasks requiring vision capabilities, such as OCR, visual understanding, image captioning, and object recognition.

A common trait among all these models is that they are built on top of large language models (LLMs). For example, Llama 3.2 Vision is derived from Llama 3.1, Qwen2.5-VL is based on Qwen2.5, and Phi-4 Multimodal originates from Phi-4 Mini. These VLMs gain their multimodal capabilities through a post-training phase, where their base LLMs are either further refined using multimodal data or enhanced with trained adapters that introduce multimodal abilities.
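
As a quick sanity check of this relationship, you can compare the configuration of a VLM's language backbone with that of its base LLM. The sketch below does this for Qwen2.5-VL 7B and Qwen2.5 7B with Hugging Face Transformers; the model IDs and the attributes printed are my own illustrative choices, not taken from the article.

```python
# Sketch: compare a VLM's text backbone with its base LLM (assumed model IDs).
from transformers import AutoConfig

vlm_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
llm_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Depending on the Transformers version, the VLM's language-model settings are
# either nested under `text_config` or stored at the top level of the config.
vlm_text_cfg = getattr(vlm_cfg, "text_config", vlm_cfg)

for attr in ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size"):
    print(attr, getattr(vlm_text_cfg, attr, None), "vs", getattr(llm_cfg, attr, None))
```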


During this post-training process, the model is exposed to not just multimodal information but also additional textual data. This means that its language understanding capabilities continue to be trained alongside its vision capabilities, ensuring that it excels in tasks like caption generation and OCR, which require both vision and strong text generation skills.

This raises an important question: does multimodal post-training improve or degrade a model's performance on standard language generation tasks, such as creative writing, language understanding, and instruction-following?

In this article, we address this question using Qwen2.5 as a case study. We will evaluate Qwen2.5-VL on common benchmarks used for LLMs and compare its performance against its base LLM. We will see that multimodal post-training can actually enhance a model’s accuracy on language tasks, especially for larger models.
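
One possible way to run such an evaluation is with lm-evaluation-harness, using its vLLM backend so the same engine serves both the VLM and the base LLM for text-only tasks. The model ID, task list, and few-shot setting below are illustrative assumptions; the exact setup used for the article is in the notebook.

```python
# Sketch: evaluate a model on standard LLM benchmarks with lm-evaluation-harness
# via its vLLM backend. Model ID, tasks, and few-shot count are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Qwen/Qwen2.5-VL-7B-Instruct",
    tasks=["arc_challenge", "mmlu"],
    num_fewshot=5,
)
print(results["results"])
```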

Since Qwen2.5-VL outperforms Qwen2.5 in these tasks, we will also explore how to leverage it for pure language generation tasks that do not require vision inputs. Specifically, I’ll show you how to use Qwen2.5-VL with vLLM for language inference.
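
As a preview, here is a minimal sketch of text-only generation with Qwen2.5-VL through vLLM's offline API; no image inputs are passed, so the model behaves like a regular instruction-tuned LLM. The model ID, prompt, and sampling settings are placeholders I chose, not the notebook's exact configuration.

```python
# Sketch: text-only generation with a VLM served by vLLM (no image inputs).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# llm.chat applies the model's chat template, which is what you want for
# instruction-following prompts.
messages = [{"role": "user", "content": "Write a short story about a lighthouse keeper."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```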

I’ve created a notebook that shows how I evaluated the models and demonstrates how to use a VLM for language generation tasks with vLLM. You can find it here:

Get the notebook (#152)

Finally, we will also examine the trade-offs in inference cost, comparing the speed and memory consumption of a VLM and an LLM for language generation.
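
If you want to reproduce such a comparison yourself, a rough approach is to time text-only generation and record GPU memory usage for each model, running each engine in a separate process so one run does not affect the other. The helper below is a sketch under those assumptions, not the benchmarking code used for the article; the memory figure is device-wide and includes vLLM's pre-allocated KV cache.

```python
# Sketch: rough throughput and GPU memory measurement for text-only generation.
# Run each model in its own process; memory is measured device-wide.
import time
import torch
from vllm import LLM, SamplingParams

def benchmark(model_id: str, prompts: list[str], max_tokens: int = 256):
    llm = LLM(model=model_id, max_model_len=4096)
    sampling = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    free, total = torch.cuda.mem_get_info()
    used_gb = (total - free) / 1e9
    return generated / elapsed, used_gb

# Example with an assumed model ID and prompt:
# print(benchmark("Qwen/Qwen2.5-VL-7B-Instruct", ["Explain KV caching in one paragraph."]))
```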
