VLMs vs. LLMs: Which is Better for Text Generation?

Should you still use an LLM in 2025?

Benjamin Marie
Mar 20, 2025

(Image generated with ChatGPT)

Over the past six months, we've seen a steady stream of new vision-language models (VLMs). Models like Qwen2-VL, Llama 3.2 Vision, Molmo, Pixtral, Qwen2.5-VL, Phi-4 Multimodal, and Aya Expanse have all demonstrated strong performance across multimodal tasks requiring vision capabilities, such as OCR, visual understanding, image captioning, and object recognition.

A common trait among all these models is that they are built on top of large language models (LLMs). For example, Llama 3.2 Vision is derived from Llama 3.1, Qwen2.5-VL is based on Qwen2.5, and Phi-4 Multimodal originates from Phi-4 Mini. These VLMs gain their multimodal capabilities through a post-training phase, where their base LLMs are either further refined using multimodal data or enhanced with trained adapters that introduce multimodal abilities.
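
As a quick sanity check of this relationship, you can compare the configuration of a VLM's language backbone with that of its base LLM. The sketch below does this for Qwen2.5-VL 7B and Qwen2.5 7B with Hugging Face Transformers; the model IDs and the attributes printed are my own illustrative choices, not taken from the article.

```python
# Sketch: compare a VLM's text backbone with its base LLM (assumed model IDs).
from transformers import AutoConfig

vlm_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
llm_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Depending on the Transformers version, the VLM's language-model settings are
# either nested under `text_config` or stored at the top level of the config.
vlm_text_cfg = getattr(vlm_cfg, "text_config", vlm_cfg)

for attr in ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size"):
    print(attr, getattr(vlm_text_cfg, attr, None), "vs", getattr(llm_cfg, attr, None))
```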


During this post-training process, the model is exposed to not just multimodal information but also additional textual data. This means that its language understanding capabilities continue to be trained alongside its vision capabilities, ensuring that it excels in tasks like caption generation and OCR, which require both vision and strong text generation skills.

This raises an important question: does multimodal post-training improve or degrade a model's performance on standard language generation tasks, such as creative writing, language understanding, and instruction-following?

In this article, we address this question using Qwen2.5 as a case study. We will evaluate Qwen2.5-VL on common benchmarks used for LLMs and compare its performance against its base LLM. We will see that multimodal post-training can actually enhance a model’s accuracy on language tasks, especially for larger models.
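
One possible way to run such an evaluation is with lm-evaluation-harness, using its vLLM backend so the same engine serves both the VLM and the base LLM for text-only tasks. The model ID, task list, and few-shot setting below are illustrative assumptions; the exact setup used for the article is in the notebook.

```python
# Sketch: evaluate a model on standard LLM benchmarks with lm-evaluation-harness
# via its vLLM backend. Model ID, tasks, and few-shot count are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Qwen/Qwen2.5-VL-7B-Instruct",
    tasks=["arc_challenge", "mmlu"],
    num_fewshot=5,
)
print(results["results"])
```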

Since Qwen2.5-VL outperforms Qwen2.5 in these tasks, we will also explore how to leverage it for pure language generation tasks that do not require vision inputs. Specifically, I’ll show you how to use Qwen2.5-VL with vLLM for language inference.
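
As a preview, here is a minimal sketch of text-only generation with Qwen2.5-VL through vLLM's offline API; no image inputs are passed, so the model behaves like a regular instruction-tuned LLM. The model ID, prompt, and sampling settings are placeholders I chose, not the notebook's exact configuration.

```python
# Sketch: text-only generation with a VLM served by vLLM (no image inputs).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# llm.chat applies the model's chat template, which is what you want for
# instruction-following prompts.
messages = [{"role": "user", "content": "Write a short story about a lighthouse keeper."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```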

I’ve created a notebook that shows how I evaluated the models and demonstrates how to use a VLM for language generation tasks with vLLM. You can find it here:

Get the notebook (#152)

Finally, we will also examine the trade-offs in inference cost, comparing the speed and memory consumption of a VLM and an LLM for language generation.
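
If you want to reproduce such a comparison yourself, a rough approach is to time text-only generation and record GPU memory usage for each model, running each engine in a separate process so one run does not affect the other. The helper below is a sketch under those assumptions, not the benchmarking code used for the article; the memory figure is device-wide and includes vLLM's pre-allocated KV cache.

```python
# Sketch: rough throughput and GPU memory measurement for text-only generation.
# Run each model in its own process; memory is measured device-wide.
import time
import torch
from vllm import LLM, SamplingParams

def benchmark(model_id: str, prompts: list[str], max_tokens: int = 256):
    llm = LLM(model=model_id, max_model_len=4096)
    sampling = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    free, total = torch.cuda.mem_get_info()
    used_gb = (total - free) / 1e9
    return generated / elapsed, used_gb

# Example with an assumed model ID and prompt:
# print(benchmark("Qwen/Qwen2.5-VL-7B-Instruct", ["Explain KV caching in one paragraph."]))
```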
