Fine-Tuning Qwen3: Base vs. Reasoning Models
Is it reasonable to fine-tune a "reasoning" model?
Qwen3 LLMs are both very capable and easy to run. Some of the models are small enough that you can fine-tune them, or run inference, on a single GPU.
The Qwen team released two types of models: Qwen3 and Qwen3-Base. The naming is a bit different from what you might be used to. For example, with Llama models, the name without any suffix (like Llama 3.1 8B) refers to the base, pre-trained version, while Llama 3.1 8B Instruct is the post-trained one. For Qwen3, it's the opposite:
- Qwen3 is the post-trained model (chat/instruction-tuned, with reasoning).
- Qwen3-Base is the raw pre-trained model, without alignment or instruction tuning.
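To make the naming concrete, here is a minimal sketch of loading each variant with Hugging Face transformers. The repo IDs below are the Hub names for the 14B variants discussed in this post; everything else is boilerplate loading code, not the notebook's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Post-trained model: chat template, instruction following, reasoning mode.
chat_id = "Qwen/Qwen3-14B"
# Raw pre-trained model: plain next-token prediction, no alignment.
base_id = "Qwen/Qwen3-14B-Base"

tokenizer = AutoTokenizer.from_pretrained(chat_id)
model = AutoModelForCausalLM.from_pretrained(
    chat_id,
    torch_dtype="auto",   # pick the checkpoint's native precision (bf16 here)
    device_map="auto",    # spread layers across available GPUs/CPU
)
```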
So, if you want to fine-tune one of these models on your own data, which should you choose?
In a previous article, I explained why fine-tuning a post-trained (instruction-tuned) model isn't always a good idea, and why the base model is usually a better place to start. That argument holds up even more when working with models designed for reasoning.
In this post, I fine-tune both Qwen3-14B and Qwen3-14B-Base with Unsloth on a single GPU, and then compare how the resulting models behave at inference time with reasoning turned on and off. I’ll also show how much GPU memory you need to get it working.
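To give a feel for the setup before opening the notebook, here is a rough sketch. It assumes Unsloth's FastLanguageModel API and the enable_thinking flag that Qwen3's chat template exposes; the exact hyperparameters and training loop are in the notebook.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-14B",  # swap in "Qwen/Qwen3-14B-Base" for the base run
    max_seq_length=2048,
    load_in_4bit=True,            # 4-bit loading keeps the 14B model on one GPU
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# At inference time, Qwen3's chat template lets you toggle reasoning mode:
messages = [{"role": "user", "content": "What is 17 * 23?"}]
prompt_reasoning_on = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
prompt_reasoning_off = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

Note that the enable_thinking switch is a feature of the post-trained Qwen3's chat template; Qwen3-Base ships without a chat template, so there is no reasoning mode to toggle until you impose a format of your own during fine-tuning.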
The code and setup are in this notebook: