The Kaitchup – AI on a Budget

Unsloth Multi-GPU Training: device_map “balanced” vs DDP (torchrun / Accelerate)

Covers Unsloth multi-GPU support and multi-GPU training: when to use device_map="balanced" vs DDP, plus common Unsloth multi-GPU errors and fixes.

Benjamin Marie
Aug 11, 2025

For training LLMs efficiently, Unsloth is now the go-to library: it is much faster than most other fine-tuning frameworks and is also very memory-efficient, especially when training with long sequences. Unsloth was designed for single-GPU training. While its authors are still working on clean, optimized multi-GPU support, multi-GPU training is already possible, even though it is clearly not optimized yet.

Fine-tuning LLMs across multiple GPUs sounds like it should be straightforward, but if you've tried combining Unsloth with Hugging Face's Accelerate or PyTorch's torchrun, you might have hit a wall.

Let’s clear up the confusion and lay out exactly how to run Unsloth across multiple GPUs using either model parallelism (splitting the model across GPUs) or data parallelism (replicating the model on each GPU). Both are possible.

The following notebook shows a script and instructions that you can use to run Unsloth with multiple GPUs:

Get the notebook (#177)

Whether you are looking for Unsloth multi-GPU training or Unsloth multi-GPU support, the key is to choose one approach: device_map="balanced" (model parallelism) or DDP via torchrun/accelerate.

Related article:

Fine-Tuning RNJ-1 with Unsloth: 4x Faster on a Single GPU

Unsloth device_map ‘balanced’ vs DDP: why multi-GPU breaks

Unsloth’s documentation (as of August 8, 2025) is slightly confusing. It currently suggests:

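from unsloth import FastLanguageModel

# Load a 4-bit quantized model and spread its layers across all visible GPUs
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = "balanced",  # model parallelism within a single process
)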

However, training fails if you launch that script with either of these commands:

accelerate launch train.py
# or
torchrun --nproc_per_node 4 train.py

Why? device_map="balanced" triggers model parallelism (multi-GPU sharding) inside a single Python process, whereas accelerate launch and torchrun with --nproc_per_node greater than 1 start distributed data parallelism (DDP), which runs one process per GPU. The two strategies conflict: each DDP process expects a full copy of the model on its own GPU, but the model has already been spread across all of them.

Combining the two raises this error:

ValueError: You can't train a model that has been loaded with `device_map='auto'`
in any distributed mode. Please rerun your script specifying `--num_processes=1`.
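
One way to avoid this conflict is to check how the script was launched before choosing a device_map. The sketch below is only an illustration, not Unsloth's API: pick_device_map is a hypothetical helper, and pinning the whole model to one GPU per process with {"": local_rank} is a pattern commonly used with Hugging Face Transformers that is assumed, not confirmed, to carry over to Unsloth. It relies only on the WORLD_SIZE and LOCAL_RANK environment variables that torchrun sets for each process.

```python
import os


def pick_device_map(env=None):
    """Hypothetical helper: choose a device_map based on the launch mode."""
    env = os.environ if env is None else env
    # torchrun exports WORLD_SIZE (total processes) and LOCAL_RANK (GPU index
    # of this process on the node). A plain `python train.py` sets neither.
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1:
        # Several processes: data parallelism. Each process keeps a full
        # copy of the model on its own GPU (assumed pattern, see lead-in).
        local_rank = int(env.get("LOCAL_RANK", "0"))
        return {"": local_rank}
    # Single process: shard the model across all visible GPUs.
    return "balanced"


# Usage (assumption): pass the result as device_map when loading the model.
device_map = pick_device_map()
```

Run with python train.py, this returns "balanced"; run under torchrun --nproc_per_node 4, each process gets its own single-GPU mapping instead.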

A different message, such as “Unsloth cannot find any torch accelerator? you need a gpu.”, means your environment isn’t exposing CUDA to the process, which is common in containers and on remote hosts.
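
A quick way to investigate is to print what the process can actually see. This is a minimal diagnostic sketch, not part of Unsloth: cuda_visibility_report is a hypothetical helper that uses only torch.cuda.is_available(), torch.cuda.device_count(), and the CUDA_VISIBLE_DEVICES environment variable.

```python
import os


def cuda_visibility_report(env=None):
    """Hypothetical helper: summarize what this process can see of CUDA."""
    env = os.environ if env is None else env
    # None means the variable is unset (all GPUs visible by default);
    # an empty string hides every GPU from the process.
    report = {"CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    # These two calls always reflect the real process environment,
    # regardless of the `env` mapping passed in.
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


print(cuda_visibility_report())
```

If cuda_available is False or gpu_count is 0 inside a container, the GPUs were most likely not passed through to it, and the fix lies in how the container or job is started, not in the training script.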

The Right Way: Choose ONE Parallel Strategy