How to Run Unsloth on Multi-GPU Setups: Data-Parallel or Model-Parallel

Step-by-step fixes for running Unsloth across GPUs

Aug 11, 2025

For training LLMs efficiently, Unsloth is now the go-to library: it is much faster than most other fine-tuning frameworks and is also very memory-efficient, especially when training with long sequences. Unsloth was designed for single-GPU training. The authors of the framework are still working on clean, optimized multi-GPU support, but multi-GPU training can already be enabled, even though it is clearly not optimized yet.

Fine-tuning LLMs across multiple GPUs sounds like it should be straightforward, but if you've tried combining Unsloth with Hugging Face's Accelerate or PyTorch's torchrun, you might have hit a wall.

Let’s clear up the confusion and lay out exactly how to run Unsloth across multiple GPUs using either model parallelism (splitting the model across GPUs) or data parallelism (replicating the model on each GPU). Both are possible.

The following notebook shows a script and instructions that you can use to run Unsloth with multiple GPUs:

Get the notebook (#177)

The Trap: Mixing `device_map="balanced"` with DDP

Unsloth’s documentation (as of August 8, 2025) is slightly confusing. It currently suggests:

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map    = "balanced",
)

but if you launch training with:

accelerate launch train.py
# or
torchrun --nproc_per_node 4 train.py

it doesn't work.

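Why? Because `device_map="balanced"` triggers model parallelism inside a single Python process, while `accelerate launch` and `torchrun` with `--nproc_per_node` greater than 1 enable distributed data parallelism (DDP), which runs one process per GPU. The two modes conflict: every DDP process would try to spread its own copy of the model across all the GPUs.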

You’ll see this error if you try:

ValueError: You can't train a model that has been loaded with `device_map='auto'`
in any distributed mode. Please rerun your script specifying `--num_processes=1`.
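
One way to avoid mixing the two modes is to pick the device map from how the script was launched. The sketch below is only an illustration under stated assumptions: `pick_device_map` is a hypothetical helper, not part of Unsloth, and it relies on the launcher exporting the `WORLD_SIZE` and `LOCAL_RANK` environment variables, as `torchrun` does. The `{"": local_rank}` form is the Hugging Face convention for placing the whole model on a single device; whether Unsloth handles it cleanly in your version is something to verify.

```python
import os


def pick_device_map(env=None):
    """Hypothetical helper: choose a device_map that matches the launch mode."""
    env = os.environ if env is None else env
    # torchrun exports WORLD_SIZE and LOCAL_RANK for every process it starts.
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1:
        # Data parallelism: one process per GPU, each holding a full replica
        # of the model on its own device.
        return {"": int(env.get("LOCAL_RANK", "0"))}
    # Single process: model parallelism, layers spread across visible GPUs.
    return "balanced"
```

The result would then be passed as `device_map = pick_device_map()` in the `from_pretrained` call, so the same `train.py` can be started either with plain `python` or with a distributed launcher.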
