How to Run Unsloth on Multi-GPU Setups: Data-Parallel or Model-Parallel
Step-by-step fixes for running Unsloth across GPUs
Unsloth has become a go-to library for fine-tuning LLMs efficiently: it is much faster than most other fine-tuning frameworks and very memory-efficient, especially when training on long sequences. However, Unsloth was designed for single-GPU training. While the authors of the framework are still working on clean, optimized support for multiple GPUs, it is already possible to enable multi-GPU training, even though it is clearly not optimized yet.
Fine-tuning LLMs across multiple GPUs sounds like it should be straightforward, but if you've tried combining Unsloth with Hugging Face's Accelerate or PyTorch's torchrun, you might have hit a wall.
Let’s clear the confusion and lay out exactly how to run Unsloth across multiple GPUs using either model-parallelism (splitting the model across GPUs) or data-parallelism (replicating the model on each GPU). Both are possible.
The following notebook shows a script and instructions that you can use to run Unsloth with multiple GPUs:
The Trap: Mixing device_map="balanced" with DDP
Unsloth’s documentation (as of August 8, 2025) is slightly confusing. It currently suggests:
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = "balanced",
)
but if you launch training with:
accelerate launch train.py
# or
torchrun --nproc_per_node 4 train.py
it doesn't work.
Why? Because device_map="balanced" triggers model-parallelism inside a single Python process, while accelerate launch and torchrun with --nproc_per_node > 1 enable distributed data parallelism (DDP), which runs one process per GPU.
You’ll see this error if you try:
ValueError: You can't train a model that has been loaded with `device_map='auto'`
in any distributed mode. Please rerun your script specifying `--num_processes=1`.
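The way out is to pick one mode and make the launch command match the device_map. Below is a minimal, illustrative sketch of both options. It is not the exact script from the notebook above: the file name sketch_multi_gpu.py, the LOCAL_RANK/WORLD_SIZE handling, and the per-rank device_map are assumptions that may need adjusting for your Unsloth version.

# sketch_multi_gpu.py -- illustrative sketch, not the notebook's exact script
import os
import torch
from unsloth import FastLanguageModel

# torchrun / accelerate launch set these variables; a plain "python" run does not.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

if world_size > 1:
    # Data-parallelism (DDP): one process per GPU, each holding a full model copy.
    # Launch with, e.g.:  torchrun --nproc_per_node 4 sketch_multi_gpu.py
    # Only sensible if a full (quantized) copy of the model fits on a single GPU.
    torch.cuda.set_device(local_rank)
    device_map = {"": local_rank}  # pin this process's copy to its own GPU
else:
    # Model-parallelism: a single process, layers spread across all visible GPUs.
    # Launch with a plain:  python sketch_multi_gpu.py
    device_map = "balanced"

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = device_map,
)

Treat this as a starting point rather than a definitive recipe: the rest of the training script (trainer setup, dataset, launch flags) is covered in the notebook linked above.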