Unsloth Multi-GPU Training: device_map “balanced” vs DDP (torchrun / Accelerate)
Covers Unsloth multi-GPU support and multi-GPU training: when to use device_map="balanced" vs DDP, plus common Unsloth multi-GPU errors and fixes.
For training LLMs efficiently, Unsloth is now a go-to library: it is much faster than most other fine-tuning frameworks and is also very memory-efficient, especially when training with long sequences. Unsloth was designed for single-GPU training. While its authors are still working on clean, optimized multi-GPU support, multi-GPU training is already possible, even though it is not optimized yet.
Fine-tuning LLMs across multiple GPUs sounds like it should be straightforward, but if you've tried combining Unsloth with Hugging Face's Accelerate or PyTorch's torchrun, you might have hit a wall.
Let’s clear up the confusion and lay out exactly how to run Unsloth across multiple GPUs using either model parallelism (splitting the model across GPUs) or data parallelism (replicating the model on each GPU). Both are possible, but not at the same time.
The following notebook shows a script and instructions that you can use to run Unsloth with multiple GPUs:
If you’re searching for Unsloth multi-GPU training, Unsloth multiple GPUs, or Unsloth multi-GPU support: the key is choosing one approach, device_map="balanced" (model parallelism) or DDP via torchrun/accelerate.
Related article:
Unsloth device_map ‘balanced’ vs DDP: why multi-GPU breaks
Unsloth’s documentation (as of August 8, 2025) is slightly confusing. It currently suggests:
from unsloth import FastLanguageModel
# Load a 4-bit model and spread its layers across all visible GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct",
    load_in_4bit = True,
    device_map = "balanced",
)
but if you launch training with:
accelerate launch train.py
# or
torchrun --nproc_per_node 4 train.py
training fails. Why? Because device_map="balanced" triggers model parallelism (the model's layers are sharded across GPUs) inside a single Python process, while accelerate launch and torchrun with --nproc_per_node > 1 enable distributed data parallelism (DDP), which runs one process per GPU, each holding a full copy of the model. The two modes cannot be combined.
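One way to keep a single train.py usable under both launch styles is to pick the device map from the environment. The sketch below is illustrative, not Unsloth's own method: the helper names detect_launch_mode and choose_device_map are hypothetical, and it assumes the launcher exports WORLD_SIZE and LOCAL_RANK (torchrun does; accelerate launch is assumed to do the same in multi-GPU mode). It also assumes the loader accepts the Hugging Face convention device_map={"": local_rank} for placing the whole model on one GPU.

```python
import os


def detect_launch_mode(env=None):
    """Return "ddp" when a launcher started several processes, else "single"."""
    env = os.environ if env is None else env
    # Assumption: the launcher exports WORLD_SIZE for every worker process.
    world_size = int(env.get("WORLD_SIZE", "1"))
    return "ddp" if world_size > 1 else "single"


def choose_device_map(env=None):
    """Pick a device_map value that matches how the script was launched."""
    env = os.environ if env is None else env
    if detect_launch_mode(env) == "ddp":
        # One process per GPU: keep the whole model on this process's GPU.
        local_rank = int(env.get("LOCAL_RANK", "0"))
        return {"": local_rank}
    # Single process: let the loader shard layers across all visible GPUs.
    return "balanced"


if __name__ == "__main__":
    print(detect_launch_mode(), choose_device_map())
```

The returned value would then be passed as device_map to the model loader. Launched with plain python train.py the helper returns "balanced"; launched with torchrun --nproc_per_node 4 train.py each worker gets its own GPU index instead.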
Mixing the two produces this error:
ValueError: You can't train a model that has been loaded with `device_map='auto'`
in any distributed mode. Please rerun your script specifying `--num_processes=1`.
If you instead see something like “Unsloth cannot find any torch accelerator? you need a gpu.”, your environment isn’t exposing CUDA to the process (common in containers and on remote hosts).
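For the “cannot find any torch accelerator” case, a quick check of what the process can actually see often narrows things down. The following is a minimal diagnostic sketch; the function name diagnose_gpu_visibility is hypothetical. It relies only on the CUDA_VISIBLE_DEVICES environment variable and, when PyTorch is installed, on torch.cuda.is_available() and torch.cuda.device_count().

```python
import os


def diagnose_gpu_visibility(env=None):
    """Return human-readable findings about GPU visibility for this process."""
    env = os.environ if env is None else env
    findings = []

    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        findings.append("CUDA_VISIBLE_DEVICES is unset: no GPU is filtered out by this variable.")
    elif visible.strip() in ("", "-1"):
        # An empty value or -1 hides every GPU from CUDA applications.
        findings.append("CUDA_VISIBLE_DEVICES hides every GPU from this process.")
    else:
        findings.append(f"CUDA_VISIBLE_DEVICES restricts this process to: {visible}")

    try:
        import torch
    except ImportError:
        findings.append("PyTorch is not importable in this environment.")
    else:
        findings.append(
            f"torch.cuda.is_available() = {torch.cuda.is_available()}, "
            f"device_count = {torch.cuda.device_count()}"
        )
    return findings


if __name__ == "__main__":
    for line in diagnose_gpu_visibility():
        print(line)
```

If the variable is fine but torch.cuda.is_available() is still False, the likely causes are outside Python: for example, a container started without GPU access or a CPU-only PyTorch build.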



