DenseMixer: Smarter MoE Routing That Doesn’t Break LoRA and QLoRA
Better MoE training for a slightly higher cost
Mixture-of-Experts (MoE) models scale parameter count while keeping inference sparse and cheap, but their training is brittle because routing relies on a hard, non-differentiable Top-K selection. Conventional practice backpropagates as if the Top-K operator were differentiable and masks gradients to the active set, so only the currently selected experts contribute to the router's update, which makes that update biased and high-variance.
Heuristics such as freezing the router or updating a handpicked subset of experts can stabilize optimization, but they constrain adaptation and often leave capacity underused, especially when the active set changes across data domains or during post-training.
DenseMixer addresses this by keeping hard routing in the forward pass while replacing the backward rule with a straight-through estimator, so the router receives gradient signal from all experts rather than only the active ones.
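To make the idea concrete, here is a minimal PyTorch sketch of a Top-K MoE layer with straight-through routing. It is illustrative only, not DenseMixer's implementation: the class and variable names are mine, Top-K weight renormalization is omitted, and a real implementation would also skip the wasted backward compute through experts whose forward weight is zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Toy MoE layer: hard Top-K routing in forward, dense gradients in backward."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)               # dense routing probs (tokens, E)
        _, topk_idx = probs.topk(self.top_k, dim=-1)
        hard_mask = torch.zeros_like(probs).scatter(-1, topk_idx, 1.0)
        sparse_w = probs * hard_mask                             # Top-K weights, zeros elsewhere

        # Straight-through estimator: the forward value is the sparse Top-K
        # weighting, but the backward pass sees the dense probabilities, so the
        # router gets gradient signal from every expert.
        weights = sparse_w.detach() + probs - probs.detach()     # (tokens, E)

        # Forward-only overhead: every expert output is computed so it can show
        # up in the router's gradient. Inactive experts still get no parameter
        # gradient because their forward weight is zero.
        expert_outs = torch.stack([expert(x) for expert in self.experts], dim=1)  # (tokens, E, d_model)
        return torch.einsum("te,ted->td", weights, expert_outs)


# Quick check: gradients reach the router through all experts' outputs.
layer = SimpleMoELayer(d_model=64, d_ff=128, num_experts=8, top_k=2)
out = layer(torch.randn(4, 64))
out.sum().backward()
print(layer.router.weight.grad.abs().sum())
```

The key line is `sparse_w.detach() + probs - probs.detach()`: the forward value is the sparse Top-K weighting, while the gradient flows through the dense probabilities.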
The method integrates with common training libraries and PEFT, and its overhead is mostly in the forward pass (e.g., ~1.5x FLOPs per layer on a 30B MoE). In post-training, it consistently improves downstream quality across scales, architectures, and datasets, with no change to the model architecture or training code beyond enabling the plugin.
This article covers the method end-to-end: we begin with the theory and then move to practice. DenseMixer is simple to enable, and it promises better fine-tuned MoE models. We will fine-tune an adapter for a Qwen3 MoE model and compare learning curves and computational cost with and without DenseMixer.
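For reference, the fine-tuning setup is a standard Transformers + PEFT LoRA configuration. The sketch below is a hedged outline: the model identifier, target modules, and hyperparameters are chosen for illustration, not copied from the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed Qwen3 MoE checkpoint; swap in the model you actually fine-tune.
model_name = "Qwen/Qwen3-30B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Standard LoRA on the attention projections. DenseMixer only changes how the
# router's gradient is computed, so the PEFT configuration itself is unchanged;
# the plugin is enabled separately (see the DenseMixer documentation).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Because DenseMixer only alters the routers' backward rule, the same script can be run with and without it, which is how we compare learning curves and training cost.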
Here is the notebook I used to run my fine-tuning experiments with DenseMixer: