@Niccolo-Ajroldi Niccolo-Ajroldi commented Sep 30, 2025

New Submission: Muon

Submission Information

submission_name: "muon_torch"  # As it will appear on the leaderboard
submission_folder: "submissions/external_tuning/"  # Name of folder within `submissions/external_tuning/` or `submissions/self_tuning/` (lowercase, no spaces)
authors: "Niccolò Ajroldi*"  # List authors separated by commas
affiliations: "ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems"  # List all affiliations of the authors, separated by commas
version: "1.0"  # Optional version number of your submission
ruleset: "external"  # Either "external" or "self-tuning"
framework: "PyTorch"  # Either "PyTorch" or "JAX"
description: "Muon implementation in PyTorch."  # A short, high-level description of the algorithm

*credits to original Muon implementation from Keller Jordan

Evidence for the Submission's Performance

Muon original blogpost
Muon is Scalable for LLM Training
Modded-nanogpt

Submission Details

The final goal is to have a single implementation to score. During development, we considered three different implementations of Muon:

  • MuonVanilla: the optimization algorithm is replicated identically across devices: each rank orthogonalizes all parameters.
  • MuonKJ: this design follows the up-to-date code in the Muon GitHub repo.
    Each device updates a distinct subset of parameters locally, and the updated parameters are later AllGathered.
    This exploits PyTorch 2.5's support for all-gathering tensors of different shapes (docs).
  • MuonBucketed: parameters are grouped into buckets of same-shaped tensors.
    Each device updates a distinct subset of parameters locally, and the updated parameters are later AllGathered.
    I believe a similar approach is followed in Dion.
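
The partitioning shared by MuonKJ and MuonBucketed can be sketched in plain Python. This is a hypothetical helper, not the submission's actual code; it illustrates a round-robin assignment in the spirit of Keller Jordan's implementation, where each rank orthogonalizes only the parameters it owns and the updated tensors are then AllGathered so every rank ends up with all of them:

```python
# Hypothetical sketch of the per-rank parameter ownership behind MuonKJ:
# parameter i is updated only by rank i % world_size.

def assign_params_to_ranks(num_params: int, world_size: int) -> dict[int, list[int]]:
    """Round-robin assignment of parameter indices to ranks."""
    owned = {rank: [] for rank in range(world_size)}
    for i in range(num_params):
        owned[i % world_size].append(i)
    return owned

# Example: 10 parameters on 4 ranks.
owned = assign_params_to_ranks(10, 4)
# Every parameter is owned by exactly one rank, so an AllGather of the
# per-rank updates reconstructs the full, replicated parameter set.
assert sorted(i for ids in owned.values() for i in ids) == list(range(10))
```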

A visualization of the different parallelization strategies is shown in this diagram.

The MuonBucketed parallelization strategy is similar to the original MuonKJ implementation, resulting in a similar number of NCCL calls. Its main advantage is that AllGather calls operate on tensors of the same shape, which is likely faster than all-gathering tensors of different shapes. The downside is that each bucket might need padding, so some GPUs sit idle while others compute updates. In practice this appears less efficient than MuonKJ, hence we do not consider MuonBucketed further.
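
The bucketing-plus-padding idea can be illustrated with a toy helper (hypothetical names, plain Python): parameters are grouped by shape, and each bucket is padded to a multiple of the world size; a padded slot corresponds to a rank that sits idle for that bucket.

```python
# Toy sketch of shape-based bucketing with padding (not the submission's code).
from collections import defaultdict

def bucket_by_shape(shapes: list[tuple], world_size: int) -> dict:
    """Group parameter indices by shape; pad each bucket to a multiple of
    world_size with None entries (idle slots)."""
    buckets = defaultdict(list)
    for i, shape in enumerate(shapes):
        buckets[shape].append(i)
    return {s: ids + [None] * (-len(ids) % world_size)
            for s, ids in buckets.items()}

# Three params, two sharing a shape, on 2 ranks: the odd-sized bucket is padded,
# so one rank is idle while the other computes that bucket's update.
buckets = bucket_by_shape([(64, 64), (64, 64), (128, 64)], world_size=2)
assert buckets == {(64, 64): [0, 1], (128, 64): [2, None]}
```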

Since different devices update different parameters, each device only requires the reduced gradients for the subset it updates. Therefore, when using MuonKJ (or MuonBucketed), it is superfluous to AllReduce the gradients before the optimizer step, as traditionally done by DDP during loss.backward(). We just need to ReduceScatter the gradients, apply Muon, and AllGather the updated parameters. Crucially, the scatter operation must match the block structure of the MuonKJ distributed update.
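
The communication pattern above can be simulated without NCCL. The following plain-Python sketch (hypothetical helpers) mimics the semantics: ReduceScatter leaves each rank with only the reduced chunk it needs for its share of the update, and AllGather of the updated chunks restores full replication.

```python
# Toy single-process simulation of ReduceScatter + AllGather semantics.

def reduce_scatter(per_rank_grads: list[list[float]], world_size: int) -> list[list[float]]:
    """Each rank r receives the element-wise sum of chunk r across all ranks."""
    n = len(per_rank_grads[0])
    chunk = n // world_size
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_gather(per_rank_chunks: list[list[float]]) -> list[float]:
    """Concatenate every rank's chunk so all ranks see the full tensor."""
    return [x for chunk in per_rank_chunks for x in chunk]

# 2 ranks, a 4-element gradient: each rank keeps only its reduced half.
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
chunks = reduce_scatter(grads, world_size=2)
assert chunks == [[11.0, 22.0], [33.0, 44.0]]
# After each rank applies its local update, AllGather restores replication.
assert all_gather(chunks) == [11.0, 22.0, 33.0, 44.0]
```

An AllReduce would instead hand every rank the full summed gradient, which is wasted communication when each rank only updates its own block.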

Notice, however, that the AllReduce issued by PyTorch DDP is optimized to overlap with gradient computation during the backward pass. We therefore test MuonKJ both with and without a custom ReduceScatter implementation, to see whether we can improve over the highly optimized torch AllReduce.

We implement three submissions:

  1. muon_vanilla.py: uses MuonVanilla, with traditional DP all-reduce of gradients.
  2. muon_kj.py: uses MuonKJ, with traditional DP all-reduce of gradients.
  3. muon_kj_custom_reduce_scatter.py: uses MuonKJ, with the aforementioned reduce-scatter of gradients.

We compare these implementations:

  • Equivalence: do they yield the same updates?
  • Are the efficient DP algorithms faster than the vanilla one?
  • Is the KJ version with reduce-scatter faster than the KJ version with traditional DP all-reduce?

Comparison results

We compare the implementations by means of yielded loss (in a deterministic run) and accumulated submission time in a dedicated wandb report.

Implementation equivalence

We verify that the three implementations are equivalent across workloads: loss profiles overlap for all workloads except criteo1tb. Further checks on criteo1tb might be necessary.

Speed

We compare implementations on 4xA100-80GB, training for 5% of step_hint and repeating the experiment with 5 different random seeds. We report here the mean total time (in minutes) accumulated across workloads.

| optim_name | accumulated_submission_time_min | time_saved_over_vanilla_min |
|------------|--------------------------------:|----------------------------:|
| vanilla    | 349.66                          | 0.00                        |
| KJ         | 337.26                          | 12.41                       |
| KJ_RS      | 337.98                          | 11.69                       |

We observe high variability in some workloads, and the ranking varies across workloads. Overall, KJ is significantly faster than the vanilla implementation, though the ReduceScatter version does not clearly improve over plain AllReduce.

Parameters allocation: Muon vs Adam

We use AdamW as the backup optimizer, optimizing the following parameters with it:

  • 1D params (biases, layernorm, batchnorm)
  • Embeddings of WMT and Criteo, identified by embeddings in the parameter name
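
The routing rule above can be expressed as a small predicate. This is a hedged sketch (the predicate name and exact string match are illustrative, not the submission's code): 1D tensors and embedding parameters go to AdamW, everything else to Muon.

```python
# Illustrative parameter-routing predicate (hypothetical helper).

def use_muon(param_name: str, param_ndim: int) -> bool:
    """Return True if this parameter should be optimized with Muon."""
    if param_ndim <= 1:                # biases, layernorm/batchnorm scales
        return False
    if "embedding" in param_name:      # WMT / Criteo embedding tables
        return False
    return True

assert not use_muon("encoder.layer_norm.weight", 1)
assert not use_muon("shared_embedding.weight", 2)
assert use_muon("decoder.attention.query.weight", 2)
```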

We attach txt files with the resulting parameter split for each workload, for ease of inspection.

Momentum implementation

We follow the Adam-style EMA implementation of momentum, as done in the official Muon repo and in
modded-nanogpt. Notice, however, that the original formulation of Muon uses PyTorch-SGD-style momentum, and a similar implementation is followed by Moonshot AI.
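
The difference between the two conventions is a constant scale. A scalar illustration (assumed forms, matching the description above): with a constant gradient g, Adam-style EMA momentum converges to g, while PyTorch-SGD-style momentum (without dampening) converges to g / (1 - beta).

```python
# Scalar comparison of the two momentum conventions mentioned in the text.
beta, g = 0.95, 1.0
ema_buf, sgd_buf = 0.0, 0.0
for _ in range(500):
    ema_buf = beta * ema_buf + (1.0 - beta) * g   # Adam-style EMA momentum
    sgd_buf = beta * sgd_buf + g                  # PyTorch-SGD-style momentum

assert abs(ema_buf - g) < 1e-6                    # EMA converges to g
assert abs(sgd_buf - g / (1.0 - beta)) < 1e-6     # SGD-style: g / (1 - beta)
```

In practice the two are equivalent up to a rescaling of the learning rate, which is why implementations can pick either convention.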

Notes

3D and 4D parameters are flattened over their trailing dimensions before Newton-Schulz (NS) orthogonalization is applied.
Nesterov momentum is supported, and we support dampening in the Muon momentum implementation.
We use separate learning rates for Muon and AdamW.
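
For illustration, a minimal numpy sketch of the quintic Newton-Schulz orthogonalization (coefficients from Keller Jordan's public Muon implementation; this is a simplified single-matrix version, not the submission's code), together with the trailing-dimension flattening applied to conv weights:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# 4D conv weights are flattened on trailing dims before orthogonalization.
w = np.ones((8, 4, 3, 3))                 # (C_out, C_in, k, k)
w2d = w.reshape(w.shape[0], -1)
assert w2d.shape == (8, 36)
```

Note that the iteration drives singular values toward 1 only approximately (they oscillate in a band around 1), which is sufficient for the optimizer's purposes.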

Discussion points

  • Should we further optimize the ReduceScatter to potentially overlap with backward?
  • Should we use AdamW or NAdamW as a backup optimizer?

Next steps

  • Efficient DP implementation
  • Momentum with dampening
  • Update with new dropout
  • Identify which layers to optimize with Muon and which with AdamW.
  • Test equivalence on toy problem (tests/).
  • Test equivalence on AlgoPerf workloads (deterministic + fixed eval_every_steps)
  • Compare speed across implementations
    • Is the efficient DP version faster?
    • Is it worth it to manually ReduceScatter gradients?
  • Decide on a single implementation to score.

@Niccolo-Ajroldi Niccolo-Ajroldi requested a review from a team as a code owner September 30, 2025 18:54
github-actions bot commented Sep 30, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@Niccolo-Ajroldi Niccolo-Ajroldi changed the title muon torch vanilla DP Muon torch submission (vanilla DP) Sep 30, 2025
@Niccolo-Ajroldi Niccolo-Ajroldi changed the title Muon torch submission (vanilla DP) Muon torch submission Oct 6, 2025
@Niccolo-Ajroldi Niccolo-Ajroldi changed the title Muon torch submission Muon torch submission [WIP] Oct 6, 2025