Feat (brevitas_examples/llm): better RMSNorm replacement #1436
base: dev
Conversation
Force-pushed from 9ed374b to 030d735
            delay_rewriters: bool = False,
-           return_rewriters: bool = False) -> None:
+           return_rewriters: bool = False,
+           extra_rmsnorm_classes: Optional[Tuple] = None) -> None:
I am not a big fan of having this extra argument and requiring these classes to be propagated from the context manager. Can we instead extend _is_scale_invariant_module (https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/equalize.py#L855) to have this extra logic? Ideally, part of the logic of set(type(x) for x in model.modules() if 'RMS' in type(x).__name__) could be extracted into a standalone method to be used both by _is_scale_invariant_module and rmsnorm_patch.
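For illustration, the extracted helper could look roughly like this (helper names are hypothetical, not an existing Brevitas API):

from typing import Set, Type

import torch.nn as nn


def find_rmsnorm_classes(model: nn.Module) -> Set[Type[nn.Module]]:
    # Collect every module class in the model whose name contains 'RMS'.
    return {type(m) for m in model.modules() if 'RMS' in type(m).__name__}


def is_rmsnorm_like(module: nn.Module, model: nn.Module) -> bool:
    # Shared check, usable both by _is_scale_invariant_module and by the patch.
    return isinstance(module, tuple(find_rmsnorm_classes(model)))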
The issue is that RMSNorm is not always "scale invariant"; for example, for other algorithms like weight equalization, it isn't.
Also, this better decouples how we identify the RMSNorm modules from how we check whether something is a scale-invariant function/module.
I agree this is not ideal, but I don't fully agree with your suggestion either.
If anything, there should be a fully general way to customize all the attributes that are used during the region walk algorithm.
The easiest way would be to have a dict where a user can override all the keys, but then we would need to handle that within each class (or GraphRotation to start with), which could be verbose but doable.
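For example, such an override dict could look roughly like this (illustrative names only, not the actual Brevitas API):

from typing import Callable, Dict, Optional

import torch.nn as nn

# Defaults used by the region-walk predicates; a user can override any key.
DEFAULT_WALK_CHECKS: Dict[str, Callable[[nn.Module], bool]] = {
    'is_scale_invariant': lambda m: isinstance(m, nn.LayerNorm) or 'RMS' in type(m).__name__,
    'is_sink': lambda m: isinstance(m, (nn.Linear, nn.Conv2d)),
}


class GraphRotationWithOverrides:

    def __init__(self, walk_check_overrides: Optional[Dict[str, Callable]] = None):
        # Start from the defaults and let the caller replace any of the keys.
        self.walk_checks = {**DEFAULT_WALK_CHECKS, **(walk_check_overrides or {})}

    def is_scale_invariant(self, module: nn.Module) -> bool:
        return self.walk_checks['is_scale_invariant'](module)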
Why is RMS not "scale invariant" for weight equalization?
Because you compute the per-channel variance of the input tensor, which changes if you have a scale factor per channel before/after this op
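A quick numeric sketch of that point (illustrative only, not code from the PR): a single global scale is absorbed by the RMS statistic, but a per-channel scale is not, and it cannot be pushed through the op either.

import torch

torch.manual_seed(0)
x = torch.randn(4, 8)  # (tokens, channels)
w = torch.ones(8)      # RMSNorm weight


def rmsnorm(x, w, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * w


s_channel = torch.rand(8) + 0.5  # per-channel scales

# Global scalar scale: absorbed by the RMS statistic (up to eps effects).
print(torch.allclose(rmsnorm(3.0 * x, w), rmsnorm(x, w), atol=1e-4))                    # True
# Per-channel scale: the output changes and the scale cannot be pushed through.
print(torch.allclose(rmsnorm(s_channel * x, w), rmsnorm(x, w), atol=1e-4))              # False
print(torch.allclose(rmsnorm(s_channel * x, w), s_channel * rmsnorm(x, w), atol=1e-4))  # False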
def __init__(self, model, config, enabled=True):
    self.model = model
    self.config = config
    self.enabled = enabled
This attribute seems to only be used in the next line, so I would remove it and just do if enabled.
for r in rewriters:
    self.model = r.apply(self.model)

self.model = self.model.to(dtype)
Why is this cast necessary?
Because when we re-init modules, we don't correctly propagate dtype and device to quant modules. It could be redundant, but better safe than sorry.
        eps=self.config.rms_norm_eps,
        dtype=dtype,
        device=device) for rms_cls in self.rmsnorm_classes]
dtype = next(iter(self.model.parameters())).dtype
Why does dtype need to be retrieved again? It seems to have been registered at L35 already.
rewriters = []
dtype = next(self.model.parameters()).dtype
device = next(self.model.parameters()).device
for rms_class in self.rmsnorm_classes:
This logic seems fragile to me, e.g. if the signature had more than two parameters, this would crash. Moreover, there would be no real need to re-instantiate the original RMSNorm modules if their references were kept and then assigned back to the model. However, this would require some extra logic, as this behaviour is not supported by ModuleToModuleByClass, but I think it could be done easily enough.
This is not meant to be robust or to support all types of LLMs (present or future).
Agreed, but this seems too patchy, especially considering that the only reason those modules need to be instantiated again is that their references were lost. If you kept those references, there would be no need to make any assumptions about the method's signature, as you would only need to assign the original modules back to the model.
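A sketch of the keep-the-references idea (illustrative names, not the PR's code): stash the original RMSNorm objects before rewriting, then assign those very objects back, so no assumption about their __init__ signature is needed.

import torch.nn as nn


def stash_rmsnorms(model: nn.Module, rmsnorm_classes):
    # Record (parent module, attribute name, original child) for later restoration.
    stashed = []
    for parent in model.modules():
        for attr, child in parent.named_children():
            if isinstance(child, tuple(rmsnorm_classes)):
                stashed.append((parent, attr, child))
    return stashed


def restore_rmsnorms(stashed):
    # Put the original module objects back in place, untouched.
    for parent, attr, original in stashed:
        setattr(parent, attr, original)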
for r in rewriters:
    self.model = r.apply(self.model)

self.model = self.model.to(dtype)
Why is this cast necessary?
As above
if args.replace_rmsnorm:
    model = replace_rmsnorm_with_torch(model, model.config)
# if args.replace_rmsnorm:
Remove?
pablomlago left a comment
Some extra changes might be needed. Also, it might be worth adding some tests for random tiny Gemma models, maybe hf-internal-testing/dummy-gemma?
Reason for this PR
Some models have unusual forward passes for RMSNorm, like Gemma, which uses (1 + weight) * input instead of weight * input. Similarly, we were not correctly handling the fact that Llama/Gemma (and maybe other models) cast up to float32 and then back to (b)float16 during the forward pass.
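For reference, a simplified paraphrase of a Gemma-style RMSNorm forward showing both quirks (not the verbatim upstream code):

import torch
import torch.nn as nn


class GemmaStyleRMSNorm(nn.Module):

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Weight is initialised to zero, so the effective scale is (1 + weight).
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quirk 2: compute in float32 regardless of the input dtype...
        x_fp32 = x.float()
        norm = x_fp32 * torch.rsqrt(x_fp32.pow(2).mean(-1, keepdim=True) + self.eps)
        # Quirk 1: scale by (1 + weight) instead of weight...
        out = norm * (1.0 + self.weight.float())
        # ...then cast back to the original (b)float16 input dtype.
        return out.type_as(x)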
Changes Made in this PR
We dynamically create a new class type that inherits from both torch.nn.RMSNorm and whatever RMSNorm class the LLM is using. Then we initialize the internal dict manually (a bit scary, but it works). Finally, we swap the newly created Frankenstein instance in for the old one. Importantly, the new instance passes the check isinstance(module, torch.nn.RMSNorm), which we use to apply rotations. All of the above does not work, because dynamo cannot trace through custom classes.
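A rough reconstruction of that abandoned Frankenstein approach (hypothetical names; assumes a torch version that provides torch.nn.RMSNorm):

import torch.nn as nn


def frankenstein_rmsnorm(orig: nn.Module) -> nn.Module:
    # Dynamically create a class that inherits from both the model's own
    # RMSNorm class and torch.nn.RMSNorm, so the original forward is kept
    # while isinstance(module, torch.nn.RMSNorm) passes.
    franken_cls = type('FrankenRMSNorm', (type(orig), nn.RMSNorm), {})
    new = franken_cls.__new__(franken_cls)
    # Initialise the internal state manually by copying the original
    # instance's dict (the "a bit scary but it works" part).
    new.__dict__.update(orig.__dict__)
    return new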
Instead, what we do now is replace the modules just long enough for dynamo to pick up the stock torch.nn.RMSNorm, and then we put back whatever RMSNorm class was originally intended.
As a side effect, we need to carry these class types around for some of our algorithms.
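A minimal, self-contained sketch of this swap-and-restore flow (illustrative names and structure, not the actual PR code):

from contextlib import contextmanager

import torch.nn as nn


@contextmanager
def torch_rmsnorm_for_tracing(model: nn.Module, eps: float = 1e-6):
    # Discover whatever RMSNorm classes this LLM uses (e.g. a Gemma-style RMSNorm).
    rmsnorm_classes = tuple({type(m) for m in model.modules() if 'RMS' in type(m).__name__})
    stashed = []
    for parent in model.modules():
        for attr, child in list(parent.named_children()):
            if isinstance(child, rmsnorm_classes):
                # Swap in a stock torch.nn.RMSNorm with matching shape/dtype/device.
                replacement = nn.RMSNorm(
                    child.weight.shape[0],
                    eps=eps,
                    device=child.weight.device,
                    dtype=child.weight.dtype)
                setattr(parent, attr, replacement)
                stashed.append((parent, attr, child))
    try:
        # Trace/export inside the with-block: dynamo only ever sees nn.RMSNorm.
        yield rmsnorm_classes
    finally:
        # Put back whatever RMSNorm class was originally intended.
        for parent, attr, child in stashed:
            setattr(parent, attr, child)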
Testing Summary
All existing tests; hopefully I didn't break too many.