@holgerroth

Fixes # .

Description

Key Optimizations:

  1. For the first contribution, use v.clone().mul_(weight) instead of v * weight - this makes a single copy that is then scaled in-place, rather than allocating a separate intermediate tensor.
  2. Use add_(v, alpha=weight), which is equivalent to += v * weight but performed in-place without allocating any intermediate tensors. This is the biggest memory saver.
  3. Use div_(self.counts[k]) for in-place division instead of creating a new tensor for the normalization step.
  4. Backward compatibility: the code checks whether the values support in-place ops and falls back to the original approach for non-PyTorch data types (see the sketch after this list).
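
A minimal sketch of the accumulation pattern described above, assuming PyTorch tensors; the container and function names (totals, counts, data, accumulate, finalize) are illustrative only and not necessarily the names used in the actual aggregator code:

```python
import torch

def accumulate(totals: dict, counts: dict, data: dict, weight: float) -> None:
    """Fold one client's update into the running totals, in-place where possible."""
    for k, v in data.items():
        if isinstance(v, torch.Tensor):
            if k not in totals:
                # First contribution: one copy, scaled in-place (no extra intermediate).
                totals[k] = v.clone().mul_(weight)
            else:
                # Equivalent to totals[k] += v * weight, without a temporary tensor.
                totals[k].add_(v, alpha=weight)
        else:
            # Fallback for non-PyTorch data types: original out-of-place path.
            totals[k] = totals.get(k, 0) + v * weight
        counts[k] = counts.get(k, 0.0) + weight

def finalize(totals: dict, counts: dict) -> dict:
    """Normalize each accumulated entry by its total weight."""
    for k, t in totals.items():
        if isinstance(t, torch.Tensor):
            t.div_(counts[k])  # in-place division, no new tensor
        else:
            totals[k] = t / counts[k]
    return totals
```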

Memory Savings:

For N clients with model size M:

  • Before: allocates roughly 2M of temporary tensor memory during aggregation (a weighted value plus a sum result for each parameter).
  • After: allocates roughly 0.5M of temporary tensor memory (only the initial clone); all other operations are in-place.

For a large model (e.g., 1B parameters as float32 = 4GB), this saves approximately 4-8GB of peak memory during aggregation with just a few clients.
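
Rough arithmetic behind that estimate, using the ~2M vs. ~0.5M figures above: with M = 4 GB, the previous path peaks at about 2 × 4 GB = 8 GB of temporaries, while the in-place path peaks at about 0.5 × 4 GB = 2 GB, a reduction of roughly 6 GB, consistent with the 4-8 GB range quoted above.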

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.
