Releases: MooreThreads/torch_musa
torch_musa Release v2.7.0
Release Note
We are excited to announce the release of torch_musa v2.7.0, based on PyTorch v2.7.1. Along with torch v2.7.1, we support more features, such as Dynamic Double Casting and Distributed Checkpointing. We have also isolated the torchvision kernels from torch_musa; users who want torchvision should install it from the repo that we have musified (see the README for more details).
New Features
Dynamic Double Casting
We support dynamic casting for some operators with float64 dtype. Previously, few float64 operators were supported; now one can set the environment variable "export TORCH_USE_MUSA_DOUBLE_CAST=1" and torch_musa will use float32 as the compute dtype;
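For example, a minimal sketch (tensor shapes are illustrative; whether the variable must be set before the first MUSA kernel launches is an assumption):
import os
# Assumption: set the variable before MUSA work starts; `export TORCH_USE_MUSA_DOUBLE_CAST=1`
# in the shell works equally well.
os.environ["TORCH_USE_MUSA_DOUBLE_CAST"] = "1"
import torch  # torch_musa is loaded automatically since v2.1.0
a = torch.randn(128, 128, dtype=torch.float64, device="musa")
b = torch.randn(128, 128, dtype=torch.float64, device="musa")
c = a @ b  # accepted as float64, computed internally with float32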
Distributed Checkpointing
We enable Distributed Checkpointing, including asynchronous checkpoint save, which supports loading and saving models from multiple ranks in parallel. This can significantly accelerate the saving and loading of checkpoints;
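A rough sketch using PyTorch's torch.distributed.checkpoint API (the model, optimizer and paths are illustrative; a process group on the MUSA/MCCL backend is assumed to be initialized already):
import torch.distributed.checkpoint as dcp

state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

# Synchronous distributed save: every rank writes its own shard in parallel.
dcp.save(state_dict, checkpoint_id="/tmp/ckpt/step_100")

# Asynchronous save: returns a future so training can continue immediately.
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/step_200")
# ... keep training ...
future.result()  # wait before issuing the next checkpoint

# Parallel load back into the (already constructed) state_dict.
dcp.load(state_dict, checkpoint_id="/tmp/ckpt/step_100")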
MUSAExtension 'load'
We support "load" method for compiling MUSA extensions on the fly, which is quite useful for third party libraries that can be installed in many platforms, and during execution the kernels will be compiled or not depending on the platform environment;
Enhancements
Operators
- We added Poisson, binomial, _standard_gamma, _sample_dirichlet, vdot, upsample (1d, 2d, 3d, with anti-aliasing), flash_attention, transformer_encoder_layer, ... operators; the number of supported MUSA-specific operators now exceeds 1050;
- We improved profiler (Kineto) stability and upgraded the musified Kineto to version 2.7.0 as well;
- We optimized memory usage for pipeline parallelism in FSDP2;
- We supported more quantized operators, which can be used in our model compression toolkit (to be released soon);
Features
- Both torch.compile and AOTInductor are enhanced through the upgrade of torch;
- TF32 is enabled by default;
- We keep improving the stability of torch_musa by fixing potential bugs in some MUSA kernels;
Known Issues
- Some FFT operators are worked around by offloading to the CPU; this will be fixed in the next release.
Enjoy.
torch_musa Release v2.5.0
Release Note
torch_musa v2.5.0 is now available. We now align the torch_musa version with PyTorch, integrate the muSolver and muFFT libraries into torch_musa, and support UMM for Unified Memory devices. We keep improving compatibility with the latest MUSA SDK, so this release of torch_musa can be built with MUSA SDK 4.2.0 - 4.3.0 and later versions. The number of supported operators in torch_musa has increased to over 1000.
New Features
Support UMM for M1000
The Arm architecture employs a UMA (Unified Memory Addressing) design, enabling the GPU and CPU to access a single, shared physical memory space. To optimize memory consumption during model execution on the M1000, this implementation enables:
- Elimination of duplicate memory allocation on GPU
- Reduction of memory copy between host and device
- Direct GPU access to memory originally allocated by CPU allocator
We introduce Unified Memory Management support for the MUSA backend, which avoids GPU memory allocation in torch.load(map_location="musa"). This feature can be enabled by setting the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
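For example (the model file name is illustrative):
import os
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"  # or export it in the shell

import torch
# On a UMA device such as the M1000, the GPU reads the CPU-allocated weights
# directly, so torch.load avoids a separate GPU allocation and host-to-device copy.
state_dict = torch.load("model.pt", map_location="musa")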
Enhancements
Operators
- Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc.;
- Support some basic sparse (CSR) operations;
- Add support for more quantized operators;
- Fix torch.norm shape error;
- Support reduce_sum with uint8 input dtype and int64 output dtype;
- Support tensor.is_musa() in C++ extensions;
- Fix argmax/min with empty input;
Performances
- Optimize the performance of var/std, pad, convolution3d and layer_norm;
Functionality
- Enable torch.musa.mccl.version();
- Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
- Optimize the memory consumption of FSDP2 pipeline parallelism;
Known Issues
- Complex dtype operators are not fully supported yet; some operators are worked around by falling back to the CPU.
Enjoy.
torch_musa Release v2.1.1
torch_musa v2.1.1 bug fix release
torch_musa v2.1.1 is now available. This is an enhanced version of v2.1.0, aimed at fixing issues discovered during projects and improving core features. Despite some known issues, complete functional/integration tests have passed on MUSA 4.2.0. The number of natively supported operators has increased to over 948.
New Features
- Support the musagraphs backend for torch.compile, introducing reduced host overhead and e2e acceleration from musa-graph.
- muSolver has been integrated into the backend of several linalg operators, including lu_factor_ex, lu_solve, solve_ex, cholesky_ex, ...
- FusedAdamW/FusedAdam on MUSA are available for DTensor and other Tensor variants that are based on the torch_dispatch mechanism.
- Benchmark module has been expanded to include more operator cases.
Enhancements
- Fixed the occurrence of zero values in exponential, inspired by Intel MKL vRngExponential(...)
- Ensured early return for some 0-numel op cases
- Optimized one-hot by eliminating redundant preprocessing logic
- Added rrelu_with_noise/nansum; RoPE now supports multi-latent
- Extended SDPA to support no-batch inputs; enabled mask-grad only for the math backend
- Fixed scatter_reduce crash and cross-entropy with "none" mode cases
- Improved bandwidth of binary ops for cases where the rhs is not contiguous in the last dimension
torch_musa Release v2.1.0
Release Note
We are excited to announce the release of torch_musa v2.1.0, based on PyTorch v2.5.0. This release delivers optimized performance and flexibility across key PyTorch components on the MUSA platform.
We support AOTInductor and FSDP2, adapt them to our Memory Management and Triton-MUSA, and improve the performance of a bunch of operators as well. The number of supported operators in torch_musa has increased to over 930. We've also simplified MUSA integration with automatic torch_musa loading, so users are no longer required to call "import torch_musa" in Python scripts.
New Features
AOT Inductor
MUSA-backend support is now integrated into AOTInductor, enabling models to be ahead-of-time compiled for MUSA devices. This allows seamless inference acceleration via both C++ and Python runtimes, streamlining deployment on MUSA hardware.
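A hedged sketch of the Python-side flow, assuming the stock torch._export.aot_compile / aot_load entry points are used unchanged with the MUSA device (MyModel and the shapes are illustrative):
import torch

model = MyModel().to("musa").eval()
example_inputs = (torch.randn(8, 64, device="musa"),)

# Ahead-of-time compile the model into a shared library for the MUSA device.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

# Load and run the compiled artifact from Python; a C++ runtime path is also available.
runner = torch._export.aot_load(so_path, device="musa")
out = runner(*example_inputs)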
FSDP2
Features DTensor-based per-parameter sharding FSDP with Moore Threads GPU optimization, enabling hardware-accelerated distributed training through custom sharding strategies and native mixed precision for Large Models.
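A minimal sketch of FSDP2's per-parameter sharding on MUSA, assuming the upstream fully_shard API from torch.distributed._composable.fsdp is used unchanged, a process group is already initialized, and MyTransformer with a .layers list is illustrative:
import torch
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

model = MyTransformer().to("musa")

# Per-parameter (DTensor) sharding with native mixed precision.
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for block in model.layers:
    fully_shard(block, mp_policy=mp)
fully_shard(model, mp_policy=mp)

out = model(inputs)  # inputs defined elsewhere; training proceeds as usual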
Memory Management
We are pleased to introduce a pluggable MUSA (Memory Unified System Allocator) backend, providing greater flexibility and customization for memory management in your applications.
Triton-MUSA(reland)
Reintroduces the MUSA integration with TorchInductor based on PyTorch 2.5, with reduced device-specific code.
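For example (MyModel and shapes are illustrative):
import torch

model = MyModel().to("musa")
compiled = torch.compile(model, backend="inductor")  # Inductor lowers to Triton-MUSA kernels

x = torch.randn(32, 128, device="musa")
y = compiled(x)  # the first call triggers compilation; later calls reuse the kernels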
Enhancements
Operators
We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 930 operators, with which we can deploy most DL models from both industry and academia.
- Math Ops: _masked_softmax, tril_indices, triu_indices, trace, ...
- Statistical: nanmedian, normal, huber_loss, cauchy, log_normal,...
- NN Ops: native_batch_norm, reflection_pad, fractional_max_pool, ...
- Advanced Math: cosh, erfc, lgamma, digamma, polygamma,...
Performances
We've optimized quantization operators and enhanced the split and chunk operators. We also added a fused cross-entropy loss implementation that helps reduce peak memory usage, and many more improvements too numerous to list individually here.
Build
The MUSA backend now automatically initializes with torch - no manual imports or environment setup required. We also revamped the CMake build system to seamlessly integrate MUSA-accelerated Torch libraries into C++ projects through modern target-based dependency management.
Enjoy.
torch_musa Release v2.0.1
torch_musa v2.0.1 Release, bug fix release
Enhancements and bug fixes, including:
- Fixed device index error of aten::_scaled_mm
- Fixed runtime error of aten::all.dim
- Cherry-picked the security enhancement of making torch.load(*, weights_only=True)
- Porting PyTorch headers to MUSA_PORT_xxx is deprecated
- Added support for more operators
torch_musa Release v2.0.0
Release Note
We are excited to announce the release of torch_musa v2.0.0, based on PyTorch v2.2.0.
In this release, we support MUSA virtual memory management, torch.compile + TorchInductor with the Triton backend, fused modules with higher performance such as SwiGLU and RoPE, and MUSAGraph for archs greater than QY2, and we improve the performance of a bunch of operators as well. The number of supported operators in torch_musa has increased to over 760.
With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
VMM (virtual memory management)
We have implemented the ExpandableSegment memory allocator based on the MUSA VMM API, which effectively mitigates GPU memory fragmentation and reduces peak memory consumption during model training, especially in LLM training scenarios such as FSDP, DeepSpeed and Megatron-LM.
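A hedged sketch of opting in; the exact configuration string is an assumption modeled on PyTorch's PYTORCH_CUDA_ALLOC_CONF expandable_segments option:
import os
# Assumption: torch_musa mirrors PyTorch's allocator knob for expandable segments.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "expandable_segments:True"

import torch
# Subsequent MUSA allocations use expandable segments, reducing fragmentation
# during long training runs (e.g. FSDP/DeepSpeed/Megatron-LM workloads).
x = torch.randn(4096, 4096, device="musa")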
MUSAGraph
We have implemented the MUSAGraph interface, which is consistent with CUDAGraph. It captures a sequence of MUSA kernels into a graph, which provides a mechanism to launch these kernels through a single CPU operation and hence reduces launch overhead. NOTE: it currently supports computational logic only (no MCCL support), and it is still an experimental feature in the MUSA Runtime.
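A hedged sketch mirroring the usual CUDAGraph capture pattern; torch.musa.MUSAGraph and the torch.musa.graph context manager are assumed to mirror their torch.cuda counterparts, and model is illustrative:
import torch

static_in = torch.randn(16, 256, device="musa")

model(static_in)                  # warm up outside the capture
g = torch.musa.MUSAGraph()        # assumed analogue of torch.cuda.CUDAGraph
with torch.musa.graph(g):         # capture a fixed sequence of MUSA kernels
    static_out = model(static_in)

# Replaying launches the whole captured sequence with a single CPU call.
static_in.copy_(torch.randn(16, 256, device="musa"))
g.replay()
print(static_out)                 # holds the result of the replayed graph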
torch.compile for MUSA
We have integrated the triton_musa backend into TorchInductor and implemented partial adaptations for TorchDynamo, enabling users to accelerate both model training and inference through PyTorch's torch.compile interface.
Fused modules & functionals
We support the custom fused modules torch.nn.RoPE, torch.nn.SwishGLU and FusedCrossEntropy, which can be used in LLMs to accelerate training and inference.
FP8 support
We support FP8 matmul and distributed communication in torch_musa for archs greater than QY2.
Enhancements
Operators
We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 760 operators, with which we can deploy most publicly available DL models.
Build
We support multi-arch compilation: one can build torch_musa on any arch of the MTGPU platform and then run it on other platforms.
Enjoy.
torch_musa Release v1.3.2
Release Notes
We are excited to release torch_musa v1.3.2 based on PyTorch v2.2.0!
In this release, we support running torch_musa on multiple archs and introduce FP8 matmul, as well as torch.compile on the MUSA backend, both of which are useful for accelerating training/inference tasks. Another highlight is that users can implement their own customized operators using torch.library through the Python frontend; with the support of triton_musa, we have more flexibility to implement high-efficiency operators. For training tasks, we support FusedAdam, which is highly recommended for LLM training. In addition, we have now adapted more than 700 operators.
With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
Features
New features
- Support torch_musa on multiple archs, and optimize compiler flags for better performance;
- Support torch.library: users can now implement their own kernels and operators using torch.library, which is compatible with triton_musa (see the sketch after this list);
- Support FP8 matmul;
- Support the MUSA backend for torch.compile and TorchInductor, a highly recommended PyTorch feature that is now available on MUSA;
- Support the FusedAdam optimizer, which has better performance than the original one and also includes some custom optimizations;
- Support TCPStore with Libuv backend;
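A small sketch of registering a custom operator for the MUSA backend with torch.library; the PrivateUse1 dispatch key is an assumption about how torch_musa tensors are dispatched, and the operator itself is illustrative:
import torch
from torch.library import Library

lib = Library("mylib", "DEF")
lib.define("scaled_relu(Tensor x, float alpha) -> Tensor")

def scaled_relu_musa(x, alpha):
    # A Triton-MUSA or hand-written MUSA kernel could be called here instead.
    return torch.relu(x) * alpha

# Assumption: torch_musa tensors dispatch through the PrivateUse1 key.
lib.impl("scaled_relu", scaled_relu_musa, "PrivateUse1")

x = torch.randn(8, device="musa")
y = torch.ops.mylib.scaled_relu(x, 2.0)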
Operators supporting
- New operators: torch.std, rmsnorm.out, reflection_pad, torch.mish, torch.logsigmoid
- New dtypes supported:
- int to float of torch.sum
- Long for torch.histc
- Bool for torch.index_select
- Bool for torch.add
- Int and float dtypes for torch.masked_select and torch.masked_scatter
Bugs fixed & Enhancements
- Fix the Arm platform failing to link libmusa_kernels.so
- Fix error of indexing kernel with negative indices
- Fix missing dtype supports of MCCL
- Fix math SDPA with ComputeMode setting
- Fix low performance of torch.gather
- Update AMP to be more compatible with privateuse1
- Fix misaligned shared memory pointer on S5000
- Fix error of clamp with different input dtypes
- Optimize compilation steps of torch_musa
torch_musa Release v1.3.0
Highlights
We are excited to release torch_musa v1.3.0 based on PyTorch v2.2.0. In this release, we support FSDP (Fully Sharded Data Parallel) for large model training and improve the stability and efficiency of different operators. In general, we add more operators and support more Tensor dtypes for many operators on our MUSA backend.
With torch_musa v1.3.0, users can utilize most features released in PyTorch v2.2.0 on MUSA GPU, and gain more stable training and inference for many kinds of models in various fields, including the recently popular large language models.
The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
Enhancements
FSDP
We recommend that users refer to the official FSDP doc for more usage details, then come back to torch_musa to get the same experience as with the original.
Operators support
1. Support operators including torch.conv_transpose_3d, torch.fmod, torch.fmax and torch.fmin, etc.
2. Support more dtypes for torch.sort, torch.unique, etc.
Documentation
We provide developer documentation for developers, which describes the development environment preparation and some development steps in detail.
Dockers
We provide a release docker image and a development docker image.
torch_musa Release v1.2.1
Highlights
We are excited to release torch_musa v1.2.1 based on PyTorch v2.0.0. In this release, we support some basic and important features, including the torch_musa profiler, musa_extension, musa_converter, codegen and compare_tool. In addition, we have now adapted more than 600 operators. With these basic features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
torch_musa profiler
We have adapted PyTorch's official performance analysis tool, torch.profiler. Users can use this adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It can capture information about operators called on the host side and kernels executed on the GPU device.
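A minimal sketch (model and inputs are illustrative and assumed to already live on the "musa" device):
from torch.profiler import profile, record_function

# With default activities, the adapted profiler records host-side operator
# calls and the kernels they launch on the MUSA device.
with profile(record_shapes=True) as prof:
    with record_function("forward"):
        out = model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))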
musa_extension
We have implemented the MUSAExtension interface, which is consistent with CUDAExtension. It can be used to build customized operators on the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that use CUDAExtension can also be easily ported to the MUSA platform.
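A hedged setup.py sketch modeled on CUDAExtension usage; the import path torch_musa.utils.musa_extension and the source file names are assumptions:
from setuptools import setup
from torch_musa.utils.musa_extension import BuildExtension, MUSAExtension  # assumed import path

setup(
    name="my_musa_ops",
    ext_modules=[
        MUSAExtension(
            name="my_musa_ops",
            sources=["my_ops.cpp", "my_ops_kernel.mu"],  # illustrative sources
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)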
musa_converter
We have developed a conversion tool named musa_converter that translates PyTorch-CUDA-related strings and APIs in PyTorch scripts into torch_musa compatible code, which improves the efficiency of model migration from the CUDA platform to the MUSA platform. Users can run musa_converter -h to see its usage.
codegen
We introduce the codegen module to handle the automatic binding and registration of customized MUSA kernels. It extends torchgen, follows the format patterns of the native_functions.yaml file, and also supports different customization strategies, which can significantly reduce the workload of developers.
compare_tool
This tool is designed to enhance the debugging and validation process of PyTorch models by offering capabilities for comparing tensor operations across devices, tracking module hierarchies, and detecting the presence of NaN/Inf values. It is aimed at ensuring the correctness and stability of models through various stages of development and testing.
operator_benchmark
We followed PyTorch's operator_benchmark suite and adapted it into torch_musa. Developers can use it the same way as in PyTorch. It helps developers generate a fully characterized performance profile of an operator, and they can compare the results with those from CUDA or other accelerator backends to continuously improve the performance of torch_musa.
Enhancements
Operators support
1. Support operators including torch.mode, torch.count_nonzero, torch.sort(stable=True), torch.upsample2d/3d, torch.logical_or/and, etc.
2. Support more dtypes for torch.scatter, torch.eq, torch.histc, torch.logsumexp, etc.
Operators and modules optimize
1. Optimize and accelerate operators such as indexing kernels, embedding kernels, torch.nonzero, torch.unique, torch.clamp, etc.
2. Enable manual seed setting for the dropout layer.
3. Support SDP (scaled dot product attention) with GQA (grouped-query attention) and causal mask (see the sketch after this list).
4. The AMP usage is now aligned with CUDA: torch.autocast automatically enables torch_musa AMP.
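A small sketch of causal scaled-dot-product attention on the MUSA device (shapes and dtype are illustrative):
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)

# The causal mask is applied internally; GQA support is noted in item 3 above.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)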
Documentation
We provide developer documentation for developers, which describes the development environment preparation and some development steps in detail.
Dockers
We provide a release docker image and a development docker image.
torch_musa Release v1.1.0
torch_musa Release Notes
- Highlights
- New Features
- AMP mixed precision training
- MUSAExtension
- Pinned memory
- TensorCore computation
- CompareTool [Experimental]
- Supported Operators
- Documentation
- Dockers
Highlights
We are excited to release torch_musa v1.1.0 based on PyTorch v2.0.0. In this release, we support more important features, including AMP mixed precision training, MUSAExtension, TensorCore computation, pinned memory and CompareTool. In addition, we have adapted more than 470 operators, improved the DDP module and implemented more quantization operators. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
AMP mixed precision training
Now we support mixed precision training with BF16 and FP16. Note that S80 and S3000 only support fp16, while S4000 supports both fp16 and bf16; the interface is completely consistent with PyTorch. Users can use AMP as in the following code:
import torch
import torch.nn as nn

DEVICE = "musa"

def set_seed(seed=0):
    torch.manual_seed(seed)

# a minimal example model; any nn.Module works the same way
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 3)

    def forward(self, x):
        return self.fc(x)

# low_dtype can be torch.float16 or torch.bfloat16
def train_in_amp(low_dtype=torch.float16):
    set_seed()
    model = SimpleModel().to(DEVICE)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # create the scaler object
    scaler = torch.musa.amp.GradScaler()
    inputs = torch.randn(6, 5).to(DEVICE)  # move the data to the GPU
    targets = torch.randn(6, 3).to(DEVICE)
    for step in range(20):
        optimizer.zero_grad()
        # create autocast environment
        with torch.musa.amp.autocast(dtype=low_dtype):
            outputs = model(inputs)
            assert outputs.dtype == low_dtype
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    return loss
MUSAExtension
MUSAExtension and CUDAExtension are basically the same, except that with MUSAExtension one currently needs to manually add a dynamic library to the dynamic library search path. For detailed usage, please refer to torch_musa/torch_musa/utils/README.md and the developer documentation. This issue will be resolved in the next version.
Pinned memory
Pinned memory is now supported by torch_musa; the following code shows how to use it.
import torch
shape = (1024, 1024)  # any tensor shape
cpu_tensor = torch.rand(shape, dtype=torch.float32).pin_memory("musa")
gpu_tensor = cpu_tensor.to("musa", non_blocking=True)
TensorCore computation
The S4000 has TensorCores and therefore supports TF32-format calculations. Users can utilize TF32 for acceleration with the following code:
with torch.backends.mudnn.flags(allow_tf32=True):
    # your training code
    ...
CompareTool [Experimental]
CompareTool is an experimental tool aimed at automatically comparing the computation results between musa and cpu, thereby facilitating the debugging process. For detailed usage, please refer to torch_musa/utils/README.md
Supported Operators
More than 470 operators are supported in torch_musa.
Documentation
We provide a developer guide, which describes the development environment preparation and some development steps in detail.
Dockers
The release docker image and the development docker image are available now.
[NOTE]: If you want to compile torch_musa without using the provided docker image, please download the rc2.0.0 Intel CPU_Ubuntu underlying software stack from https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version=
[NOTE]:
- When installing the following released whl package, please remove the device name. For example,
- pip install torch-2.0.0-cp310-cp310-linux_x86_64.whl