Releases: MooreThreads/torch_musa
torch_musa Release v2.7.0
Release Note
We are excited to announce the release of torch_musa v2.7.0, based on PyTorch v2.7.1. Along with torch v2.7.1, we support more features, such as Dynamic Double Casting and Distributed Checkpointing. We have also isolated the torchvision kernels from torch_musa; users who want torchvision should install it from the repo that we have musified (see the README for more details).
New Features
Dynamic Double Casting
We support dynamic casting for some operators with float64 dtype. Previously, few float64 operators were supported; now one can set the environment variable "export TORCH_USE_MUSA_DOUBLE_CAST=1" and torch_musa will use float32 as the compute dtype;
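For example, a minimal sketch (tensor shapes are illustrative; whether the variable must be set before the first MUSA kernel launches is an assumption):
import os
# Assumption: set the variable before MUSA work starts; `export TORCH_USE_MUSA_DOUBLE_CAST=1`
# in the shell works equally well.
os.environ["TORCH_USE_MUSA_DOUBLE_CAST"] = "1"
import torch  # torch_musa is loaded automatically since v2.1.0
a = torch.randn(128, 128, dtype=torch.float64, device="musa")
b = torch.randn(128, 128, dtype=torch.float64, device="musa")
c = a @ b  # accepted as float64, computed internally with float32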
Distributed Checkpointing
We enable Distributed Checkpointing, including asynchronous checkpoint save, which supports loading and saving models from multiple ranks in parallel. This can significantly accelerate the saving and loading of checkpoints;
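A rough sketch using PyTorch's torch.distributed.checkpoint API (the model, optimizer and paths are illustrative; a process group on the MUSA/MCCL backend is assumed to be initialized already):
import torch.distributed.checkpoint as dcp

state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

# Synchronous distributed save: every rank writes its own shard in parallel.
dcp.save(state_dict, checkpoint_id="/tmp/ckpt/step_100")

# Asynchronous save: returns a future so training can continue immediately.
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/step_200")
# ... keep training ...
future.result()  # wait before issuing the next checkpoint

# Parallel load back into the (already constructed) state_dict.
dcp.load(state_dict, checkpoint_id="/tmp/ckpt/step_100")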
MUSAExtension 'load'
We support "load" method for compiling MUSA extensions on the fly, which is quite useful for third party libraries that can be installed in many platforms, and during execution the kernels will be compiled or not depending on the platform environment;
Enhancements
Operators
- We added Poisson, binomial, _standard_gamma, _sample_dirichlet, vdot, upsample (1d, 2d, 3d, with anti-aliasing), flash_attention, transformer_encoder_layer, ... operators; the number of supported MUSA-specific operators now exceeds 1050;
- We improved profiler (Kineto) stability and upgraded the musified Kineto to version 2.7.0 as well;
- We optimized memory usage for pipeline parallelism in FSDP2;
- We supported more quantized operators, which can be used in our model compression toolkit (to be released soon);
Features
- Both torch.compile and AOTInductor are enhanced through the upgrade of torch;
- TF32 is enabled by default;
- We keep improving the stability of torch_musa by fixing potential bugs in some MUSA kernels;
Known Issues
- Some FFT operators are worked around by offloading to the CPU; this will be fixed in the next release.
Enjoy.
torch_musa Release v2.5.0
Release Note
torch_musa v2.5.0 is now available. We now align the torch_musa version with PyTorch, integrate the muSolver and muFFT libraries into torch_musa, and support UMM for Unified Memory devices. We keep improving compatibility with the latest MUSA SDK, so this release of torch_musa can be built with MUSA SDK 4.2.0 - 4.3.0 and later versions. The number of supported operators in torch_musa has increased to over 1000.
New Features
Support UMM for M1000
The Arm architecture employs a UMA (Unified Memory Addressing) design, enabling the GPU and CPU to access a single, shared physical memory space. To optimize memory consumption during model execution on the M1000, this implementation enables:
- Elimination of duplicate memory allocation on GPU
- Reduction of memory copy between host and device
- Direct GPU access to memory originally allocated by CPU allocator
We introduce Unified Memory Management support for the MUSA backend, which avoids GPU memory allocation in torch.load(map_location="musa"). This feature can be enabled by setting the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
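For example (the model file name is illustrative):
import os
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"  # or export it in the shell

import torch
# On a UMA device such as the M1000, the GPU reads the CPU-allocated weights
# directly, so torch.load avoids a separate GPU allocation and host-to-device copy.
state_dict = torch.load("model.pt", map_location="musa")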
Enhancements
Operators
- Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc.;
- Support some basic sparse (CSR) operations;
- Add support for more quantized operators;
- Fix torch.norm shape error;
- Support reduce_sum with uint8 input dtype and int64 output dtype;
- Support tensor.is_musa() in C++ extensions;
- Fix argmax/min with empty input;
Performances
- Optimize the performance of var/std, pad, convolution3d and layer_norm;
Functionality
- Enable torch.musa.mccl.version();
- Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
- Optimize the memory consumption of FSDP2 pipeline parallelism;
Known Issues
- Complex dtype operators are not fully supported yet; some operators are worked around by falling back to the CPU.
Enjoy.
torch_musa Release v2.1.1
torch_musa v2.1.1 bug fix release
torch_musa v2.1.1 is now available. This is an enhanced version of v2.1.0, aimed at fixing issues discovered during projects and improving core features. Despite some known issues, complete functional/integration tests have passed on MUSA 4.2.0. The number of natively supported operators has increased to over 948.
New Features
- Support the musagraphs backend for torch.compile, introducing reduced host overhead and e2e acceleration from musa-graph.
- muSolver has been integrated into the backend of several linalg operators, including lu_factor_ex, lu_solve, solve_ex, cholesky_ex, ...
- FusedAdamW/FusedAdam on MUSA are available for DTensor and other Tensor variants that are based on the torch_dispatch mechanism.
- Benchmark module has been expanded to include more operator cases.
Enhancements
- Fixed the occurrence of zero values in exponential, inspired by Intel MKL vRngExponential(...)
- Ensured early return for some 0-numel op cases
- Optimized one-hot by eliminating redundant preprocessing logic
- Added rrelu_with_noise/nansum; RoPE now supports multi-latent
- Extended SDPA to support no-batch inputs; enabled mask-grad only for the math backend
- Fixed scatter_reduce crash and cross-entropy with "none" mode cases
- Improved bandwidth of binary ops for cases where the rhs is not contiguous in the last dimension
torch_musa Release v2.1.0
Release Note
We are excited to announce the release of torch_musa v2.1.0, based on PyTorch v2.5.0. This release delivers optimized performance and flexibility across key PyTorch components on the MUSA platform.
We support AOTInductor and FSDP2, adapt them to our Memory Management and Triton-MUSA, and improve the performance of a bunch of operators as well. The number of supported operators in torch_musa has increased to over 930. We've also simplified MUSA integration with automatic torch_musa loading, so users are no longer required to call "import torch_musa" in Python scripts.
New Features
AOT Inductor
MUSA-backend support is now integrated into AOTInductor, enabling models to be ahead-of-time compiled for MUSA devices. This allows seamless inference acceleration via both C++ and Python runtimes, streamlining deployment on MUSA hardware.
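A hedged sketch of the Python-side flow, assuming the stock torch._export.aot_compile / aot_load entry points are used unchanged with the MUSA device (MyModel and the shapes are illustrative):
import torch

model = MyModel().to("musa").eval()
example_inputs = (torch.randn(8, 64, device="musa"),)

# Ahead-of-time compile the model into a shared library for the MUSA device.
with torch.no_grad():
    so_path = torch._export.aot_compile(model, example_inputs)

# Load and run the compiled artifact from Python; a C++ runtime path is also available.
runner = torch._export.aot_load(so_path, device="musa")
out = runner(*example_inputs)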
FSDP2
Features DTensor-based per-parameter sharding FSDP with Moore Threads GPU optimization, enabling hardware-accelerated distributed training through custom sharding strategies and native mixed precision for Large Models.
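A minimal sketch of FSDP2's per-parameter sharding on MUSA, assuming the upstream fully_shard API from torch.distributed._composable.fsdp is used unchanged, a process group is already initialized, and MyTransformer with a .layers list is illustrative:
import torch
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

model = MyTransformer().to("musa")

# Per-parameter (DTensor) sharding with native mixed precision.
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for block in model.layers:
    fully_shard(block, mp_policy=mp)
fully_shard(model, mp_policy=mp)

out = model(inputs)  # inputs defined elsewhere; training proceeds as usual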
Memory Management
We are pleased to introduce a pluggable MUSA (Memory Unified System Allocator) backend, providing greater flexibility and customization for memory management in your applications.
Triton-MUSA(reland)
Reintroduces the MUSA integration with TorchInductor based on PyTorch 2.5, with reduced device-specific code.
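For example (MyModel and shapes are illustrative):
import torch

model = MyModel().to("musa")
compiled = torch.compile(model, backend="inductor")  # Inductor lowers to Triton-MUSA kernels

x = torch.randn(32, 128, device="musa")
y = compiled(x)  # the first call triggers compilation; later calls reuse the kernels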
Enhancements
Operators
We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 930 operators, with which we can deploy most DL models from both industry and academia.
- Math Ops: _masked_softmax, tril_indices, triu_indices, trace, ...
- Statistical: nanmedian, normal, huber_loss, cauchy, log_normal,...
- NN Ops: native_batch_norm, reflection_pad, fractional_max_pool, ...
- Advanced Math: cosh, erfc, lgamma, digamma, polygamma,...
Performances
We've optimized quantization operators and enhanced the split and chunk operators. We also added a fused cross-entropy loss implementation that helps reduce peak memory usage, and many more improvements too numerous to list individually here.
Build
The MUSA backend now automatically initializes with torch - no manual imports or environment setup required. We also revamped the CMake build system to seamlessly integrate MUSA-accelerated Torch libraries into C++ projects through modern target-based dependency management.
Enjoy.
torch_musa Release v2.0.1
torch_musa v2.0.1 Release, bug fix release
Enhancements and bug fixes, including:
- Fixed device index error of aten::_scaled_mm
- Fixed runtime error of aten::all.dim
- Cherry-picked the security enhancement of making torch.load(*, weights_only=True)
- Porting PyTorch headers to MUSA_PORT_xxx is deprecated
- Added support for more operators
torch_musa Release v2.0.0
Release Note
We are excited to announce the release of torch_musa v2.0.0, based on PyTorch v2.2.0.
In this release, we support MUSA virtual memory management, torch.compile + TorchInductor with the Triton backend, fused modules with higher performance such as SwiGLU and RoPE, and MUSAGraph for archs greater than QY2, and we improve the performance of a bunch of operators as well. The number of supported operators in torch_musa has increased to over 760.
With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
VMM (virtual memory management)
We have implemented the ExpandableSegment memory allocator based on the MUSA VMM API, which effectively mitigates GPU memory fragmentation and reduces peak memory consumption during model training, especially in LLM training scenarios such as FSDP, DeepSpeed and Megatron-LM.
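A hedged sketch of opting in; the exact configuration string is an assumption modeled on PyTorch's PYTORCH_CUDA_ALLOC_CONF expandable_segments option:
import os
# Assumption: torch_musa mirrors PyTorch's allocator knob for expandable segments.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "expandable_segments:True"

import torch
# Subsequent MUSA allocations use expandable segments, reducing fragmentation
# during long training runs (e.g. FSDP/DeepSpeed/Megatron-LM workloads).
x = torch.randn(4096, 4096, device="musa")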
MUSAGraph
We have implemented the MUSAGraph interface, which is consistent with CUDAGraph. It captures a sequence of MUSA kernels into a graph, which provides a mechanism to launch these kernels through a single CPU operation and hence reduces launch overhead. NOTE: it currently supports computational logic only (no MCCL support), and it is still an experimental feature in the MUSA Runtime.
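A hedged sketch mirroring the usual CUDAGraph capture pattern; torch.musa.MUSAGraph and the torch.musa.graph context manager are assumed to mirror their torch.cuda counterparts, and model is illustrative:
import torch

static_in = torch.randn(16, 256, device="musa")

model(static_in)                  # warm up outside the capture
g = torch.musa.MUSAGraph()        # assumed analogue of torch.cuda.CUDAGraph
with torch.musa.graph(g):         # capture a fixed sequence of MUSA kernels
    static_out = model(static_in)

# Replaying launches the whole captured sequence with a single CPU call.
static_in.copy_(torch.randn(16, 256, device="musa"))
g.replay()
print(static_out)                 # holds the result of the replayed graph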
torch.compile for MUSA
We have integrated the triton_musa backend into TorchInductor and implemented partial adaptations for TorchDynamo, enabling users to accelerate both model training and inference through PyTorch's torch.compile interface.
Fused modules & functionals
We support the custom fused modules torch.nn.RoPE, torch.nn.SwishGLU and FusedCrossEntropy, which can be used in LLMs to accelerate training and inference.
FP8 support
We support FP8 matmul and distributed communication in torch_musa for archs greater than QY2.
Enhancements
Operators
We keep adding more operators and dtypes to expand our capability to support more types of DL models. We currently support more than 760 operators, with which we can deploy most publicly available DL models.
Build
We support multi-arch compilation: one can build torch_musa on any arch of the MTGPU platform and then run it on other platforms.
Enjoy.
torch_musa Release v1.3.2
Release Notes
We are excited to release torch_musa v1.3.2 based on PyTorch v2.2.0!
In this release, we support running torch_musa on multiple archs and introduce FP8 matmul, as well as torch.compile on the MUSA backend, both of which are useful for accelerating training/inference tasks. Another highlight is that users can implement their own customized operators using torch.library through the Python frontend; with the support of triton_musa, we have more flexibility to implement high-efficiency operators. For training tasks, we support FusedAdam, which is highly recommended for LLM training. In addition, we have now adapted more than 700 operators.
With these basic features and operators, torch_musa could support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
Features
New features
- Support torch_musa on multiple archs, and optimize compiler flags for better performance;
- Support torch.library: users can now implement their own kernels and operators using torch.library, which is compatible with triton_musa (see the sketch after this list);
- Support FP8 matmul;
- Support the MUSA backend for torch.compile and TorchInductor, a highly recommended PyTorch feature that is now available on MUSA;
- Support the FusedAdam optimizer, which has better performance than the original one and also includes some custom optimizations;
- Support TCPStore with Libuv backend;
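A small sketch of registering a custom operator for the MUSA backend with torch.library; the PrivateUse1 dispatch key is an assumption about how torch_musa tensors are dispatched, and the operator itself is illustrative:
import torch
from torch.library import Library

lib = Library("mylib", "DEF")
lib.define("scaled_relu(Tensor x, float alpha) -> Tensor")

def scaled_relu_musa(x, alpha):
    # A Triton-MUSA or hand-written MUSA kernel could be called here instead.
    return torch.relu(x) * alpha

# Assumption: torch_musa tensors dispatch through the PrivateUse1 key.
lib.impl("scaled_relu", scaled_relu_musa, "PrivateUse1")

x = torch.randn(8, device="musa")
y = torch.ops.mylib.scaled_relu(x, 2.0)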
Operators supporting
- New operators: torch.std, rmsnorm.out, reflection_pad, torch.mish, torch.logsigmoid
- New dtypes supported:
- int to float of torch.sum
- Long for torch.histc
- Bool for torch.index_select
- Bool for torch.add
- Int and float dtypes for torch.masked_select and torch.masked_scatter
Bugs fixed & Enhancements
- Fix the Arm platform failing to link libmusa_kernels.so
- Fix error of indexing kernel with negative indices
- Fix missing dtype supports of MCCL
- Fix math SDPA with ComputeMode setting
- Fix low performance of torch.gather
- Update AMP to be more compatible with privateuse1
- Fix misaligned shared memory pointer on S5000
- Fix error of clamp with different input dtypes
- Optimize compilation steps of torch_musa
torch_musa Release v1.3.0
Highlights
We are excited to release torch_musa v1.3.0 based on PyTorch v2.2.0. In this release, we support FSDP (Fully Sharded Data Parallel) for large model training and improve the stability and efficiency of different operators. In general, we add more operators and support more Tensor dtypes for many operators on our MUSA backend.
With torch_musa v1.3.0, users can utilize most features released in PyTorch v2.2.0 on MUSA GPU, and gain more stable training and inference for many kinds of models in various fields, including the recently popular large language models.
The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
Enhancements
FSDP
We recommend that users refer to the official FSDP doc for more usage details, then come back to torch_musa to get the same experience as with the original.
Operators support
1. Support operators including torch.conv_transpose_3d, torch.fmod, torch.fmax and torch.fmin, etc.
2. Support more dtypes for torch.sort, torch.unique, etc.
Documentation
We provide developer documentation for developers, which describes the development environment preparation and some development steps in detail.
Dockers
We provide a release docker image and a development docker image.
torch_musa Release v1.2.1
Highlights
We are excited to release torch_musa v1.2.1 based on PyTorch v2.0.0. In this release, we support some basic and important features, including the torch_musa profiler, musa_extension, musa_converter, codegen and compare_tool. In addition, we have now adapted more than 600 operators. With these basic features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
torch_musa profiler
We have adapted PyTorch's official performance analysis tool, torch.profiler. Users can use this adapted tool to analyze the performance details of PyTorch model training or inference tasks running on the MUSA platform. It can capture information about operators called on the host side and kernels executed on the GPU device.
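A minimal sketch (model and inputs are illustrative and assumed to already live on the "musa" device):
from torch.profiler import profile, record_function

# With default activities, the adapted profiler records host-side operator
# calls and the kernels they launch on the MUSA device.
with profile(record_shapes=True) as prof:
    with record_function("forward"):
        out = model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))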
musa_extension
We have implemented the MUSAExtension interface, which is consistent with CUDAExtension. It can be used to build customized operators on the MUSA platform, making full use of GPU resources to accelerate computation. Many third-party PyTorch ecosystem libraries that use CUDAExtension can also be easily ported to the MUSA platform.
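A hedged setup.py sketch modeled on CUDAExtension usage; the import path torch_musa.utils.musa_extension and the source file names are assumptions:
from setuptools import setup
from torch_musa.utils.musa_extension import BuildExtension, MUSAExtension  # assumed import path

setup(
    name="my_musa_ops",
    ext_modules=[
        MUSAExtension(
            name="my_musa_ops",
            sources=["my_ops.cpp", "my_ops_kernel.mu"],  # illustrative sources
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)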
musa_converter
We have developed a conversion tool named musa_converter that translates PyTorch-CUDA-related strings and APIs in PyTorch scripts into torch_musa compatible code, which improves the efficiency of model migration from the CUDA platform to the MUSA platform. Users can run musa_converter -h to see its usage.
codegen
We introduce the codegen module to handle the automatic binding and registration of customized MUSA kernels. It extends torchgen, follows the format patterns of the native_functions.yaml file, and also supports different customization strategies, which can significantly reduce the workload of developers.
compare_tool
This tool is designed to enhance the debugging and validation process of PyTorch models by offering capabilities for comparing tensor operations across devices, tracking module hierarchies, and detecting the presence of NaN/Inf values. It is aimed at ensuring the correctness and stability of models through various stages of development and testing.
operator_benchmark
We followed PyTorch's operator_benchmark suite and adapted it into torch_musa. Developers can use it the same way as in PyTorch. It helps developers generate a fully characterized performance profile of an operator, and they can compare the results with those from CUDA or other accelerator backends to continuously improve the performance of torch_musa.
Enhancements
Operators support
1. Support operators including torch.mode, torch.count_nonzero, torch.sort(stable=True), torch.upsample2d/3d, torch.logical_or/and, etc.
2. Support more dtypes for torch.scatter, torch.eq, torch.histc, torch.logsumexp, etc.
Operators and modules optimize
1. Optimize and accelerate operators such as indexing kernels, embedding kernels, torch.nonzero, torch.unique, torch.clamp, etc.
2. Enable manual seed setting for the dropout layer.
3. Support SDP (scaled dot product attention) with GQA (grouped-query attention) and causal mask (see the sketch after this list).
4. The AMP usage is now aligned with CUDA: torch.autocast automatically enables torch_musa AMP.
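A small sketch of causal scaled-dot-product attention on the MUSA device (shapes and dtype are illustrative):
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="musa", dtype=torch.float16)

# The causal mask is applied internally; GQA support is noted in item 3 above.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)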
Documentation
We provide developer documentation for developers, which describes the development environment preparation and some development steps in detail.
Dockers
We provide a release docker image and a development docker image.
torch_musa Release v1.1.0
torch_musa Release Notes
- Highlights
- New Features
- AMP mixed precision training
- MUSAExtension
- Pinned memory
- TensorCore computation
- CompareTool [Experimental]
- Supported Operators
- Documentation
- Dockers
Highlights
We are excited to release torch_musa v1.1.0 based on PyTorch v2.0.0. In this release, we support more important features, including AMP mixed precision training, MUSAExtension, TensorCore computation, pinned memory and CompareTool. In addition, we have adapted more than 470 operators, improved the DDP module and implemented more quantization operators. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is due to the efforts of engineers in Moore Threads AI Team and other departments. We sincerely hope that everyone can continue to pay attention to our work and participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
AMP mixed precision training
Now we support mixed precision training with BF16 and FP16. Note that S80 and S3000 only support fp16, while S4000 supports both fp16 and bf16; the interface is completely consistent with PyTorch. Users can use AMP as in the following code:
import torch
import torch.nn as nn

DEVICE = "musa"

def set_seed(seed=0):
    torch.manual_seed(seed)

# a minimal example model; any nn.Module works the same way
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(5, 3)

    def forward(self, x):
        return self.fc(x)

# low_dtype can be torch.float16 or torch.bfloat16
def train_in_amp(low_dtype=torch.float16):
    set_seed()
    model = SimpleModel().to(DEVICE)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # create the scaler object
    scaler = torch.musa.amp.GradScaler()
    inputs = torch.randn(6, 5).to(DEVICE)  # move the data to the GPU
    targets = torch.randn(6, 3).to(DEVICE)
    for step in range(20):
        optimizer.zero_grad()
        # create autocast environment
        with torch.musa.amp.autocast(dtype=low_dtype):
            outputs = model(inputs)
            assert outputs.dtype == low_dtype
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    return loss
MUSAExtension
MUSAExtension and CUDAExtension are basically the same, except that with MUSAExtension one currently needs to manually add a dynamic library to the dynamic library search path. For detailed usage, please refer to torch_musa/torch_musa/utils/README.md and the developer documentation. This issue will be resolved in the next version.
Pinned memory
Pinned memory is now supported by torch_musa; the following code shows how to use it.
import torch
shape = (1024, 1024)  # any tensor shape
cpu_tensor = torch.rand(shape, dtype=torch.float32).pin_memory("musa")
gpu_tensor = cpu_tensor.to("musa", non_blocking=True)
TensorCore computation
The S4000 has TensorCores and therefore supports TF32-format calculations. Users can utilize TF32 for acceleration with the following code:
with torch.backends.mudnn.flags(allow_tf32=True):
    # your training code
    ...
CompareTool [Experimental]
CompareTool is an experimental tool aimed at automatically comparing the computation results between musa and cpu, thereby facilitating the debugging process. For detailed usage, please refer to torch_musa/utils/README.md
Supported Operators
More than 470 operators are supported in torch_musa.
Documentation
We provide a developer guide, which describes the development environment preparation and some development steps in detail.
Dockers
The release docker image and the development docker image are available now.
[NOTE]: If you want to compile torch_musa without using the provided docker image, please download the rc2.0.0 Intel CPU_Ubuntu underlying software stack from https://developer.mthreads.com/sdk/download/musa?equipment=&os=&driverVersion=&version=
[NOTE]:
- When installing the following released whl package, please remove the device name. For example,
- pip install torch-2.0.0-cp310-cp310-linux_x86_64.whl