| Documentation | Blog | Paper | Developer Slack |
- [2025/06] 🔥 UltraAttn has been accepted to The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’25).
UltraAttn is an automatic and adaptive implementation of context parallelism for block sparse attention that pursues optimal performance. Existing context parallelism suffers from poor scalability due to its stripe-like partition pattern, which causes high communication traffic, and its ring-based communication pattern, which limits kernel granularity, reduces device utilization, and incurs redundant communication. UltraAttn hierarchically tiles the context at the node and device levels to reduce communication cost, and applies kernel-level tiling to balance kernel overlap and device utilization. An ILP-based runtime further optimizes distributed attention latency. On 64 GPUs, UltraAttn achieves an average 5.5× speedup over state-of-the-art context parallelism methods across various block sparse attention types.
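For intuition, block sparse attention restricts each query block to a subset of key blocks. The sketch below is a minimal, hypothetical illustration (not UltraAttn's actual mask definitions or API) of block-level masks for the two sparse patterns evaluated later, strided attention and global+local attention; `num_blocks`, `stride`, `window`, and `num_global` are illustrative parameters.

```python
import torch

def strided_block_mask(num_blocks: int, stride: int) -> torch.Tensor:
    """Toy strided pattern: each query block attends to nearby key blocks
    and to every `stride`-th earlier key block (causal)."""
    q = torch.arange(num_blocks).unsqueeze(1)  # (num_blocks, 1)
    k = torch.arange(num_blocks).unsqueeze(0)  # (1, num_blocks)
    causal = k <= q
    return causal & (((q - k) % stride == 0) | (q - k < stride))

def global_local_block_mask(num_blocks: int, window: int, num_global: int) -> torch.Tensor:
    """Toy global+local pattern: a sliding local window plus a few key blocks
    that every query block can attend to (causal)."""
    q = torch.arange(num_blocks).unsqueeze(1)
    k = torch.arange(num_blocks).unsqueeze(0)
    causal = k <= q
    local = (q - k) < window
    global_blocks = k < num_global
    return causal & (local | global_blocks)

if __name__ == "__main__":
    print(strided_block_mask(8, stride=4).int())
    print(global_local_block_mask(8, window=2, num_global=1).int())
```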
The core of UltraAttn is hierarchical context tiling at three levels, i.e., the node, device, and kernel levels, followed by an ILP-based runtime. Hierarchical context tiling automatically partitions and allocates the workload across nodes, devices, and kernels to minimize communication overhead under the constraint of computational load balance. Its output is a parallel dependent kernel graph. The ILP-based UltraAttn runtime then executes this kernel graph efficiently.
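As a rough mental model, the sketch below (a minimal illustration, not UltraAttn's actual data structures or partitioning algorithm) pictures hierarchical context tiling as splitting the query blocks of a causal context across nodes, then across devices within a node, and finally into per-kernel tiles; `KernelTile` and `tile_context` are hypothetical names. UltraAttn's real output is a dependency graph over such kernel tiles, which the ILP-based runtime schedules, and its tiling also balances the block sparse workload and minimizes cross-node key/value traffic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KernelTile:
    """One attention kernel launch: a (query-block, key-block) range on a device."""
    node: int
    device: int
    q_blocks: range
    k_blocks: range

def tile_context(num_blocks: int, nodes: int, devices_per_node: int,
                 kernel_blocks: int) -> List[KernelTile]:
    """Toy node -> device -> kernel tiling of the query blocks of a causal context."""
    tiles: List[KernelTile] = []
    per_node = num_blocks // nodes
    per_device = per_node // devices_per_node
    for n in range(nodes):
        for d in range(devices_per_node):
            start = n * per_node + d * per_device
            for q0 in range(start, start + per_device, kernel_blocks):
                q_rng = range(q0, min(q0 + kernel_blocks, start + per_device))
                # Causal attention: a kernel needs all key blocks up to its last query block.
                tiles.append(KernelTile(n, d, q_rng, range(0, q_rng.stop)))
    return tiles

if __name__ == "__main__":
    for t in tile_context(num_blocks=32, nodes=2, devices_per_node=4, kernel_blocks=2):
        print(f"node {t.node} device {t.device} q={list(t.q_blocks)} num_k={len(t.k_blocks)}")
```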
If you are curious about the design philosophy of UltraAttn, please refer to the UltraAttn design document.
End-to-end relative performance of UltraAttn and baselines on Llama2-7B. Performance is normalized to UltraAttn. The speedup annotation above the last bar shows UltraAttn’s gain over the best baseline.
Configuration: 
Relative performance, normalized to UltraAttn, of two types of block sparse attention, i.e., strided attention and global+local attention, for training on 8 to 64 GPUs.
Forward block sparse attention
Backward block sparse attention
Relative performance, normalized to UltraAttn, of two types of dense attention for training.
Forward dense attention
Relative performance, normalized to UltraAttn, of two types of block sparse attention for inference on 2 to 8 GPUs.

- Support document mask attention
- Support better kernel scheduling
- Support fusion into a megakernel
- Add FlexAttention to the kernel backends for more flexible kernel-level context tiling
- Support optimal memory management
- Support numerically stable implementation
We’d truly appreciate your feedback. If you run into any issues or have ideas for improvement, please don’t hesitate to share them by opening an issue.
If you use UltraAttn in your research, please cite our paper.
UltraAttn is licensed under the Apache License 2.0.


