MicroMix is a mixed-precision quantization method for large language models built on the MXFP8/MXFP6/MXFP4 microscaling formats ([paper on arXiv](https://arxiv.org/abs/2508.02343)).
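For intuition: in the OCP microscaling (MX) formats, values are stored in a narrow floating-point element type (FP8/FP6/FP4) while each 32-element block shares a single power-of-two scale. Below is a minimal, self-contained sketch of MXFP4-style (E2M1) fake quantization; it illustrates the format only and is not the MicroMix kernel:

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes; the largest is 6 = 1.5 * 2^2.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_mxfp4(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Quantize-dequantize x to an MXFP4-like grid, one shared scale per block."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared power-of-two (E8M0-style) scale per block: align the block's
    # largest exponent with the largest FP4 element exponent (2).
    scale = 2.0 ** (np.floor(np.log2(np.maximum(amax, 1e-12))) - 2)
    scaled = x / scale
    # Round every element to the nearest representable FP4 magnitude, keep sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)

x = np.random.randn(64).astype(np.float32)
print(f"max abs quantization error: {np.abs(x - fake_quant_mxfp4(x)).max():.4f}")
```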

```bash
conda create -n micromix python=3.10 -y
conda activate micromix
```

Please make sure that CUDA 12.8 is in your environment.
```bash
git clone --recurse-submodules https://github.com/lwy2020/MicroMix.git
cd MicroMix
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

The reorder indices and the `p6_num`, `p8_num` values are needed for quantization:

```bash
python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric mean
```

Results are saved in `saved/`.
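For intuition, the reordering step ranks channels by a calibration statistic (here the mean absolute activation, matching `--act_sort_metric mean`) so that a contiguous precision split can be applied afterwards. A rough sketch of that idea; the shapes, sort direction, and precision assignment below are assumptions for illustration, not the repo's exact logic:

```python
import numpy as np

# Illustrative calibration activations: [num_tokens, num_channels].
acts = np.random.randn(4096, 1024).astype(np.float32)

metric = np.abs(acts).mean(axis=0)            # per-channel mean |activation|
reorder_indices = np.argsort(metric)[::-1]    # assumed: descending magnitude

# Assumed split: the largest-magnitude channels get the widest format.
p8_num, p6_num = 128, 256                     # example values, not the repo's
mxfp8_ch = reorder_indices[:p8_num]
mxfp6_ch = reorder_indices[p8_num:p8_num + p6_num]
mxfp4_ch = reorder_indices[p8_num + p6_num:]
print(len(mxfp8_ch), len(mxfp6_ch), len(mxfp4_ch))
```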
Please refer to `mgemm/README.md`.

```bash
cd mgemm/
bash test.sh /PATH/TO/YOUR/MODEL/
bash eval_plus/test.sh Qwen/Qwen2.5-Coder-32B-Instruct '32B'
```

If you want to use the MicroMix kernel but not our algorithm, you can directly set `p4_num`, `p6_num`, and `p8_num` (lines 41-43 in `/model/qLinearLayer.py`) to the numbers you want 😄
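As a hypothetical illustration of such a manual override (the concrete numbers are arbitrary; check `qLinearLayer.py` for the real attribute layout):

```python
# Hypothetical illustration only: hard-coding the per-layer precision split
# instead of using the calibrated values (cf. lines 41-43 of /model/qLinearLayer.py).
# The numbers are arbitrary but should presumably sum to the layer's channel count.
p8_num = 128                      # highest-precision channels -> MXFP8
p6_num = 256                      # mid-precision channels     -> MXFP6
p4_num = 4096 - p8_num - p6_num   # the rest -> MXFP4 (assuming 4096 channels)
```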
MicroMix efficiency:

```bash
python benchmarks/benchmark_e2e_micromix.py --model 'llama-3.1-8b' --batch_size 8 --prefill_seq_len 2048
```

FP16 efficiency:

```bash
python benchmarks/benchmark_e2e_fp16.py --model /PATH/TO/YOUR_MODEL --batch_size 8 --prefill_seq_len 2048
```

INT8 efficiency:

```bash
pip install bitsandbytes==0.47.0
python benchmarks/benchmark_e2e_int8.py --model /PATH/TO/YOUR_MODEL --batch_size 12 --prefill_seq_len 2048
```
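For reference, an end-to-end prefill benchmark of this kind boils down to timing one forward pass over a `[batch_size, prefill_seq_len]` batch. A self-contained FP16 sketch with HuggingFace Transformers; the warmup count and timing loop are illustrative, not the repo's script:

```python
import time
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/PATH/TO/YOUR_MODEL", torch_dtype=torch.float16, device_map="cuda")
model.eval()

batch_size, prefill_seq_len = 8, 2048
input_ids = torch.randint(
    0, model.config.vocab_size, (batch_size, prefill_seq_len), device="cuda")

with torch.no_grad():
    for _ in range(3):           # warmup iterations
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model(input_ids)             # one timed prefill forward pass
    torch.cuda.synchronize()

print(f"prefill latency: {time.perf_counter() - start:.3f} s")
```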
If you find MicroMix useful, please cite:

```bibtex
@misc{liu2025micromixefficientmixedprecisionquantization,
      title={MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models},
      author={Wenyuan Liu and Haoqian Meng and Yilun Luo and Peng Zhang and Xindian Ma},
      year={2025},
      eprint={2508.02343},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.02343},
}
```
Our code is built on the following repos; thank you for your contributions to the community 👍: