MicroMix

MicroMix is a mixed-precision quantization method for large language models built on the MXFP8, MXFP6, and MXFP4 microscaling formats (paper: arXiv:2508.02343).
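For context, microscaling (MX) formats store elements in a low-bit floating-point type (FP8/FP6/FP4) while small blocks of elements (32 in the OCP MX specification) share a single power-of-two scale. The sketch below only illustrates that scale-then-quantize structure; the coarse element rounding and the helper itself are assumptions for demonstration, not the MicroMix implementation.

```python
import torch

def mx_block_quantize(x: torch.Tensor, block: int = 32, elem_max: float = 6.0) -> torch.Tensor:
    """Toy MX-style block quantization: one power-of-two scale per `block` elements.

    elem_max is the largest magnitude of the element format (6.0 for FP4 E2M1).
    The rounding below is a crude stand-in for a real FP4 encoder.
    """
    orig_shape = x.shape
    x = x.reshape(-1, block)
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # Shared per-block scale, rounded down to a power of two so the block max fits.
    scale = torch.exp2(torch.floor(torch.log2(amax / elem_max)))
    q = torch.clamp(x / scale, -elem_max, elem_max)
    q = torch.round(q * 2) / 2  # coarse value grid standing in for the E2M1 grid
    return (q * scale).reshape(orig_shape)

x = torch.randn(2, 64)
print((x - mx_block_quantize(x)).abs().max())  # error introduced by the toy encoder
```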

1. Installation

conda create -n micromix python=3.10 -y
conda activate micromix

Please make sure CUDA 12.8 is installed in your environment.

git clone --recurse-submodules https://github.com/lwy2020/MicroMix.git
cd MicroMix
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt

2. Usage

2.1 Preprocessing

The reorder indices and the p6_num / p8_num counts are needed for quantization. Generate them with:

python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 32 --seqlen 2048 --act_sort_metric mean

The results are saved under saved/.
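As a rough sketch of what the --act_sort_metric mean option suggests (an assumption, not the script's actual code): channels could be ranked by their mean activation magnitude over the calibration samples, and the resulting order used as the reorder indices, with the largest-magnitude channels being candidates for the wider MXFP6/MXFP8 formats.

```python
import torch

# Sketch under assumptions (not the repository's code): rank a layer's input
# channels by mean absolute activation on calibration data, largest first.
def reorder_by_mean_activation(acts: torch.Tensor) -> torch.Tensor:
    """acts: [num_tokens, hidden_dim] calibration activations for one layer."""
    metric = acts.abs().mean(dim=0)                # per-channel mean magnitude
    return torch.argsort(metric, descending=True)  # hypothetical reorder indices

acts = torch.randn(1024, 4096)  # dummy calibration activations
idx = reorder_by_mean_activation(acts)
print(idx[:8])  # channels that would be considered first for MXFP8/MXFP6
```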

2.2 Building Kernels

Please refer to mgemm/README.md

cd mgemm/

2.3 Zero-shot, Few-shot Accuracy and Perplexity Evaluation

bash test.sh /PATH/TO/YOUR/MODEL/

2.4 Code Generation Evaluation

bash eval_plus/test.sh Qwen/Qwen2.5-Coder-32B-Instruct  '32B'

If you want to use the MicroMix kernels but not our algorithm, you can directly set p4_num, p6_num, and p8_num (lines 41-43 in model/qLinearLayer.py) to the numbers you want 😄 See the sketch below.
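For example, assuming the three counts partition a layer's input channels, a manual split could look like the helper below (illustrative only; the function and its defaults are not part of the repository, only the names p4_num / p6_num / p8_num are).

```python
# Illustrative helper (not from the repository): pick a fixed precision split,
# assuming p4_num + p6_num + p8_num equals the layer's number of input channels.
def fixed_precision_split(in_features: int, frac8: float = 0.03125, frac6: float = 0.125):
    p8_num = int(in_features * frac8)        # channels kept in MXFP8
    p6_num = int(in_features * frac6)        # channels kept in MXFP6
    p4_num = in_features - p6_num - p8_num   # everything else in MXFP4
    return p4_num, p6_num, p8_num

print(fixed_precision_split(4096))  # (3456, 512, 128)
```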

3. Efficiency Evaluation

MicroMix efficiency:

python benchmarks/benchmark_e2e_micromix.py --model 'llama-3.1-8b' --batch_size 8 --prefill_seq_len 2048

FP16 efficiency:

python benchmarks/benchmark_e2e_fp16.py --model /PATH/TO/YOUR_MODEL --batch_size 8 --prefill_seq_len 2048

INT8 efficiency:

pip install bitsandbytes==0.47.0
python benchmarks/benchmark_e2e_int8.py --model /PATH/TO/YOUR_MODEL --batch_size 12 --prefill_seq_len 2048

Citation

@misc{liu2025micromixefficientmixedprecisionquantization,
      title={MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models}, 
      author={Wenyuan Liu and Haoqian Meng and Yilun Luo and Peng Zhang and Xindian Ma},
      year={2025},
      eprint={2508.02343},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.02343}, 
}

Acknowledgement

Our code is built on the following repositories; thank you for your contributions to the community 👍:
