Releases: intel/auto-round
v0.9.4 patch release
v0.9.3 release
Highlights
- Added ark backend by @Zhenzhong1 in #1075
- reduce vram usage for optimized RTN mode by @wenhuach21 in #1043
- Support alg_ext on windows by @chensuyue in #1082
- adjust 2/3 bits hyperparameters at auto-round-best by @wenhuach21 in #1081
- fix bug where layers outside blocks do not use best_params by @n1ck-guo in #1128
- Fix nvfp4 & fp8 packing RAM issue and refine RAM release of the original layer for all exporting formats by @yiliu30 in #1129
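
To make the "reduce vram usage for optimized RTN mode" highlight above concrete, here is a minimal sketch of RTN-mode quantization. It assumes that iters=0 selects the RTN path and that the model name, output path, and quantize_and_save method match the project README around this release; treat it as an illustration rather than verbatim release content.

```python
# Hedged sketch: iters=0 selects the RTN path whose VRAM usage this release
# trims; the model name and output directory are placeholders only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits/group_size are the usual weight-only settings; iters=0 skips tuning.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=0)
autoround.quantize_and_save("./opt-125m-w4g128-rtn", format="auto_round")
```

Setting iters to a positive value switches back to the full tuning path.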
What's Changed
- fix accuracy regression by @wenhuach21 in #1041
- add environment.md and remove AutoRoundMLLM usage in readme by @xin3he in #1042
- add memory monitor and import auto-scheme on demand by @wenhuach21 in #1049
- Update get_block_names func by @mengniwang95 in #1047
- set add_bos_token=True for llama model by @n1ck-guo in #1046
- Announce llmc integration by @yiliu30 in #1055
- [High Risk]reduce vram usage for optimized RTN mode by @wenhuach21 in #1043
- Add static FP8 attention support by @yiliu30 in #1045
- Enhance tokenizer saving by @mengniwang95 in #1057
- Revert "Add static FP8 attention support" by @yiliu30 in #1060
- refine readme by @wenhuach21 in #1063
- Fix transformers==4.57.1 in CI by @XuehaoSun in #1066
- Add static FP8 attention support by @yiliu30 in #1061
- Remove tbb by @yiliu30 in #1069
- support for gguf mixed q2_k_s by @n1ck-guo in #1059
- Add compatibility test for ARM by @XuehaoSun in #1073
- Optimize CPU CI pipelines by @XuehaoSun in #1071
- Export KV Scheme in LLMC config by @yiliu30 in #1068
- Add LLMC integration test by @yiliu30 in #1053
- Support transformers loading quantized moe model by @mengniwang95 in #1067
- update alg_ext and add ut by @n1ck-guo in #1064
- simplify what's new and add publication_list by @xin3he in #1070
- Add MXFP8 MOE/Linear and MXFP4 Linear by @yiliu30 in #1034
- improve accuracy for 2bit with auto-round-best by @wenhuach21 in #1078
- Support mxfp nvfp lmhead quant by @WeiweiZhang1 in #1051
- refine sampler by @wenhuach21 in #1077
- fix bf16 option in AutoScheme by @wenhuach21 in #1079
- adjust 3 bits hyperparameters at auto-round-best by @wenhuach21 in #1081
- Support alg_ext on windows by @chensuyue in #1082
- [vLLM Ext]Fix MXFP4 Quant by @yiliu30 in #1088
- remove numpy restriction for gptq kernel by @wenhuach21 in #1084
- Fix MXFP/NVFP + FP8 Attn/KV by @yiliu30 in #1086
- Remove accelerate version limitation by @chensuyue in #1090
- bump version to v0.9.3 by @chensuyue in #1091
- add system checker in backend by @wenhuach21 in #1097
- Refactor input normalization by replaying inputs for consistent preprocessing by @yiliu30 in #1094
- fix gguf acc and oom bug when iters > 0 by @n1ck-guo in #1098
- Add Python 3.14 to compatibility test pipeline by @XuehaoSun in #1096
- Fix typo in README.md by @xin3he in #1102
- refine lmhead ut by @WeiweiZhang1 in #1106
- Move packed res to cpu by @yiliu30 in #1104
- remove non-essential requirements by @n1ck-guo in #1103
- fix auto-scheme/alg-ext multiple devices issue by @wenhuach21 in #1107
- fix release version cuda ut fail by @n1ck-guo in #1110
- fix gguf packing device by @n1ck-guo in #1105
- fix quant fp8 model with iters=0 and scheme=nvfp4 by @n1ck-guo in #1114
- Move quantized block to cpu by @yiliu30 in #1115
- fix gguf extension bug by @wenhuach21 in #1116
- fix bug of triton pow wrong data_type when enable torch compile by @n1ck-guo in #1120
- Use modelscope cache in CPU UT by @XuehaoSun in #1124
- fix bug where layers outside blocks do not use best_params by @n1ck-guo in #1128
- Fix nvfp4 & fp8 packing RAM issue and refine RAM release of the original layer for all exporting formats by @yiliu30 in #1129
- fix regression by @xin3he in #1135
- Upgrade llmc to main and add cuda UT by @yiliu30 in #1111
- Enable load MXFP4/MXFP8 + FP8 KV by @yiliu30 in #1095
- Remove duplicate packages by @XuehaoSun in #1139
- fix bug of auto scheme with user layer config by @n1ck-guo in #1133
- Add AutoRound binary build and publish workflow by @chensuyue in #1132
- update readme by @wenhuach21 in #1141
- update document for eval by @xin3he in #1140
- update windows binary for alg_ext by @chensuyue in #1142
- Fix device mismatch of nvfp fuse scale by @WeiweiZhang1 in #1143
- add low_cpu_mem_usage in cli by @n1ck-guo in #1146
- fix cuda ut fail by @n1ck-guo in #1144
- add llmc for cuda ut by @yiliu30 in #1145
- Added ark backend by @Zhenzhong1 in #1075
New Contributors
- @Zhenzhong1 made their first contribution in #1075
Full Changelog: v0.9.2...v0.9.3
v0.9.2 patch release
Remove accelerate version limitation #1090
v0.9.1 patch release
Fix installation on ARM devices.
v0.9.0
Highlights
- support automatic mixed bits assignment by @wenhuach21 in #851
- optimize rtn for int woq by @wenhuach21 in #924
- support for model scope by @n1ck-guo in #957
- enhance auto device map and support XPU by @xin3he in #961
- support for immediate saving to reduce ram usage by @Kaihui-intel in #965
- update gguf alg ext by @n1ck-guo in #1026
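
As a hedged illustration of the automatic mixed bits assignment highlight above: the sketch below assumes an AutoScheme object that takes avg_bits and options (parameter names inferred from these notes and the project README, not verified against this exact release) and uses a placeholder model name.

```python
# Hedged sketch of automatic mixed-bits assignment via AutoScheme.
# avg_bits / options are assumed parameter names; the model and output
# directory are placeholders.
from auto_round import AutoRound, AutoScheme

scheme = AutoScheme(avg_bits=3.0, options=("W2A16", "W4A16"))  # assumed signature
ar = AutoRound("Qwen/Qwen2.5-1.5B-Instruct", scheme=scheme)
ar.quantize_and_save("./qwen-mixed-avg3bit")
```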
What's Changed
- Fix rtn tuning_device issue by @Kaihui-intel in #893
- fix vlm gguf ut by @n1ck-guo in #895
- update alg_ext.abi3.so with python compatible version by @chensuyue in #894
- move ste from quant to round for nvfp4 by @xin3he in #889
- Add GPT-OSS quant support by @yiliu30 in #887
- better help printing information by @n1ck-guo in #883
- speedup quant and evaluation, fix recompile issue by @xin3he in #897
- fix nvfp act quantization bug by @WeiweiZhang1 in #891
- support automatic mixed bits assignment by @wenhuach21 in #851
- try to fix gguf vram issue on windows by @wenhuach21 in #886
- remove numba from requirements by @yiliu30 in #905
- Extend mxfp loading dtypes by @yiliu30 in #907
- block dataset logger info by @n1ck-guo in #908
- fix torch compile issue in AutoScheme by @wenhuach21 in #909
- Revert "Extend mxfp loading dtypes" by @wenhuach21 in #915
- support disable_opt_rtn in auto-scheme by @wenhuach21 in #913
- fix llama 4 ut by @n1ck-guo in #896
- Add numba for cpu lib by @yiliu30 in #919
- Loosen the packing restrictions for mxfp&nvfp by @WeiweiZhang1 in #911
- Extend mxfp loading dtypes by @yiliu30 in #916
- Fix act config exporting for mixed schemes by @WeiweiZhang1 in #903
- optimize rtn for int woq by @wenhuach21 in #924
- fix bug of gguf and support for LiquidAI/LFM2-1.2B by @n1ck-guo in #927
- remove numpy<2.0 limitation by @xin3he in #921
- enable regex quantization config saving for mixed bits by @WeiweiZhang1 in #825
- Fix Flux tuning issue by @mengniwang95 in #936
- gguf support for inclusionAI/Ling-flash-2.0 by @n1ck-guo in #940
- remove low_cpu_mem by @n1ck-guo in #934
- Add compatibility test by @XuehaoSun in #918
- Add commit hash to version by @XuehaoSun in #941
- gguf weight type align with original, output.weight, token_embed by @n1ck-guo in #900
- support attention mask in user's dataset by @wenhuach21 in #930
- Add diffusion README by @mengniwang95 in #923
- update readme by @wenhuach21 in #949
- refactor utils file by @n1ck-guo in #943
- update readme for sglang support by @WeiweiZhang1 in #953
- update gguf and support for CompressedLinear by @n1ck-guo in #950
- Reduce AutoScheme VRAM usage by up to 10X by @wenhuach21 in #944
- add self attribution and fix avg_bits error by @xin3he in #956
- add logo by @wenhuach21 in #960
- refine AutoScheme readme/code by @wenhuach21 in #958
- update readme by @wenhuach21 in #962
- fix critical disable_opt_rtn regression by @wenhuach21 in #963
- [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) by @yiliu30 in #935
- fix bug of imatrix contains 0 by @n1ck-guo in #955
- fix rtn bug by @mengniwang95 in #966
- enhance flux doc by @mengniwang95 in #967
- clean code by @wenhuach21 in #968
- support for model scope by @n1ck-guo in #957
- merge main branch to alg_ext by @wenhuach21 in #970
- fix cuda CI backend issue, fix typo by @WeiweiZhang1 in #974
- disable compile packing by default by @yiliu30 in #975
- enhance auto device map and support XPU by @xin3he in #961
- refine readme by @wenhuach21 in #978
- cli support for positional arguments model by @n1ck-guo in #979
- update bits in UT by @xin3he in #986
- fix gguf scheme and device_map bug by @n1ck-guo in #969
- add support for Magistral-Small by @n1ck-guo in #980
- support model_dtype and fix bug of scheme contains quotes, mllm eval by @n1ck-guo in #985
- fix bug of cannot create adam compressor by @n1ck-guo in #992
- [CI] Update python to 3.12 and torch to 2.8.0 by @XuehaoSun in #741
- fix lm head bug and rm clear_mem_reach_threhold by @wenhuach21 in #997
- Reduce peak gpu memory usage and support moe estimation by @xin3he in #981
- fix cuda ut bug by @n1ck-guo in #999
- fix mllm device_map ut by @Kaihui-intel in #1000
- refine md tables by @WeiweiZhang1 in #994
- Refine exllamav2 ut by @WeiweiZhang1 in #1001
- Support for immediate saving to reduce ram usage by @Kaihui-intel in #965
- Fix diffusion multi-device ut issue by @mengniwang95 in #1002
- fix multiple devices map issue in calibration by @wenhuach21 in #1003
- Fix non auto device map by @WeiweiZhang1 in #1005
- fix multiple devices issue in Compressor and AutoScheme by @wenhuach21 in #1007
- fix cuda low_cpu_mem_usage ut by @Kaihui-intel in #1010
- Fix param missing bug by @mengniwang95 in #1008
- add device list to clear memory by @wenhuach21 in #1009
- Minor refactor for LLMC by @yiliu30 in #993
- fix one clear memory issue by @wenhuach21 in #1011
- add ut for gguf alg_ext and update so file by @n1ck-guo in #1012
- fix multi cuda ut bug by @n1ck-guo in #1014
- Including auto_scheme.default_alg into pypi by @chensuyue in #1018
- add num_device check for set_auto_device_map_for_block_with_tuning by @xin3he in #1021
- dispatch model with real max memory by @xin3he in #1022
- fix cuda ut by @n1ck-guo in #1020
- disable itrex format first by @WeiweiZhang1 in #998
- fix bug of lm_head and dispatch model,gguf eval by @n1ck-guo in #1025
- Fix the missing temporary name by @yiliu30 in #1029
- Reduce mem usage of GPT-OSS by @yiliu30 in #1013
- update gguf alg ext by @n1ck-guo in #1026
- optimize vram for gguf and add momentum by @wenhuach21 in #1031
- fix incorrect model name in readme by @wenhuach21 in #1035
- Bump into v0.9.0 by @XuehaoSun in #1024
Full Changelog: v0.8.0...v0.9.0
v0.8.0
Highlights
- merge all api(MLLM, Adam) into AutoRound by @n1ck-guo in #791
- MXFP4 and MXFP8 loading support by @yiliu30 in #832
- Support Flux quantization by @mengniwang95 in #850
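
For the MXFP4/MXFP8 loading highlight, the sketch below shows the expected inference-side flow of loading a previously exported auto-round checkpoint through transformers; the quantized_dir path is hypothetical, and depending on the export format an explicit quantization_config (e.g. AutoRoundConfig from auto_round) may be needed.

```python
# Hedged sketch: loading a previously exported auto-round checkpoint
# (e.g. an MXFP4/MXFP8 one) through transformers; the path is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_dir = "./my-model-mxfp4"  # hypothetical output of a prior quantize run
model = AutoModelForCausalLM.from_pretrained(
    quantized_dir, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(quantized_dir)

inputs = tokenizer("auto-round is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```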
What's Changed
- fix cuda ut bug of use_deterministic_algorithms by @n1ck-guo in #805
- remove torch compile in nv quant by @wenhuach21 in #807
- Support loading for static quant weight fp8 act fp8 by @yiliu30 in #730
- fix bug of q_layer_inputs by @n1ck-guo in #811
- fix gptqmodel inference issue by @wenhuach21 in #813
- Bump version to v0.7.0 by @XuehaoSun in #814
- fix nsamples in get_dataloader by @wenhuach21 in #804
- Refine logger and add envs by @yiliu30 in #817
- Fix llm-compressor export by @Kaihui-intel in #820
- enhance auto-round eval with vllm backend by @xin3he in #815
- rm triton from requirements and correct the supported python version to 3.10(+) by @wenhuach21 in #824
- move environment variable setting into eval function by @xin3he in #829
- bump version to 0.8.0.dev by @XuehaoSun in #830
- [STEP 1] merge all api(MLLM, Adam) into AutoRound by @n1ck-guo in #791
- add support for scheme FP8_STATIC to export llm_compressor format by @n1ck-guo in #816
- fix format checking bug by @WeiweiZhang1 in #836
- MXFP4 and MXFP8 loading support by @yiliu30 in #832
- hpu build with auto_round package name by @chensuyue in #838
- fix hpu detect issue by @xin3he in #823
- fix severe vram leak regression in auto-round format packing by @wenhuach21 in #842
- fix tp device issue caused by device_map by @xin3he in #833
- fix log error by @n1ck-guo in #843
- [High Risk]Refine inference code by @wenhuach21 in #840
- fix gguf fp8 input model and vram issue by @wenhuach21 in #844
- NVFP4 Loading support by @yiliu30 in #839
- fix extra config by @n1ck-guo in #847
- change the method of detecting linear by @n1ck-guo in #849
- fix device_map setting by @Kaihui-intel in #854
- Add typo checker by @XuehaoSun in #846
- fix parse layer config bug by @wenhuach21 in #856
- Refine BackendInfo to include act fields by @yiliu30 in #848
- fix bug of data_type fp8_sym by @n1ck-guo in #855
- fix save_quantized format checker by @WeiweiZhang1 in #857
- fix bug of get_layer_names_in_block by @wenhuach21 in #861
- raise vlm loading error by @wenhuach21 in #863
- fix FP8 model as input and backend issue by @wenhuach21 in #864
- fix seqlen bug and calib slow of mllm tuning by @n1ck-guo in #871
- fix device bug by @xin3he in #873
- fix vllm backend evaluation by @xin3he in #872
- Optimize CPU unit test workflow by @XuehaoSun in #881
- Fix Cuda CI failures due to Transformers and AWQ incompatibility by @WeiweiZhang1 in #882
- Support Flux quantization by @mengniwang95 in #850
- fp8 exporting bugfix by @WeiweiZhang1 in #874
- lm_eval stop try except and add back missing arguments by @xin3he in #884
- Fix act calibration bug by @mengniwang95 in #880
- restrict accelerate version by @wenhuach21 in #885
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #868
- update require accelerate version by @n1ck-guo in #888
Full Changelog: v0.7.1...v0.8.0
v0.7.1 patch release
fix severe vram leak regression in auto-round format packing #842
v0.7.0
🚀 Highlights
- Enhanced NVFP4 algorithm and added support to export MXFP4/NVFP4 to the llm-compressor format by @WeiweiZhang1 and @wenhuach21
- Improved W2A16 quantization algorithm by @wenhuach21
- Introduced the scheme interface for easier configuration of quantization settings by @wenhuach21
- Added support for using FP8 models as input and str name as model input in API by @wenhuach21 and @n1ck-guo
- Unified device and device_map arguments and introduced device_map="auto" to simplify quantization of extremely large models by @Kaihui-intel
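
A minimal sketch of the new scheme interface together with the unified device/device_map handling follows, assuming the preset scheme names (e.g. NVFP4, W2A16) and the llm_compressor format string mentioned elsewhere in these notes; the model name and output path are placeholders.

```python
# Hedged sketch: str model input, a preset scheme, and device_map="auto".
from auto_round import AutoRound

ar = AutoRound(
    "meta-llama/Llama-3.1-8B",  # str model name accepted directly (placeholder)
    scheme="NVFP4",             # preset scheme; e.g. "W2A16" for the improved 2-bit path
    device_map="auto",          # spread very large models across available devices
)
ar.quantize()
ar.save_quantized("./llama-nvfp4", format="llm_compressor")
```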
What's Changed
- fix ut import issue by @WeiweiZhang1 in #686
- support to export static afp8 model by @n1ck-guo in #662
- Add ruff and isort by @XuehaoSun in #578
- Improved log message for unsupported dataset by @wenhuach21 in #688
- support rceil for mxfp by @wenhuach21 in #660
- Add black and blacken-docs in pre-commit by @XuehaoSun in #692
- support static global scale for nvfp4 and update readme by @wenhuach21 in #691
- Update readme by @wenhuach21 in #695
- Add script for cuda unit test by @XuehaoSun in #567
- support to save image_processor by @n1ck-guo in #694
- support for static activation quantization calibration with group_size by @n1ck-guo in #693
- fix xpu oom checker by @n1ck-guo in #705
- FIXBUG: CPU Offloading for Cache Blocks in Low-Memory GPU Systems or Single GPU on ROCM Configs by @JartX in #703
- fix bug of zero accuracy for mx-fp by @n1ck-guo in #709
- catch oom error and move to cpu directly by @n1ck-guo in #708
- code optimization of vlm by @n1ck-guo in #704
- fix critical bug of gguf tuning by @wenhuach21 in #710
- support fp8 model and str as input in llm quantization by @wenhuach21 in #699
- change act_scale to input_scale for fp8 export by @n1ck-guo in #711
- simplify CpuInfo class by @wenhuach21 in #715
- Update step_by_step.md by @wenhuach21 in #717
- fix bug of activation quant when act_max is None by @n1ck-guo in #718
- Bump transformers in /test/test_cuda by @dependabot[bot] in #719
- Freeze torchvision version in CI by @XuehaoSun in #720
- update autoround mllm and support Mistral 3.2 series by @n1ck-guo in #713
- Fix hpu CI by @XuehaoSun in #723
- fix fp8 model input issue by @wenhuach21 in #724
- update gguf convert.py and support for gpt-oss by @n1ck-guo in #721
- new cast_to_nvfp4 with high performance by @xin3he in #727
- make the tuning deterministic and move infrequently used arguments to kwargs by @wenhuach21 in #726
- add original convert file and support for the newest llama.cpp by @n1ck-guo in #729
- fix bug for exporting afp8 fake format by @n1ck-guo in #731
- Fix torch_zp infer bug & API disable_deterministic_algorithms bug by @WeiweiZhang1 in #733
- fix gguf mistral_common import by @n1ck-guo in #736
- Enable mxfp exporting by @WeiweiZhang1 in #649
- support for glm4.5 gguf by @n1ck-guo in #735
- support auto-round-mllm command by @n1ck-guo in #742
- Optimize pack zeros for int sym by @WeiweiZhang1 in #743
- fix UT check for int zp by @WeiweiZhang1 in #745
- support llama4 quant by @mengniwang95 in #744
- fix bug of loading fp8 model by @n1ck-guo in #747
- improved algorithm for int2 by @wenhuach21 in #748
- Add Static FP8 KV Support by @yiliu30 in #737
- refine code by @wenhuach21 in #749
- mllm supports loading fp8 model and fix bug of loading fp8 model by @n1ck-guo in #750
- support deepspeed LinearLayer and LinearAllreduce by @xin3he in #698
- fix alg_ext moe and model str input bug by @wenhuach21 in #751
- api support for fp8 model and mllm api support load from str by @n1ck-guo in #752
- fix some torch compile warnings by @wenhuach21 in #755
- Speedup FP4 packing by @yiliu30 in #760
- fix_script_fp_layer_config_for_bits_checking by @WeiweiZhang1 in #756
- support quant lm_head for rtn w8afp8 static quant by @n1ck-guo in #754
- Revert "Speedup FP4 packing" by @yiliu30 in #763
- refine code and fix activation quantization eval regression by @wenhuach21 in #762
- fix gguf ut bug by @n1ck-guo in #767
- fix gguf bug by int zp by @n1ck-guo in #771
- Keep the model’s buffer dtype unchanged in most cases by @wenhuach21 in #770
- fix set_layer_config bug by @wenhuach21 in #768
- fix bug of auto_round exporting by @n1ck-guo in #772
- gguf format supports for fp8 model by @n1ck-guo in #778
- [API CHANGE] Stage 1 add quant scheme and consolidate device and device_map by @wenhuach21 in #774
- Speedup FP4 packing by @yiliu30 in #766
- hot fix for nvfp4 scheme by @wenhuach21 in #784
- fix alg_ext regression and support mxfp4 in it with slight improvement by @wenhuach21 in #785
- refine nvfp code, typofix by @WeiweiZhang1 in #777
- mxfp/nvfp/fp8 support torch compile in tuning by @wenhuach21 in #789
- refine nvfp4 algorithm by @wenhuach21 in #790
- add limit arg for eval by @n1ck-guo in #764
- torch backend bugfix and speedup ut by @WeiweiZhang1 in #793
- Support auto device mapping by @Kaihui-intel in #781
- fix bug and add nvfp in alg-ext with slight improvement by @wenhuach21 in #794
- rename llmcompressor to llm_compressor for align with other formats by @WeiweiZhang1 in #780
- align formats packing device to API by @WeiweiZhang1 in #795
- add fp8 export format check by @n1ck-guo in #779
- fix several regressions including lm-head quantization, 3bit asym torch backend, etc. by @wenhuach21 in #796
- refine readme by @wenhuach21 in #798
- fix typo in readme by @wenhuach21 in #799
- fix several cuda ut bug by @n1ck-guo in #797
- enable model python files saving by @WeiweiZhang1 in #802
- AutoRoundMLLM supports scheme and fix device_map=dict regression by @n1ck-guo in #801
- improve the robustness of scheme by @wenhuach21 in #803
- fix mxfp exporting by @WeiweiZhang1 in #806
New Contributors
- @JartX made their first contribution in #703
- @mengniwang95 made their first contribution in #744
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Highlights
- provide experimental support for gguf q*_k format and customized mixed bits setting
- support xpu in triton backend by @wenhuach21 in #563
- add torch backend by @WeiweiZhang1 in #555
- provide initial support of llmcompressor format, only INT8 W8A8 dynamic quantization is supported by @xin3he in #646
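
The experimental gguf q*_k export and the customized mixed bits setting could be combined roughly as sketched below; the layer name, the layer_config schema, and the gguf:q4_k_m format string are illustrative assumptions rather than verified against this release.

```python
# Hedged sketch: per-layer mixed-bits overrides plus a gguf q*_k export;
# the layer name and config keys are hypothetical examples.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed schema: map layer name -> per-layer overrides.
layer_config = {"model.layers.0.mlp.down_proj": {"bits": 8}}

ar = AutoRound(model, tokenizer, bits=4, layer_config=layer_config)
ar.quantize()
ar.save_quantized("./qwen-gguf-q4_k_m", format="gguf:q4_k_m")
```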
What's Changed
- bump version into v0.5.1 by @XuehaoSun in #540
- Freeze pytorch & ipex version in CI by @XuehaoSun in #541
- fix_quantization_config_for_inference by @WeiweiZhang1 in #542
- [critical bug] remove redundant round in dq simulation by @wenhuach21 in #543
- update readme by @wenhuach21 in #550
- add recipes for qwen3 8b and 14b by @n1ck-guo in #552
- itrex requires torch<2.7 by @XuehaoSun in #548
- [GGUF STEP4] fix search bug and improve packing & eval speed by @n1ck-guo in #545
- refine xpu requirement/config json and fix several issues by @wenhuach21 in #558
- add UE5M3 simulation by @wenhuach21 in #562
- support xpu in triton backend by @wenhuach21 in #563
- fix typo in backend by @wenhuach21 in #564
- update habana docker to 1.21.0 by @XuehaoSun in #566
- Support for more gguf format and float zp for Q*_1 by @n1ck-guo in #560
- update readme by @wenhuach21 in #569
- update readme by @wenhuach21 in #571
- support for llava-based hf model by @n1ck-guo in #568
- add gguf accuracy data by @wenhuach21 in #574
- add sym & asym gguf quant for gguf baseline (iter==0) by @n1ck-guo in #573
- modify default asym 4bits auto-round format to awq, fix save folder typo for mllm by @WeiweiZhang1 in #575
- improve the robustness of parsing vlm config by @wenhuach21 in #577
- switch to transformers API in cpu ut by @wenhuach21 in #580
- add torch backend by @WeiweiZhang1 in #555
- fix awq exporting at group_size=-1 by @wenhuach21 in #579
- refactor cuda ut to facilitate automation by @n1ck-guo in #559
- fix tensor shape mismatch error for API usage by @WeiweiZhang1 in #582
- fix device bug at calibration by @wenhuach21 in #587
- Update gguf_accuracy (q3_ks) by @SinpackKonmakan in #590
- add recipes for deepseek-r1-0528 by @n1ck-guo in #588
- correct errors of deepseek-r1-0528 recipes by @n1ck-guo in #591
- fix cuda ut by @wenhuach21 in #592
- Bump protobuf from 3.20.1 to 3.20.2 in /test/test_cuda by @dependabot[bot] in #585
- rm unnecessary forward to improve speed by @wenhuach21 in #593
- update readme by @wenhuach21 in #597
- fix q2k bug by @n1ck-guo in #599
- support for q4_k_m by @n1ck-guo in #596
- fix vlm unit test path error by @WeiweiZhang1 in #601
- fix lots of gguf critical bugs and support imatrix in rtn mode by @wenhuach21 in #595
- fix gguf bug by @wenhuach21 in #610
- mv some checkers by @wenhuach21 in #611
- fix gguf packing bug and moe regression by @wenhuach21 in #614
- support customized mixed bits for gguf by @wenhuach21 in #615
- fix double quant sym bug by @wenhuach21 in #616
- FP8 WOQ export by @wenhuach21 in #617
- fix bug of q5_k_s w/ imatrix by @n1ck-guo in #620
- add auto-round related vllm and transformers UT by @WeiweiZhang1 in #613
- refine_doc_0624 by @WeiweiZhang1 in #619
- fix not using imatrix for gguf at rtn mode by @wenhuach21 in #623
- fix vlm hf config loading issue by @WeiweiZhang1 in #624
- refine gguf rtn algorithm and fix bugs by @wenhuach21 in #630
- fix gguf bug of moe models and lmhead/embedding bits setting regression by @n1ck-guo in #628
- [BUG FIX] fix bug of deepseek gguf:q*k by @n1ck-guo in #637
- support packing immediately for gguf to reduce ram usage by @wenhuach21 in #638
- support llmcompressor format by @xin3he in #646
- fix norm_bias_tuning by @wenhuach21 in #639
- [W4A8]Fix Packing by @yiliu30 in #648
- Integrate RTN quantization into GGUF packing to enhance robustness by @n1ck-guo in #644
- Remove vlm cuda UT dependencies version restrictions by @XuehaoSun in #651
- speedup mxfp tuning and fix nvfp bug by @wenhuach21 in #647
- support two more calib datasets and fix embedding layer bug by @wenhuach21 in #653
- fix some issues by @wenhuach21 in #655
- fix bug of q4_0 and q5_0 at iters==0 by @n1ck-guo in #658
- support vlm models for gguf format by @n1ck-guo in #654
- fix bug of block-wise quant imatrix by @n1ck-guo in #663
- fix gguf block-wise issue by @wenhuach21 in #664
- fix bugs of export deepseek gguf format when iters=0 and q3k accuracy by @n1ck-guo in #665
- handle zeros in imatrix by @wenhuach21 in #667
- fix ut issue by @WeiweiZhang1 in #668
- fix cuda hanging issue during packing by @WeiweiZhang1 in #669
- support to use lm_eval for vlm by @n1ck-guo in #670
- add trust remote code to gguf format load tokenizer by @n1ck-guo in #675
- fix 3bits asym accuracy and calib dataset issues by @WeiweiZhang1 in #674
- restrict accelerate version to reduce ram usage by @wenhuach21 in #673
- rm low_cpu when loading the model by @wenhuach21 in #676
- rm_old_vlm_cuda_ut by @WeiweiZhang1 in #678
- update gguf convert file and fix permute bug by @n1ck-guo in #679
- fix gguf regression for large models by @wenhuach21 in #680
- fix gemma vlm gguf regression by @wenhuach21 in #685
New Contributors
- @SinpackKonmakan made their first contribution in #590
- @xin3he made their first contribution in #646
Full Changelog: v0.5.1...v0.6.0
v0.5.1: bug fix release
What's Changed
- bump version into v0.5.0 by @XuehaoSun in #538
- fix triton multiple gpus and some other issues by @wenhuach21 in #539
Full Changelog: v0.5.0...v0.5.1