Feat (brevitas_examples/llm): Support for batched inputs in GPXQ/Qronos forward passes. #1427
base: dev
Conversation
Co-authored-by: Pablo Monteagudo Lago <44771380+pablomlago@users.noreply.github.com>
pablomlago left a comment:
LGTM, I'll wait for @Giuseppe5 to review it before merging.
Update the description above to match the new version. You could paste the results in the PR description above.
Reason for this PR
Currently in `brevitas_examples/llm`, when GPTQ, GPFQ or Qronos are applied, a separate LLM forward pass is run for each sample in the calibration data. Performing these forward passes per batch instead of per sample saves considerable time when running these algorithms. For example, for Llama-3.2-1B, applying GPTQ with the current setup and a calibration size of 512 takes around 1h, whereas the same configuration with the calibration data processed in batches of 128 samples reduces the runtime to ~0.5h (similar or greater speedups were observed for Qronos and GPFQ).
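The idea can be illustrated with a minimal, self-contained sketch (the model, shapes and names below are placeholders for illustration, not the `brevitas_examples/llm` code): the same calibration set is consumed in fewer, larger forward passes, amortizing the per-call overhead.

```python
import torch
from torch.utils.data import DataLoader

model = torch.nn.Linear(16, 16)  # stand-in for the LLM
dataset = torch.randn(512, 16)   # stand-in for 512 calibration samples

def run_calibration(batch_size: int) -> None:
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    with torch.no_grad():
        for batch in loader:     # 512 iterations at batch_size=1, only 4 at 128
            model(batch)

run_calibration(batch_size=1)    # previous behaviour: one forward pass per sample
run_calibration(batch_size=128)  # batched: amortizes the per-call overhead
```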
In private discussions during the PR review it was decided to extend the use of batched inputs to the other algorithms in `brevitas_examples/llm/main.py` that could directly support it.

Changes Made in this PR
- Added `--calibration-batchsize` as a new argument for LLM experiments, allowing the user to choose the batch size for the forward pass of the LLM in the GPXQ/Qronos algorithms as well as in the other algorithms supporting batched inputs. The default value is 1 to mimic the existing setup.
- ~~Added a function in `llm_quant/data_utils.py` to build a DataLoader from a `DatasetToDevice`, which is the class currently used to handle the calibration data.~~ Deleted after PR review; this is now done directly in `main.py`.
- Added a collate function in `llm_quant/data_utils.py` that is needed to create the DataLoader (see the sketch after this list).
- In `main.py`, added code to instantiate and use the DataLoader. The data loader is stored in the variable `calibration_loader`, and the dataset (previously named `calibration_loader`) is now named `calibration_dataset`. Similarly, the variable `validation_dataloader` (which was actually storing a dataset) has been renamed to `validation_dataset`.
- In `src/brevitas_examples/llm/llm_quant/rotation_optimization.py`, removed the `collate_fn` to avoid duplicated code and instead added a new function named `data_collator` that conforms to Hugging Face's interface and internally relies on the collate function this PR adds in `llm_quant/data_utils.py`.
- Added a test in `tests/brevitas_examples/test_llm_data.py` to ensure the DataLoader is built correctly.
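The PR description does not include the code itself, so the following is only a hedged sketch of what a collate function for dict-of-tensor samples and the DataLoader construction might look like; `ToyCalibrationDataset` and the `(1, seq_len)` sample layout are assumptions for illustration, not the actual `DatasetToDevice` implementation.

```python
from typing import Dict, List

import torch
from torch.utils.data import DataLoader, Dataset

class ToyCalibrationDataset(Dataset):
    """Stand-in for the real calibration data: each sample is a dict of tensors."""

    def __init__(self, num_samples: int = 8, seq_len: int = 16) -> None:
        self.samples = [{
            "input_ids": torch.randint(0, 100, (1, seq_len)),
            "attention_mask": torch.ones(1, seq_len, dtype=torch.long),
        } for _ in range(num_samples)]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        return self.samples[idx]

def collate_fn(samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Concatenate each field along the batch dimension; assumes all samples
    # share the same sequence length.
    return {key: torch.cat([s[key] for s in samples], dim=0) for key in samples[0]}

calibration_loader = DataLoader(
    ToyCalibrationDataset(),
    batch_size=4,  # in the PR this would come from --calibration-batchsize
    shuffle=False,
    collate_fn=collate_fn)

batch = next(iter(calibration_loader))
assert batch["input_ids"].shape == (4, 16)
```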
Warning

Had to edit the quantized perplexity values in some test cases because, for PyTorch-internal reasons, the random state changes after calling `iter(calibration_loader)` (even when `shuffle` is False and everything in the data loader is set to be deterministic). The change in the random state affected some algorithms since, for example, random matrices may be used. A short reproduction is sketched below.
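The effect can be reproduced in a few lines: creating a DataLoader iterator draws a base seed from PyTorch's global RNG, advancing its state even with `shuffle=False`. The `generator=` variant at the end is an assumption about how the effect could be isolated, not something this PR does.

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(list(range(8)), batch_size=4, shuffle=False)

before = torch.get_rng_state()
_ = iter(loader)  # creating the iterator draws a base seed from the global RNG
after = torch.get_rng_state()
print(torch.equal(before, after))  # False: the global random state has changed

# Assumed workaround (not applied in this PR): give the DataLoader its own
# generator so the base-seed draw does not touch the global RNG.
loader = DataLoader(
    list(range(8)), batch_size=4, shuffle=False,
    generator=torch.Generator().manual_seed(0))
before = torch.get_rng_state()
_ = iter(loader)
print(torch.equal(before, torch.get_rng_state()))  # True: global RNG untouched
```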
Testing Summary

- The tests in `tests/brevitas_examples/test_llm_data.py`, including the new test for the DataLoader, were run locally.
- Results for `HuggingFaceTB/SmolLM-135M` with weight-only quantization to int4 on the `wikitext2` dataset: