Conversation

@JP-Amboage (Collaborator) commented Dec 2, 2025

Reason for this PR

Currently in brevitas_examples/llm, when GPTQ, GPFQ, or Qronos is applied, a separate LLM forward pass is run for each sample in the calibration data. Performing these forward passes per batch instead of per sample saves considerable time when running these algorithms.

For example, applying GPTQ to Llama-3.2-1B with the current setup and a calibration size of 512 takes around 1h, whereas the same configuration with the calibration data processed in batches of 128 samples reduces the runtime to ~0.5h (similar or greater speedups were observed for Qronos and GPFQ).

In private discussions during the PR review it was decided to extend the use of batched inputs to the other algorithms in brevitas_examples/llm/main.py that could directly support it.

Changes Made in this PR

  • Added --calibration-batchsize as a new argument for LLM experiments, allowing the user to choose the batch size for the forward pass of the LLM in the GPXQ/Qronos algorithms as well as in the other algorithms that support batched inputs. The default value is 1 to mimic the existing setup.
  • Added a function in llm_quant/data_utils.py to build a DataLoader from a DatasetToDevice, the class currently used to handle the calibration data. Removed after the PR review: this is now done directly in main.py (see the sketch after this list).
  • Added a custom collate function in llm_quant/data_utils.py that is needed to create the DataLoader.
  • In main.py, added code to instantiate and use the DataLoader. The data loader is stored in the variable calibration_loader, and the dataset (previously named calibration_loader) is now named calibration_dataset. Similarly, the variable validation_dataloader (which was actually storing a dataset) has been renamed to validation_dataset.
  • Corrected type hints in some methods that require a dataset but were typed as DataLoader.
  • In src/brevitas_examples/llm/llm_quant/rotation_optimization.py, removed the collate_fn to avoid duplicated code and added instead a new function named data_collator conforming to Hugging Face's interface, which internally relies on the collate function that this PR adds in llm_quant/data_utils.py.
  • Added a unit test to tests/brevitas_examples/test_llm_data.py to ensure the DataLoader is built correctly.
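For illustration, here is a minimal sketch of the collate function, the Hugging Face-style wrapper, and the DataLoader construction described above; the sample structure, variable names, and stand-in data are assumptions for the sketch, not the exact Brevitas implementation:

```python
from typing import Dict, List

import torch
from torch.utils.data import DataLoader


def collate_fn(samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Each sample is assumed to be a dict of (1, seq_len) tensors
    # (e.g. "input_ids", "attention_mask"); concatenate along the batch
    # dimension so the model sees a single (batch_size, seq_len) input.
    return {key: torch.cat([sample[key] for sample in samples], dim=0) for key in samples[0]}


def data_collator(features: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Thin wrapper with the signature Hugging Face expects of a data
    # collator, delegating to the shared collate function.
    return collate_fn(features)


# Stand-in calibration data: 8 samples of shape (1, 16) each.
calibration_dataset = [{
    "input_ids": torch.randint(0, 100, (1, 16)),
    "attention_mask": torch.ones(1, 16, dtype=torch.long)} for _ in range(8)]

# shuffle=False preserves the original per-sample processing order.
calibration_loader = DataLoader(
    calibration_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)

for batch in calibration_loader:
    print(batch["input_ids"].shape)  # torch.Size([4, 16])
```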

Warning

Had to edit the quantized perplexity values in some test cases: due to PyTorch internals, the random state changes after calling iter(calibration_loader), even when shuffle is False and the data loader is configured to be fully deterministic. The change in random state affects some algorithms, for example those that use random matrices.
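A small, self-contained reproduction of this effect, assuming no explicit generator is passed to the DataLoader (in which case, to our understanding, PyTorch draws the iterator's base seed from the default RNG):

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(list(range(8)), batch_size=4, shuffle=False)

torch.manual_seed(0)
before = torch.rand(1)

torch.manual_seed(0)
_ = iter(loader)  # internally draws a base seed from the default RNG
after = torch.rand(1)

print(before.item() == after.item())  # False: iter() consumed random state
```

Passing an explicit torch.Generator to the DataLoader would isolate this draw from the global RNG; here the test reference values were updated instead.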

Testing Summary

  • Tests in tests/brevitas_examples/test_llm_data.py, including the new test for the DataLoader, were run locally.
  • A Python debugger was used to verify that the Hessian matrix built in the batched version matched the values produced by the original code. Small discrepancies were found; in a deterministic environment with the Hessian computed in float64, no differences were observed (see the sketch after the table below).
  • The additional sanity checks performed following the PR review gave the following results for HuggingFaceTB/SmolLM-135M with weight-only quantization to int4 on the wikitext2 dataset:
| Method | Batch Size | Run Time (s) | Quantized Perplexity |
| ------ | ---------- | ------------ | -------------------- |
| GPTQ   | 1          | 7249         | 19.125               |
| GPTQ   | 64         | 1059         | 19.125               |
| GPFQ   | 1          | 29650        | 19.500               |
| GPFQ   | 64         | 1880         | 19.500               |
| Qronos | 1          | 30073        | 18.875               |
| Qronos | 64         | 1957         | 18.875               |
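A self-contained sketch of the equivalence check described above (not the Brevitas code): a GPTQ-style Hessian, proportional to the sum of x xᵀ over calibration samples, accumulated per sample versus per batch, in float64. The dimensions and data are made up for the sketch.

```python
import torch

torch.manual_seed(0)
# Stand-in activations: 512 calibration samples with 64 features each.
X = torch.randn(512, 64, dtype=torch.float64)

# Per-sample accumulation, as in the original code path.
H_per_sample = torch.zeros(64, 64, dtype=torch.float64)
for x in X:
    H_per_sample += torch.outer(x, x)

# Batched accumulation, as in this PR (batches of 128 samples).
H_batched = torch.zeros(64, 64, dtype=torch.float64)
for batch in X.split(128):
    H_batched += batch.T @ batch

# In float64 the two accumulation orders agree; in float32 tiny
# order-of-summation discrepancies can appear, as noted above.
print(torch.allclose(H_per_sample, H_batched))  # True
```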

@pablomlago pablomlago self-requested a review December 2, 2025 14:03
JP-Amboage and others added 3 commits December 2, 2025 14:10
Co-authored-by: Pablo Monteagudo Lago <44771380+pablomlago@users.noreply.github.com>
Co-authored-by: Pablo Monteagudo Lago <44771380+pablomlago@users.noreply.github.com>
Co-authored-by: Pablo Monteagudo Lago <44771380+pablomlago@users.noreply.github.com>
@pablomlago pablomlago requested a review from Giuseppe5 December 2, 2025 14:36
@pablomlago (Collaborator) left a comment

LGTM, I'll wait for @Giuseppe5 to review it before merging.

@Giuseppe5 (Collaborator) commented:

Update the description above to match the new version.
Also, since we had to change the results of the tests, would you mind running some sanity checks, comparing a few (max 2-3) configurations with batch size = 1 and batch size > 1?
I would run all the quantization configurations that benefit from a higher batch size.

You could paste the results in the PR description above
