I'm having difficulty quantizing the Qwen3 VL 30B-A3B MoE model. I'm using 4x A100 GPUs with 80 GB VRAM each, but I keep hitting a CUDA out-of-memory error.
From the logs and from monitoring GPU VRAM usage, I can see that the VRAM of one GPU is exhausted during the "start to cache block inputs" step. The run then fails with the CUDA OOM error, followed by RuntimeError: Expected all tensors to be on the same device, but got index is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__index_select).
During this stage only one GPU is actually used, while the other three have barely any VRAM occupied. I was hoping someone could point me in the right direction.
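For reference, per-GPU usage can be checked with something like the following (an illustrative sketch only; torch.cuda.mem_get_info reports free and total bytes per device, and nvidia-smi shows the same numbers):

# Illustrative sketch: print used/total VRAM for every visible GPU.
# torch.cuda.mem_get_info(i) returns (free_bytes, total_bytes) for device i.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    used_gib = (total - free) / 2**30
    total_gib = total / 2**30
    print(f"cuda:{i}: {used_gib:.1f} GiB used / {total_gib:.1f} GiB total")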
Command I used:
CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-VL-30B-A3B-Instruct" --scheme "W4A16" --format auto_round --output_dir ./output_llmcompressor --device_map "auto" --enable_torch_compile
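Side note: the OOM message in the logs below suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. With the CLI command above that would simply be exported in the shell before running auto-round; a minimal Python sketch of the same idea (assumption: the variable must be in the environment before CUDA is first initialized):

import os

# Assumed workaround taken from the allocator hint in the OOM message below;
# it only takes effect if set before the first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the caching allocator picks it up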
Logs:
2026-01-01 16:27:26 INFO __main__.py L529: `torch.compile` is enabled to reduce tuning costs. If it causes issues, you can disable it by removing `--enable_torch_compile` argument.
2026-01-01 16:27:26 INFO __main__.py L537: start to quantize Qwen/Qwen3-VL-30B-A3B-Instruct
2026-01-01 16:27:26 INFO autoround.py L158: using MLLM mode for multimodal model.
2026-01-01 16:27:27 WARNING modeling_utils.py L4670: `torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 128.74it/s]
2026-01-01 16:27:36 INFO base.py L388: using torch.bfloat16 for quantization tuning
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.0.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.0.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.1.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.1.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.2.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.2.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.3.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.3.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.4.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.4.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.5.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.5.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.6.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.6.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.7.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.7.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.8.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.8.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.9.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.9.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.10.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.10.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.11.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.11.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.12.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.12.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.13.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.13.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.14.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.14.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.15.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.15.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.16.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.16.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.17.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.17.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.18.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.18.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.19.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.19.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.20.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.20.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.21.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.21.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.22.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.22.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.23.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.23.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.24.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.24.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.25.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.25.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.26.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.26.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 INFO formats.py L120: change `scale_dtype` to `torch.float32` for gguf format
2026-01-01 16:27:38 INFO replace_modules.py L164: Found 48 modules to replace
Replacing modules: 100%|███████████████████████████████████████████████████████████████████████████████████| 48/48 [02:54<00:00, 3.64s/it]
2026-01-01 16:30:33 INFO replace_modules.py L178: Replaced 48 modules
2026-01-01 16:30:36 INFO base.py L1451: start to cache block inputs
2026-01-01 16:31:33 INFO base.py L1953: switch to cpu to cache block inputs
cache block inputs: 0%| | 0/128 [00:00<?, ?it/s]
2026-01-01 16:31:53 ERROR base.py L1970: Traceback (most recent call last):
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1929, in try_cache_inter_data_gpucpu
self.model = dispatch_model(self.model, device_map=device_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
[Previous line repeated 1 more time]
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 634, in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in add_hook_to_module
module = hook.init_hook(module)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 288, in init_hook
set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/utils/modeling.py", line 335, in set_module_tensor_to_device
new_value = old_value.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 10.75 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 71.13 GiB is allocated by PyTorch, and 7.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1929, in try_cache_inter_data_gpucpu
self.model = dispatch_model(self.model, device_map=device_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
attach_align_device_hook_on_blocks(
[Previous line repeated 1 more time]
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 634, in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in add_hook_to_module
module = hook.init_hook(module)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 288, in init_hook
set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/utils/modeling.py", line 335, in set_module_tensor_to_device
new_value = old_value.to(device, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 10.75 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 71.13 GiB is allocated by PyTorch, and 7.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/akk/yh/uvenv/quantization/bin/auto-round", line 10, in <module>
sys.exit(run())
^^^^^
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 911, in run
start()
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 485, in start
tune(args)
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 680, in tune
model, folders = autoround.quantize_and_save(export_dir, format=args.format) # pylint: disable=E1101
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 818, in quantize_and_save
model, _ = self.quantize()
^^^^^^^^^^^^^^^
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1452, in quantize
all_inputs = self.try_cache_inter_data_gpucpu(all_first_block_names, self.nsamples, layer_names=layer_names)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1966, in try_cache_inter_data_gpucpu
all_inputs = self.cache_inter_data(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 2014, in cache_inter_data
self.calib(nsamples, calib_bs)
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/mllm/compressor.py", line 411, in calib
raise error
File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/mllm/compressor.py", line 407, in calib
self.model(**data_new)
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/utils/generic.py", line 1072, in wrapper
outputs = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 1601, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/utils/generic.py", line 1072, in wrapper
outputs = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 1298, in forward
inputs_embeds = self.get_input_embeddings()(input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 192, in forward
return F.embedding(
^^^^^^^^^^^^
File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/functional.py", line 2542, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__index_select)
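For what it's worth, the second error can be reproduced in isolation: it is exactly what F.embedding raises when the embedding weight has been left on CPU (offloaded) while the input ids are on cuda:0. A minimal sketch with hypothetical shapes, unrelated to the actual model:

# Minimal repro of the device-mismatch error: weight on CPU, indices on cuda:0.
import torch
import torch.nn.functional as F

weight = torch.randn(100, 16)                          # embedding table left on CPU
input_ids = torch.tensor([1, 2, 3], device="cuda:0")   # token ids on GPU 0

# Raises: RuntimeError: Expected all tensors to be on the same device, but got
# index is on cuda:0, different from other tensors on cpu ...
F.embedding(input_ids, weight)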