
Failed to quantize Qwen3 VL MoE #1211

@noobHappylife

Description

I'm having difficulty quantizing the Qwen3 VL 30B-A3B MoE model. I'm using 4x A100 GPUs with 80 GB VRAM each, but I'm still encountering a CUDA out-of-memory error.

From the logs and from monitoring GPU VRAM usage, I noticed that the VRAM of one GPU is exhausted during the "start to cache block inputs" step. It then hits the CUDA OOM error and RuntimeError: Expected all tensors to be on the same device, but got index is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__index_select).

During this stage, I notice that only one GPU is really used, while the other three barely have any VRAM occupied. I was hoping someone could point me in the right direction.
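
For anyone reproducing this: per-GPU memory can be watched with an nvidia-smi query like the one below, refreshing once per second. This is only an illustrative monitoring command, not part of the failing run.

# poll per-GPU memory usage once per second (illustrative; any monitoring tool works)
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1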

Command I used:

CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-VL-30B-A3B-Instruct" --scheme "W4A16" --format auto_round --output_dir ./output_llmcompressor --device_map "auto" --enable_torch_compile
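
The OOM message in the trace below suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, and the first log line notes that torch.compile can be disabled by dropping --enable_torch_compile. A variant I may try next (same command, just with that environment variable and without torch compile), listed here for reference only:

# same run, with the allocator hint from the OOM message and without torch.compile
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-VL-30B-A3B-Instruct" --scheme "W4A16" --format auto_round --output_dir ./output_llmcompressor --device_map "auto"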

Logs:

2026-01-01 16:27:26 INFO __main__.py L529: `torch.compile` is enabled to reduce tuning costs. If it causes issues, you can disable it by removing `--enable_torch_compile` argument.
2026-01-01 16:27:26 INFO __main__.py L537: start to quantize Qwen/Qwen3-VL-30B-A3B-Instruct
2026-01-01 16:27:26 INFO autoround.py L158: using MLLM mode for multimodal model.
2026-01-01 16:27:27 WARNING modeling_utils.py L4670: `torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 128.74it/s]
2026-01-01 16:27:36 INFO base.py L388: using torch.bfloat16 for quantization tuning
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.0.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.0.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.1.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.1.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.2.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.2.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.3.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.3.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.4.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.4.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.5.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.5.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.6.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.6.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.7.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.7.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.8.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.8.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.9.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.9.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.10.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.10.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.11.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.11.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.12.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.12.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.13.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.13.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.14.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.14.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.15.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.15.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.16.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.16.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.17.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.17.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.18.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.18.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.19.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.19.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.20.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.20.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.21.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.21.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.22.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.22.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.23.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.23.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.24.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.24.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.25.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.25.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.26.mlp.linear_fc1 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 WARNING formats.py L142: model.visual.blocks.26.mlp.linear_fc2 skipped quantization (shape not divisible by 32).
2026-01-01 16:27:38 INFO formats.py L120: change `scale_dtype` to `torch.float32` for gguf format
2026-01-01 16:27:38 INFO replace_modules.py L164: Found 48 modules to replace
Replacing modules: 100%|███████████████████████████████████████████████████████████████████████████████████| 48/48 [02:54<00:00,  3.64s/it]
2026-01-01 16:30:33 INFO replace_modules.py L178: Replaced 48 modules
2026-01-01 16:30:36 INFO base.py L1451: start to cache block inputs
2026-01-01 16:31:33 INFO base.py L1953: switch to cpu to cache block inputs
cache block inputs:   0%|                                                                                          | 0/128 [00:00<?, ?it/s]
2026-01-01 16:31:53 ERROR base.py L1970: Traceback (most recent call last):
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1929, in try_cache_inter_data_gpucpu
    self.model = dispatch_model(self.model, device_map=device_map)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 1 more time]
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 634, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in add_hook_to_module
    module = hook.init_hook(module)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 288, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/utils/modeling.py", line 335, in set_module_tensor_to_device
    new_value = old_value.to(device, non_blocking=non_blocking)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 10.75 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 71.13 GiB is allocated by PyTorch, and 7.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Traceback (most recent call last):
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1929, in try_cache_inter_data_gpucpu
    self.model = dispatch_model(self.model, device_map=device_map)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 676, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 1 more time]
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 634, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 166, in add_hook_to_module
    module = hook.init_hook(module)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 288, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/utils/modeling.py", line 335, in set_module_tensor_to_device
    new_value = old_value.to(device, non_blocking=non_blocking)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 10.75 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 71.13 GiB is allocated by PyTorch, and 7.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/akk/yh/uvenv/quantization/bin/auto-round", line 10, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 911, in run
    start()
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 485, in start
    tune(args)
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/__main__.py", line 680, in tune
    model, folders = autoround.quantize_and_save(export_dir, format=args.format)  # pylint: disable=E1101
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 818, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1452, in quantize
    all_inputs = self.try_cache_inter_data_gpucpu(all_first_block_names, self.nsamples, layer_names=layer_names)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 1966, in try_cache_inter_data_gpucpu
    all_inputs = self.cache_inter_data(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/base.py", line 2014, in cache_inter_data
    self.calib(nsamples, calib_bs)
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/mllm/compressor.py", line 411, in calib
    raise error
  File "/home/akk/yh/quantize_autoround/auto-round/auto_round/compressors/mllm/compressor.py", line 407, in calib
    self.model(**data_new)
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/utils/generic.py", line 1072, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 1601, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/utils/generic.py", line 1072, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 1298, in forward
    inputs_embeds = self.get_input_embeddings()(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/home/akk/yh/uvenv/quantization/lib/python3.12/site-packages/torch/nn/functional.py", line 2542, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__index_select)

Labels: bug (Something isn't working)