GPU utilization does not reach 100% in DCGM metrics when running GPU Burn on A100 MIG (4g.40gb; 2g.20gb) #2022

@dogukan1047

Description

Problem

We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:

  • 1x 4g.40gb
  • 1x 2g.20gb
  • 1x 1g.20gb

We are using the gpu-burn container image to fully stress the 4g.40gb MIG instance and expect to observe ~100% GPU utilization in DCGM metrics. However, although the pod runs successfully, GPU utilization does not reach 100% in DCGM exporter/monitoring metrics.

MIG Configuration

mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1

Pod Manifest

apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"        # Tensor Core enabled
        - "-d"         # Double Precision enabled
        - "14400"      # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"

Expected Behavior

  • While gpu-burn is running:
    • DCGM metrics should show ~100% GPU utilization for the allocated MIG instance

Actual Behavior

  • Pod starts and gpu_burn runs successfully
  • However:
    • GPU utilization in DCGM metrics appears low
    • It never reaches ~100%

Questions / Suspicions

  1. Does DCGM exporter report utilization differently for MIG devices?
  2. Are DCGM metrics calculated at physical GPU level instead of per-MIG instance?
  3. Is a multi-process workload or additional flags required?
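
To make question 2 concrete, here is a minimal sketch of what a saturated MIG instance would read if the metric were normalized to the whole physical GPU rather than to the instance. This is only a hypothesis about the behavior, not confirmed DCGM semantics. It assumes an A100 exposes 7 compute slices in MIG mode and that a GPU-level reading is weighted by the instance's share of those slices; the function names are hypothetical.

```python
from fractions import Fraction

# Assumption: an A100 in MIG mode exposes 7 compute slices in total.
TOTAL_COMPUTE_SLICES = 7


def compute_slices(profile: str) -> int:
    """Parse the compute-slice count from a MIG profile name such as '4g.40gb'."""
    return int(profile.split("g.", 1)[0])


def gpu_level_reading(profile: str, instance_activity: float) -> float:
    """Activity a whole-GPU metric would show under the hypothesized slice weighting.

    instance_activity is the activity inside the MIG instance, from 0.0 to 1.0.
    """
    share = Fraction(compute_slices(profile), TOTAL_COMPUTE_SLICES)
    return float(share) * instance_activity


if __name__ == "__main__":
    # Each instance fully busy (1.0), reported as a percentage of the whole GPU.
    for profile in ("4g.40gb", "2g.20gb", "1g.20gb"):
        percent = round(gpu_level_reading(profile, 1.0) * 100, 1)
        print(f"{profile}: {percent}%")
```

Under these assumptions, a fully loaded 4g.40gb instance would show roughly 57% at the physical GPU level, which would look like "low" utilization even though the instance itself is saturated.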

Additional Context

  • GPU: NVIDIA A100
  • MIG: Enabled
  • Environment: Kubernetes with NVIDIA device plugin
  • Monitoring: DCGM Exporter

Screenshots

Image
