Description
What happened:
worker:
  replicas: 1
  vgpu:
    enabled: true  # <- added this line to enable HAMi vGPU
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      nvidia.com/gpu: "4"
      nvidia.com/gpucores: 10          # use 10% of each GPU's cores
      nvidia.com/gpumem-percentage: 5  # use 5% of each GPU's memory
  ports:
    - containerPort: 30001
      protocol: TCP
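For reference, the same HAMi vGPU request expressed on a minimal standalone Pod (one way to check whether the problem is in the scheduler itself rather than in the cluster spec above) would look roughly like the sketch below; the pod name and image are placeholders, not taken from the actual deployment:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-test                        # hypothetical name, only for isolating the scheduling issue
spec:
  containers:
    - name: worker
      image: xprobe/xinference:latest    # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "4"                # same whole-vGPU count as the worker above
          nvidia.com/gpucores: "10"          # 10% of GPU cores per vGPU
          nvidia.com/gpumem-percentage: "5"  # 5% of GPU memory per vGPU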
Command:
  xinference-worker
Args:
  -e
  http://service-supervisor:9997
  --host
  $(POD_IP)
  --worker-port
  30001
  --log-level
  debug
Limits:
  nvidia.com/gpu:               4
  nvidia.com/gpucores:          10
  nvidia.com/gpumem-percentage: 5
Requests:
  cpu:                          2
  memory:                       8Gi
  nvidia.com/gpu:               4
  nvidia.com/gpucores:          10
  nvidia.com/gpumem-percentage: 5
Environment:
  POD_IP:                           (v1:status.podIP)
  XINFERENCE_PROCESS_START_METHOD:  <set to the key 'XINFERENCE_PROCESS_START_METHOD' of config map 'xinferencecluster-distributed-config'>  Optional: false
  XINFERENCE_MODEL_SRC:             <set to the key 'XINFERENCE_MODEL_SRC' of config map 'xinferencecluster-distributed-config'>  Optional: false
  XINFERENCE_HOME:                  <set to the key 'XINFERENCE_HOME' of config map 'xinferencecluster-distributed-config'>  Optional: false
Mounts:
  /dev/shm from shm (rw)
  /opt/xinference from xinference-home (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-flwq4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  xinference-home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  xinferencecluster-distributed-home-pvc
    ReadOnly:   false
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  128Gi
  kube-api-access-flwq4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/arch=amd64
Tolerations:     node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age  From            Message
  Warning  FailedScheduling  67s  hami-scheduler  0/4 nodes are available: 1 NodeUnfitPod, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 2 node unregistered. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
  Warning  FilteringFailed   67s  hami-scheduler  no available node, 3 nodes do not meet
With nvidia.com/gpu: "4" the pod fails to schedule; whenever more than two GPUs are requested, this error is reported (a reduced request is sketched below).
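For comparison, a limits block that stays at or below two GPUs (the other vGPU values unchanged) does not trigger the error, per the observation above. This is only a sketch of the reduced request, not a fix:

        limits:
          nvidia.com/gpu: "2"              # at or below two GPUs schedules, per this report
          nvidia.com/gpucores: 10          # unchanged
          nvidia.com/gpumem-percentage: 5  # unchanged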
What you expected to happen: vGPU is enabled and allocated normally.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
- The output of nvidia-smi -a on your host
- Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version:
- nvidia driver or other AI device driver version:
- Docker version from docker version
- Docker command, image and tag used
- Kernel version from uname -a
- Others: