1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/4 nodes are available: 1 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.. #1528

Description

What happened:
```yaml
worker:
  replicas: 1
  vgpu:
    enabled: true # <- newly added line to enable HAMi vGPU
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      nvidia.com/gpu: "4"
      nvidia.com/gpucores: 10 # use 10% of the GPU cores
      nvidia.com/gpumem-percentage: 5 # use 5% of the GPU memory
  ports:
    - containerPort: 30001
      protocol: TCP
```
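For context: as I understand HAMi's resource semantics, these limits apply per container, with nvidia.com/gpu giving the number of vGPUs, nvidia.com/gpucores the percentage of compute each vGPU may use, and nvidia.com/gpumem-percentage the percentage of device memory each vGPU may use. A minimal standalone pod sketch with the same limits (pod name and image are placeholders, not taken from this cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-smoke-test # hypothetical name, for illustration only
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04 # any CUDA-capable image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: "4"               # number of vGPUs requested
          nvidia.com/gpucores: "10"         # 10% of compute per vGPU
          nvidia.com/gpumem-percentage: "5" # 5% of device memory per vGPU
```

If a bare pod like this also fails to schedule, the problem is independent of the xinference chart.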

```
Command:
  xinference-worker
Args:
  -e
  http://service-supervisor:9997
  --host
  $(POD_IP)
  --worker-port
  30001
  --log-level
  debug
Limits:
  nvidia.com/gpu:                4
  nvidia.com/gpucores:           10
  nvidia.com/gpumem-percentage:  5
Requests:
  cpu:                           2
  memory:                        8Gi
  nvidia.com/gpu:                4
  nvidia.com/gpucores:           10
  nvidia.com/gpumem-percentage:  5
Environment:
  POD_IP:                           (v1:status.podIP)
  XINFERENCE_PROCESS_START_METHOD:  <set to the key 'XINFERENCE_PROCESS_START_METHOD' of config map 'xinferencecluster-distributed-config'>  Optional: false
  XINFERENCE_MODEL_SRC:             <set to the key 'XINFERENCE_MODEL_SRC' of config map 'xinferencecluster-distributed-config'>  Optional: false
  XINFERENCE_HOME:                  <set to the key 'XINFERENCE_HOME' of config map 'xinferencecluster-distributed-config'>  Optional: false
Mounts:
  /dev/shm from shm (rw)
  /opt/xinference from xinference-home (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-flwq4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  xinference-home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  xinferencecluster-distributed-home-pvc
    ReadOnly:   false
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  128Gi
  kube-api-access-flwq4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/arch=amd64
Tolerations:     node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age  From            Message
  ----     ------            ---  ----            -------
  Warning  FailedScheduling  67s  hami-scheduler  0/4 nodes are available: 1 NodeUnfitPod, 1 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 2 node unregistered. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
  Warning  FilteringFailed   67s  hami-scheduler  no available node, 3 nodes do not meet
```
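Reading the events, two things appear to block scheduling independently: one node carries a node.kubernetes.io/unreachable taint (the pod tolerates only the NoExecute variant, so a NoSchedule taint on an unreachable node would still exclude it), and two nodes are reported as unregistered with hami-scheduler. A minimal diagnostic sketch, assuming plain kubectl access (<node-name> is a placeholder):

```shell
# List the taints on every node and compare them with the pod's tolerations.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check how many nvidia.com/gpu devices each node actually advertises.
# A limit of nvidia.com/gpu: "4" only fits on a node whose allocatable count
# is at least 4 (with HAMi, typically physical GPUs multiplied by the
# configured deviceSplitCount).
kubectl describe node <node-name> | grep -A 10 Allocatable
```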

nvidia.com/gpu: "4" If there are more than two, an error will be reported
What you expected to happen: Enable and allocate VGPU normally

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg
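
A sketch for collecting the items above, assuming HAMi was installed into kube-system with its default Helm chart; the exact kubectl resource names depend on the Helm release and are assumptions here:

```shell
# Host GPU and driver state
nvidia-smi -a

# Container runtime configuration (pick the file matching your runtime)
cat /etc/docker/daemon.json
cat /etc/containerd/config.toml

# HAMi component logs (names assume the default Helm install)
kubectl logs -n kube-system ds/hami-device-plugin
kubectl logs -n kube-system deploy/hami-scheduler --all-containers

# kubelet logs on the affected node
sudo journalctl -r -u kubelet

# Kernel messages related to the NVIDIA driver
dmesg | grep -iE 'nvidia|nvrm|xid'
```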

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug