Labels: kind/bug (Something isn't working)
Description
After 4 Pods have started, any additional Pod ends up in the UnexpectedAdmissionError state.
`kubectl describe pod` shows:
Annotations: hami.io/bind-phase: allocating
hami.io/bind-time: 1766199067
hami.io/vgpu-node: xxxhost.node
hami.io/vgpu-time: 1766199067
hami.io/xpu-devices-allocated: 00000000-03-00.0,XPU,24576,25:;
hami.io/xpu-devices-to-allocate: 00000000-03-00.0,XPU,24576,25:;
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod was rejected: Allocate failed due to rpc error: code = Unknown desc = empty id from pod annotation, which is unexpected
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 57s hami-scheduler Successfully assigned default/xpu-test-644668fd7b-dgcff to xxxhost.node
Normal FilteringSucceed 58s hami-scheduler find fit node(xxxhost.node), 0 nodes not fit, 1 nodes fit(xxxhost.node:9.69)
Normal BindingSucceed 58s hami-scheduler Successfully binding node [xxxhost.node] to default/xpu-test-644668fd7b-dgcff
Warning UnexpectedAdmissionError 58s kubelet Allocate failed due to rpc error: code = Unknown desc = empty id from pod annotation, which is unexpected
xpu-device-plugin Logs:
I1220 02:51:11.756934 248 server.go:256] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[11-2],},},}
I1220 02:51:11.760858 248 util.go:130] node annotation key is hami.io/mutex.lock, value is 2025-12-20T02:51:11Z,default,xpu-test-644668fd7b-ndgp2
I1220 02:51:11.762571 248 util.go:351] checklist is [map[XPU:hami.io/xpu-devices-to-allocate]], annos is [map[hami.io/bind-phase:allocating hami.io/bind-time:1766199070 hami.io/vgpu-node:xxxhost.node hami.io/vgpu-time:1766199069 hami.io/xpu-devices-allocated:00000000-03-00.0,XPU,24576,25:; hami.io/xpu-devices-to-allocate:00000000-03-00.0,XPU,24576,25:;]]
I1220 02:51:11.762596 248 util.go:326] Start to decode container device 00000000-03-00.0,XPU,24576,25:
I1220 02:51:11.762603 248 util.go:346] Finished decoding container devices. Total devices: 1
I1220 02:51:11.762609 248 util.go:373] Decoded pod annos poddevices map[XPU:[[{0 00000000:03:00.0 XPU 24576 25}]]]
I1220 02:51:11.762620 248 server_hami.go:171] pod annotation decode vaule is map[XPU:[[{Idx:0 UUID:00000000:03:00.0 Type:XPU Usedmem:24576 Usedcores:25}]]]
2025/12/20 02:51:11 has numa is: false, numa node is: 0
(the line above repeats 14 times in total)
I1220 02:51:11.958755 248 xpu.go:193] device set sriov vf num, busID: 00000000:03:00.0, mode: 2, dmemory: 98304, cmemory: 24576, free: %!b(bool=false), split: 4
I1220 02:51:11.958774 248 server_hami.go:144] uuid not match xpuId: 00000000:63:00.0, containerId: 00000000:03:00
I1220 02:51:11.958781 248 server_hami.go:144] uuid not match xpuId: 00000000:65:00.0, containerId: 00000000:03:00
I1220 02:51:11.958786 248 server_hami.go:144] uuid not match xpuId: 00000000:83:00.0, containerId: 00000000:03:00
I1220 02:51:11.958791 248 server_hami.go:144] uuid not match xpuId: 00000000:85:00.0, containerId: 00000000:03:00
I1220 02:51:11.958797 248 server_hami.go:144] uuid not match xpuId: 00000000:A3:00.0, containerId: 00000000:03:00
I1220 02:51:11.967755 248 util.go:200] Node lock released, node: xxxhost.node, err: <nil>
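The "uuid not match" lines hint at where the "empty id from pod annotation" error may come from: the decoded annotation UUID is 00000000:03:00.0 (see the "pod annotation decode" line above), but the comparison log prints containerId: 00000000:03:00, with the ".0" function suffix missing, so it can never equal any full PCI bus ID. Below is a minimal sketch of this hypothesis; the parsing rules and function names are assumptions inferred from the log output, not HAMi's actual code.

```python
# Hypothetical reproduction of the bus-ID comparison seen in the
# server_hami.go:144 log lines. Parsing rules are inferred from the
# logs above; this is NOT the actual HAMi implementation.

def decode_annotation_device(entry: str) -> dict:
    """Parse one device entry from hami.io/xpu-devices-to-allocate.

    Observed format: "<busID-with-dashes>,<type>,<mem MiB>,<cores %>"
    e.g. "00000000-03-00.0,XPU,24576,25"
    """
    bus_id, dev_type, mem, cores = entry.split(",")
    # The plugin log shows dashes rewritten to colons:
    # "00000000-03-00.0" -> "00000000:03:00.0"
    return {
        "uuid": bus_id.replace("-", ":", 2),
        "type": dev_type,
        "usedmem": int(mem),
        "usedcores": int(cores),
    }

dev = decode_annotation_device("00000000-03-00.0,XPU,24576,25")
assert dev["uuid"] == "00000000:03:00.0"

# The log prints "containerId: 00000000:03:00" -- the ".0" function
# suffix is gone. Stripping everything after the last "." would do that:
truncated = dev["uuid"].rsplit(".", 1)[0]
assert truncated == "00000000:03:00"

# A full-bus-ID equality check then fails against every real device,
# which would leave Allocate with an "empty id":
xpu_ids = ["00000000:63:00.0", "00000000:65:00.0", "00000000:83:00.0",
           "00000000:85:00.0", "00000000:A3:00.0", "00000000:03:00.0"]
matches = [x for x in xpu_ids if x == truncated]
print(matches)  # [] -> no match
```

If this hypothesis holds, either the suffix truncation or the dash-to-colon rewrite on the plugin side would be the place to look.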
GPU Status:
All eight P800 GPUs appear to be functioning normally.
However, only 14 virtual devices (IDs 0–13) are registered as healthy, fewer than the 32 expected from 8 GPUs with split 4:
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 0
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 1
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 2
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 3
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 4
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 5
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 6
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 7
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 8
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 9
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 10
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 11
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 12
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 13
Pod Resource Request and xpu-smi Output:
The Pod requests only 1 vXPU, yet two physical XPUs (bus IDs 05:00.0 and 85:00.0) are visible inside the container:
resources:
limits:
kunlunxin.com/vxpu: "1"
(py38) root@xxxhost:/workspace# xpu-smi
Sat Dec 20 10:56:03 2025
+-----------------------------------------------------------------------------+
| XPU-SMI Driver Version: 5.0.21.40 XPU-RT Version: 5.0.21 |
|-------------------------------+----------------------+----------------------+
| XPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | XPU-Util Compute M. |
| | L3-Usage | SR-IOV M. |
|===============================+======================+======================|
| 0 P800 OAM N/A | 00000000:05:00.0 N/A | 0 |
| N/A 41C N/A 89W / 400W | 48932MiB / 98304MiB | 0% Default |
| | 64MiB / 96MiB | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 P800 OAM N/A | 00000000:85:00.0 N/A | 0 |
| N/A 41C N/A 87W / 400W | 0MiB / 98304MiB | 0% Default |
| | 0MiB / 96MiB | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| XPU XI CI PID Type Process name XPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 43 C ...onda/envs/py38/bin/python 48996MiB |
+-----------------------------------------------------------------------------+
Versions:
hami: v2.7.1
vxpu-device-plugin: v1.0.0