Problem Found When Using HAMi + P800 GPUs for Resource Sharing #1548

@121812

Description

After 4 Pods have started, any additional Pods end up in the UnexpectedAdmissionError state.
kubectl describe pod shows:

Annotations:      hami.io/bind-phase: allocating
                  hami.io/bind-time: 1766199067
                  hami.io/vgpu-node: xxxhost.node
                  hami.io/vgpu-time: 1766199067
                  hami.io/xpu-devices-allocated: 00000000-03-00.0,XPU,24576,25:;
                  hami.io/xpu-devices-to-allocate: 00000000-03-00.0,XPU,24576,25:;
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to rpc error: code = Unknown desc = empty id from pod annotation, which is unexpected
...
Events:
  Type     Reason                    Age   From            Message
  ----     ------                    ----  ----            -------
  Normal   Scheduled                 57s   hami-scheduler  Successfully assigned default/xpu-test-644668fd7b-dgcff to xxxhost.node
  Normal   FilteringSucceed          58s   hami-scheduler  find fit node(xxxhost.node), 0 nodes not fit, 1 nodes fit(xxxhost.node:9.69)
  Normal   BindingSucceed            58s   hami-scheduler  Successfully binding node [xxxhost.node] to default/xpu-test-644668fd7b-dgcff
  Warning  UnexpectedAdmissionError  58s   kubelet         Allocate failed due to rpc error: code = Unknown desc = empty id from pod annotation, which is unexpected
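
For readers unfamiliar with the annotation value, here is a minimal Python sketch of how a string like `00000000-03-00.0,XPU,24576,25:;` could be decoded. The layout is an assumption inferred from the device-plugin log lines in this report (containers separated by `;`, devices by `:`, fields by `,`, and `-` in the bus ID mapped back to `:`). It is not taken from the HAMi source, and `decode_pod_devices` and `ContainerDevice` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContainerDevice:
    # Field names mirror the decoded value printed by the device plugin log.
    idx: int
    uuid: str
    type: str
    usedmem: int
    usedcores: int


def decode_pod_devices(annotation: str) -> List[List[ContainerDevice]]:
    """Decode an annotation value into one device list per container.

    Assumed layout (inferred from the logs, not from HAMi itself):
    ';' ends a container, ':' ends a device, ',' separates fields.
    """
    containers = []
    for container_part in annotation.split(";"):
        if not container_part:
            continue  # skip the empty piece after the trailing ';'
        devices = []
        for device_part in container_part.split(":"):
            if not device_part:
                continue  # skip the empty piece after the trailing ':'
            uuid, dev_type, mem, cores = device_part.split(",")
            devices.append(
                ContainerDevice(
                    idx=len(devices),             # assumption: position in the list
                    uuid=uuid.replace("-", ":"),  # the log shows ':' in the decoded UUID
                    type=dev_type,
                    usedmem=int(mem),
                    usedcores=int(cores),
                )
            )
        containers.append(devices)
    return containers


if __name__ == "__main__":
    print(decode_pod_devices("00000000-03-00.0,XPU,24576,25:;"))
```

Under these assumptions the annotation describes one container with a single XPU slice of 24576 MiB and 25 cores on bus ID `00000000:03:00.0`.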

xpu-device-plugin Logs:

I1220 02:51:11.756934     248 server.go:256] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[11-2],},},}
I1220 02:51:11.760858     248 util.go:130] node annotation key is hami.io/mutex.lock, value is 2025-12-20T02:51:11Z,default,xpu-test-644668fd7b-ndgp2 
I1220 02:51:11.762571     248 util.go:351] checklist is [map[XPU:hami.io/xpu-devices-to-allocate]], annos is [map[hami.io/bind-phase:allocating hami.io/bind-time:1766199070 hami.io/vgpu-node:xxxhost.node hami.io/vgpu-time:1766199069 hami.io/xpu-devices-allocated:00000000-03-00.0,XPU,24576,25:; hami.io/xpu-devices-to-allocate:00000000-03-00.0,XPU,24576,25:;]]
I1220 02:51:11.762596     248 util.go:326] Start to decode container device 00000000-03-00.0,XPU,24576,25:
I1220 02:51:11.762603     248 util.go:346] Finished decoding container devices. Total devices: 1
I1220 02:51:11.762609     248 util.go:373] Decoded pod annos poddevices map[XPU:[[{0 00000000:03:00.0 XPU 24576 25}]]]
I1220 02:51:11.762620     248 server_hami.go:171] pod annotation decode vaule is map[XPU:[[{Idx:0 UUID:00000000:03:00.0 Type:XPU Usedmem:24576 Usedcores:25}]]]
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
2025/12/20 02:51:11 has numa is: false, numa node is: 0
I1220 02:51:11.958755     248 xpu.go:193] device set sriov vf num, busID: 00000000:03:00.0, mode: 2, dmemory: 98304, cmemory: 24576, free: %!b(bool=false), split: 4
I1220 02:51:11.958774     248 server_hami.go:144] uuid not match xpuId: 00000000:63:00.0, containerId: 00000000:03:00
I1220 02:51:11.958781     248 server_hami.go:144] uuid not match xpuId: 00000000:65:00.0, containerId: 00000000:03:00
I1220 02:51:11.958786     248 server_hami.go:144] uuid not match xpuId: 00000000:83:00.0, containerId: 00000000:03:00
I1220 02:51:11.958791     248 server_hami.go:144] uuid not match xpuId: 00000000:85:00.0, containerId: 00000000:03:00
I1220 02:51:11.958797     248 server_hami.go:144] uuid not match xpuId: 00000000:A3:00.0, containerId: 00000000:03:00
I1220 02:51:11.967755     248 util.go:200] Node lock released, node: xxxhost.node, err: <nil>
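
The `uuid not match` lines can be summarized mechanically. The sketch below is a hypothetical helper, not part of the device plugin. It extracts the candidate `xpuId` values and the `containerId` from the log text, then checks whether any candidate equals or starts with the requested ID. The prefix comparison is an assumption made for illustration; the plugin's real matching rule is not shown in this report.

```python
import re
from typing import List, Tuple

# Matches the message part of the server_hami.go:144 lines quoted in this report.
MISMATCH_RE = re.compile(r"uuid not match xpuId: ([^,\s]+), containerId: ([^,\s]+)")

SAMPLE_LOG = """\
uuid not match xpuId: 00000000:63:00.0, containerId: 00000000:03:00
uuid not match xpuId: 00000000:65:00.0, containerId: 00000000:03:00
uuid not match xpuId: 00000000:83:00.0, containerId: 00000000:03:00
uuid not match xpuId: 00000000:85:00.0, containerId: 00000000:03:00
uuid not match xpuId: 00000000:A3:00.0, containerId: 00000000:03:00
"""


def summarize_mismatches(log_text: str) -> Tuple[List[str], List[str]]:
    """Return (candidate xpuIds in log order, distinct containerIds)."""
    candidates = []
    wanted = set()
    for xpu_id, container_id in MISMATCH_RE.findall(log_text):
        candidates.append(xpu_id)
        wanted.add(container_id)
    return candidates, sorted(wanted)


def find_matches(wanted: str, candidates: List[str]) -> List[str]:
    """Candidates equal to, or starting with, the wanted ID (assumed rule)."""
    return [c for c in candidates if c == wanted or c.startswith(wanted)]


if __name__ == "__main__":
    candidates, wanted = summarize_mismatches(SAMPLE_LOG)
    print("candidates:", candidates)
    print("wanted:", wanted)
    print("matches:", find_matches(wanted[0], candidates))
```

On the excerpt above, the helper finds five candidate IDs and none of them starts with `00000000:03:00`, which is consistent with the `empty id from pod annotation` error returned by Allocate.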

GPU Status:
All eight P800 GPUs appear to be functioning normally.
However, it seems that only 6 virtual devices are being created.

2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 0
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 1
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 2
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 3
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 4
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 5
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 6
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 7
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 8
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 9
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 10
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 11
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 12
2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: 13
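
To compare the number of advertised devices with expectations, the `marked healthy` lines can be tallied per resource name. This is a small hypothetical helper that only parses the log format quoted above. It makes no assumption about how many virtual devices the plugin should create.

```python
import re
from typing import Dict, List

# Matches lines such as: 'kunlunxin.com/vxpu' device marked healthy: 3
HEALTHY_RE = re.compile(r"'([^']+)' device marked healthy: (\d+)")


def healthy_indices(log_text: str) -> Dict[str, List[int]]:
    """Map each resource name to the sorted device indices marked healthy."""
    by_resource: Dict[str, set] = {}
    for resource, idx in HEALTHY_RE.findall(log_text):
        by_resource.setdefault(resource, set()).add(int(idx))
    return {name: sorted(indices) for name, indices in by_resource.items()}


if __name__ == "__main__":
    # Rebuild the excerpt from this report: indices 0 through 13.
    sample = "\n".join(
        f"2025/12/20 02:54:27 'kunlunxin.com/vxpu' device marked healthy: {i}"
        for i in range(14)
    )
    for name, indices in healthy_indices(sample).items():
        print(name, len(indices), indices)
```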

Pod Resource Request and xpu-smi Output:
The Pod requests only 1 device, but xpu-smi inside the container shows 2 devices mapped in:

resources:
  limits:
    kunlunxin.com/vxpu: "1"

(py38) root@xxxhost:/workspace# xpu-smi
Sat Dec 20 10:56:03 2025
+-----------------------------------------------------------------------------+
| XPU-SMI               Driver Version: 5.0.21.40    XPU-RT Version: 5.0.21   |
|-------------------------------+----------------------+----------------------+
| XPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | XPU-Util  Compute M. |
|                               |             L3-Usage |            SR-IOV M. |
|===============================+======================+======================|
|   0  P800 OAM           N/A   | 00000000:05:00.0 N/A |                    0 |
| N/A   41C  N/A     89W / 400W |  48932MiB / 98304MiB |      0%      Default |
|                               |     64MiB /    96MiB |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  P800 OAM           N/A   | 00000000:85:00.0 N/A |                    0 |
| N/A   41C  N/A     87W / 400W |      0MiB / 98304MiB |      0%      Default |
|                               |      0MiB /    96MiB |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  XPU   XI   CI        PID   Type   Process name                  XPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        43      C   ...onda/envs/py38/bin/python    48996MiB |
+-----------------------------------------------------------------------------+

Versions:
hami: v2.7.1
vxpu-device-plugin: v1.0.0

Labels: kind/bug
