-
Notifications
You must be signed in to change notification settings - Fork 444
feat(scheduler): add node nouse gpuuuid function #1206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
I recreate the pr #1154 to add the sign-off info. |
|
please fix the go-lint |
|
~> make verify
Passed import-aliases verification. I have already fixed all issue. |
|
CC @Shouren |
Codecov Report❌ Patch coverage is
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
|
Do we need 100% code coverage to new code? |
We recommend focusing on unit test coverage, at least new code needs to be covered |
|
What is the difference with this function https://github.com/Project-HAMi/HAMi/blob/master/examples/nvidia/specify_uuid_not_use.yaml |
|
This function uses on the pod annotation and only works with nvidia gpu. My commit can use on the node annotation and support all current vendors. From excluding scope, this function guarantees that pod avoids use gpu by uuid and the purpose of my commit is that node avoids use gpu by uuid. |
|
Hello, someone can help me to review this pr? |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a feature to disable specific GPUs on a node via annotations, which is a valuable addition. The implementation is generic and supports multiple vendors, which is great.
However, I've found a few issues:
- There's a discrepancy between the PR description and the implementation. The description states that disabled GPUs are marked as fully utilized, while the code removes them from the scheduler's view. Please clarify the intended behavior and update the description if necessary.
- There appears to be a critical compilation error due to the use of undefined constants from the
metaxpackage. - The parsing of UUIDs from annotations could be more robust.
- The unit tests for the new functionality are minimal and should be expanded to cover more scenarios.
I've left detailed comments on these points. Please address them to ensure the quality and correctness of the feature.
pkg/scheduler/nodes.go
Outdated
| var vendorUUIDMap = map[string][]string{ | ||
| nvidia.GPUNoUseUUID: {nvidia.NvidiaGPUDevice}, | ||
| cambricon.MLUNoUseUUID: {cambricon.CambriconMLUDevice}, | ||
| hygon.DCUNoUseUUID: {hygon.HygonDCUDevice}, | ||
| iluvatar.IluvatarNoUseUUID: {iluvatar.IluvatarGPUDevice}, | ||
| enflame.EnflameNoUseUUID: {enflame.EnflameGPUDevice}, | ||
| mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice}, | ||
| metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice}, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vendorUUIDMap uses constants from various device packages. However, metax.MetaxNoUseUUID and metax.MetaxSGPUDevice do not seem to be defined in pkg/device/metax/device.go. This will cause a compilation error. Please ensure all constants used here are defined in their respective packages.
Similarly, enflame.EnflameNoUseUUID is used, but its definition is not visible in the PR context. Please verify it exists.
docs/config.md
Outdated
| * `nvidia.com/nouse-gpuuuid`: | ||
| String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example ie: "GPU-XXX" only shows a single device ID, but the description says it supports multiple. It would be more helpful to provide an example with multiple IDs to match the description and the Chinese documentation.
| String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-XXX" | |
| String type, supports multiple device IDs. Multiple devices should be separated by comma, ie: "GPU-AAA,GPU-BBB" |
pkg/scheduler/nodes.go
Outdated
| for annokKey, vendor := range vendorUUIDMap { | ||
| klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor) | ||
| if value, ok := nodeInfo.Node.Annotations[annokKey]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a typo in the variable name annokKey. It should be annoKey. The log message also contains a typo annokey.
| for annokKey, vendor := range vendorUUIDMap { | |
| klog.V(5).Infof("Current annokey is %s, and vendor is %v", annokKey, vendor) | |
| if value, ok := nodeInfo.Node.Annotations[annokKey]; ok { | |
| for annoKey, vendor := range vendorUUIDMap { | |
| klog.V(5).Infof("Current annoKey is %s, and vendor is %v", annoKey, vendor) | |
| if value, ok := nodeInfo.Node.Annotations[annoKey]; ok { |
| for _, disableGPUUUID := range disableGPUUUIDList { | ||
| disableGPUUUIDVendorMap[disableGPUUUID] = vendor | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parsing of comma-separated UUIDs can be made more robust. If the annotation value contains extra spaces or empty parts (e.g., "uuid1, ,uuid2"), strings.Split will produce empty strings in the list, which might lead to unexpected behavior. It's good practice to trim whitespace from each UUID and ignore empty ones.
for _, disableGPUUUID := range disableGPUUUIDList {
if id := strings.TrimSpace(disableGPUUUID); id != "" {
disableGPUUUIDVendorMap[id] = vendor
}
}| func Test_rmDeviceByNodeAnnotation(t *testing.T) { | ||
| id1 := "60151478-4709-4242-a8c1-a944252d194b" | ||
| type args struct { | ||
| nodeInfo *util.NodeInfo | ||
| } | ||
| tests := []struct { | ||
| name string | ||
| args args | ||
| want []util.DeviceInfo | ||
| }{ | ||
| { | ||
| name: "Test remove device", | ||
| args: args{ | ||
| nodeInfo: &util.NodeInfo{ | ||
| Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1}}}, | ||
| Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}, | ||
| }, | ||
| }, | ||
| want: []util.DeviceInfo{}, | ||
| }, | ||
| { | ||
| name: "Test no removing device", | ||
| args: args{ | ||
| nodeInfo: &util.NodeInfo{ | ||
| Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{"test-key": ""}}}, | ||
| Devices: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}, | ||
| }, | ||
| }, | ||
| want: []util.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}, | ||
| }, | ||
| } | ||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| if got := rmDeviceByNodeAnnotation(tt.args.nodeInfo); !reflect.DeepEqual(got, tt.want) { | ||
| t.Errorf("rmDeviceByNodeAnnotation() = %v, want %v", got, tt.want) | ||
| } | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases for rmDeviceByNodeAnnotation are not comprehensive enough. Please consider adding more tests to cover the following scenarios:
- Multiple UUIDs in the annotation value.
- Multiple devices on the node, with some being removed and some not.
- Multiple annotations for different vendors on the same node.
- Malformed annotation values (e.g., with extra spaces or empty parts like
"uuid1,,uuid2"). - A case where a UUID matches but the device vendor does not.
|
@Shouren Hello, I add node nouse-id docs and unit test case. If you have spare time, please help me to check those. |
pkg/scheduler/nodes.go
Outdated
| "github.com/Project-HAMi/HAMi/pkg/device/nvidia" | ||
| ) | ||
|
|
||
| var vendorUUIDMap = map[string][]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZhengW22 It is a map mapping from a key in annotation to vendor Name, the variable name vendorUUIDMap make me confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I change the map name from vendorUUIDMap to vendorNoUseAnnoKeyMap?
pkg/scheduler/nodes.go
Outdated
| } | ||
|
|
||
| func rmDeviceByNodeAnnotation(nodeInfo *util.NodeInfo) []util.DeviceInfo { | ||
| disableGPUUUIDVendorMap := make(map[string][]string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZhengW22 I prefer a map[vendorName]map[uuid]bool map which avoiding potential UUID conflict
Do you mean i should delete md change in this project? And I only find English version of the documents in this project https://github.com/Project-HAMi/website, does this mean that I only need to submit the English version of the documentation? |
|
@Shouren Hello, I modify the code logic according to your suggestions. |
@Nimbus318 Can you tell @ZhengW22 how to add Chinese version of the docs in website repo ? |
|
@ZhengW22 You need to update both the corresponding English docs in the docs/userguide directory and the corresponding Chinese docs in the i18n/zh/docusaurus-plugin-content-docs/current/userguide directory. |
|
@Shouren , Hello, I already created the new documents pr in project website. |
@ZhengW22 Please remove the docs in this PR and i will check it later. |
|
@Shouren Hello, I have already the doc file. |
Shouren
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Shouren, ZhengW22 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
please rebase the code to pass the CI |
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
…." This reverts commit 8ba1840. Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
|
New changes are detected. LGTM label has been removed. |
Signed-off-by: ZhengW22 <s123zz123@gmail.com>
|
@Shouren @archlitchi Hello, I have rebased the code and fix some problems. |
|
@Shouren Hello, can you help me to review the new code? |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a useful feature to disable specific GPUs on a node via annotations. The overall approach of filtering out disabled devices from the scheduler's node cache is sound. However, the description of the implementation in the PR body is inaccurate. It states that disabled GPUs are marked as fully utilized, but the code actually removes them from the node's device list. Please update the description to reflect the actual implementation.
I've found a couple of critical bugs. One is that UUIDs from annotations are not correctly trimmed, which would cause the feature to fail with whitespace in the annotation value. Another is a reference to a constant that doesn't seem to be defined, which will cause a compilation failure. I've also provided some suggestions for refactoring to improve code clarity and for adding a test case to improve coverage. Please review the comments.
| iluvatar.IluvatarNoUseUUID: {iluvatar.IluvatarGPUDevice}, | ||
| enflame.EnflameNoUseUUID: {enflame.EnflameGPUDevice}, | ||
| mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice}, | ||
| metax.MetaxNoUseUUID: {metax.MetaxGPUDevice, metax.MetaxSGPUDevice}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant metax.MetaxNoUseUUID is used here, but it does not appear to be defined in the metax package based on the provided files. This will likely cause a compilation error. Please ensure this constant is defined, for example in pkg/device/metax/device.go or a similar file, with a value like "metax-tech.com/nouse-gpuuuid".
| if vendorWithDisableGPUUUIDMap[vendor] == nil { | ||
| newVendorMap := make(map[string]bool) | ||
| newVendorMap[disableGPUUUID] = true | ||
| vendorWithDisableGPUUUIDMap[vendor] = newVendorMap | ||
| } else { | ||
| vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bug in this section. The disableGPUUUID variable, which may contain leading/trailing whitespace, is used as a map key. The trimmed version id should be used instead to ensure correct matching. Additionally, the map initialization logic can be simplified to be more idiomatic.
if vendorWithDisableGPUUUIDMap[vendor] == nil {
vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
}
vendorWithDisableGPUUUIDMap[vendor][id] = true| for _, d := range nodeInfo.Devices { | ||
| removeFlag := false | ||
| if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok { | ||
| if ok := disableGPUUUIDMap[d.ID]; ok { | ||
| klog.V(5).Infof("Disable gpu uuid is : %s", d.ID) | ||
| removeFlag = true | ||
| } | ||
| } | ||
| if !removeFlag { | ||
| tmp = append(tmp, d) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop for filtering devices can be simplified for better readability. Using a continue statement would make the logic clearer and avoid the removeFlag variable. This also resolves the shadowing of the ok variable, which is a good practice to avoid.
for _, d := range nodeInfo.Devices {
if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
if _, disabled := disableGPUUUIDMap[d.ID]; disabled {
klog.V(5).Infof("Disabling gpu with uuid: %s", d.ID)
continue
}
}
tmp = append(tmp, d)
}| } | ||
| } | ||
|
|
||
| func Test_rmDeviceByNodeAnnotation(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test suite for rmDeviceByNodeAnnotation is good, but it would be beneficial to add a test case that verifies that UUIDs with leading/trailing whitespace in the annotation value are handled correctly. This would ensure the strings.TrimSpace logic is effective and prevent regressions.
For example:
{
name: "Test remove device with whitespace in annotation",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: " " + id1 + " "}}},
Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
},
},
want: []device.DeviceInfo{},
},|
@ZhengW22 Could you please take a look at the critical comments in gemini-code-assist? |
What type of PR is this?
What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.
The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.
Which issue(s) this PR fixes:
No.
Special notes for your reviewer:
No.
Does this PR introduce a user-facing change?:
No.