-
Notifications
You must be signed in to change notification settings - Fork 33
SynaXG plugin dev scheme review #626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,27 +22,124 @@ import ( | |
|
|
||
| // EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN! | ||
| // NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized. | ||
| // ========== Add: Define supported DPU operation types ========== | ||
| type DpuOperationType string | ||
|
|
||
| const ( | ||
| // DpuOpNone No operation (default) | ||
| DpuOpNone DpuOperationType = "None" | ||
| // DpuOpFirmwareUpgrade Firmware upgrade operation | ||
| DpuOpFirmwareUpgrade DpuOperationType = "FirmwareUpgrade" | ||
| // DpuOpRestart DPU restart operation (mandatory after firmware upgrade) | ||
| DpuOpRestart DpuOperationType = "Reboot" | ||
| ) | ||
|
|
||
| // ========== Add: Define firmware types ========== | ||
| type DpuFirmwareType string | ||
|
|
||
| const ( | ||
| // DpuFirmwareTypeOAM OAM type firmware | ||
| DpuFirmwareTypeOAM DpuFirmwareType = "OAM" | ||
| // DpuFirmwareTypeSDK SDK type firmware | ||
| DpuFirmwareTypeSDK DpuFirmwareType = "SDK" | ||
| ) | ||
|
|
||
| type DpuOperationStatusPhase string | ||
|
|
||
| const ( | ||
| // DpuPhasePending Operation pending execution (default) | ||
| DpuPhasePending DpuOperationStatusPhase = "Pending" | ||
| // DpuPhaseRunning Operation is in progress | ||
| DpuPhaseRunning DpuOperationStatusPhase = "Running" | ||
| // DpuPhaseSucceeded Operation completed successfully | ||
| DpuPhaseSucceeded DpuOperationStatusPhase = "Succeeded" | ||
| // DpuPhaseFailed Operation execution failed | ||
| DpuPhaseFailed DpuOperationStatusPhase = "Failed" | ||
| // DpuPhaseCancelled Operation was cancelled | ||
| DpuPhaseCancelled DpuOperationStatusPhase = "Cancelled" | ||
| ) | ||
|
|
||
| // ========== Add: Define OAM firmware configuration ========== | ||
| type DpuFirmwareSpec struct { | ||
| // Firmware type (OAM/SDK) | ||
| // +kubebuilder:validation:Required | ||
| // +kubebuilder:validation:Enum=OAM;SDK | ||
| Type DpuFirmwareType `json:"type"` | ||
|
|
||
| // Target firmware version number, required | ||
| // +kubebuilder:validation:Required | ||
| TargetVersion string `json:"targetVersion"` | ||
|
|
||
| // Firmware image path/package path (e.g. /quay.io/openshift/firmware/dpu:v1.0.8) | ||
| // +kubebuilder:validation:Required | ||
| FirmwarePath string `json:"firmwarePath,omitempty"` | ||
|
|
||
| } | ||
|
|
||
| type DataProcessingUnitManagement struct{ | ||
| // DPU operation type to execute: None/FirmwareUpgrade/Restart | ||
| // Modify this field to trigger the corresponding operation!!! | ||
| Operation DpuOperationType `json:"operation,omitempty"` | ||
|
|
||
| // Detailed configuration for firmware upgrade, required when Operation is upgrade type | ||
| // +kubebuilder:validation:RequiredWhen=Operation,FirmwareUpgrade | ||
| Firmware DpuFirmwareSpec `json:"firmware,omitempty"` | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Configuration for fw upgrade? That's new to me. Please explain the situation where that would be useful.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Take the SynaXG card as an example: the card contains an internal SDK. To upgrade this SDK, we want to avoid manual intervention and instead automate the process via an Operator. It is important to note that the SynaXG card does not run Kubernetes or OpenShift (OCP); instead, it runs OAM (Operations, Administration, and Maintenance) software, which communicates with the SynaXG VSP through gRPC. The SynaXG VSP downloads the SDK image and transmits it to the SynaXG card via gRPC, where the local OAM software then executes the SDK installation.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The host runs OCP, and that means that for it to be compatible with what's on the card even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account. Are we talking about custom firmware? Where is the firmware released?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
CustomSoftware on card is just a gRPC server. vsp has a gRPC client(connected with the server on card). This means host side and card side are compatible.
Yes. we released the firmware as a docker image here:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we are "supporting SynaXG" from the host side, it means we need to know what firmware is on the card (not what it does, only that it's a specific version). Can we extend the API to report the firmware version as a string?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you elaborate on the specific concerns you have regarding this?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if we simply allow an opaque blob to configure the card, we don't have any way to have coverage for this. It opens the door to allow any firmware and any configuration to be programmed. At least, we should have some check somewhere that ties down what is allowed in OCP. This should hold even if it's on the card, since proper functioning of the OCP deployment depends on what's running on the card. We want to manually control what is allowed. If you have a specific FW version (the ones you linked on quay), we need the code to check if it's a good firmware to use. This way, our QE knows what to test.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I hear your concerns. Do you have a specific approach in mind for validating the firmware's reliability? This is particularly relevant for the SynaXG plugin, as we may need to adjust our code accordingly. Any proposals or advice you could share would be very helpful for our next steps.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need an API from the agnostic part down that gets the firmware version through the GRPC API. And another call to the vsp that returns a list of supported fw versions. Then we can compare if the DPU returned a version that's in the list. |
||
|
|
||
| } | ||
| // DataProcessingUnitConfigSpec defines the desired state of DataProcessingUnitConfig. | ||
| type DataProcessingUnitConfigSpec struct { | ||
| // INSERT ADDITIONAL SPEC FIELDS - desired state of cluster | ||
| // Important: Run "make" to regenerate code after modifying this file | ||
|
|
||
| // DpuSelector specifies which DPUs this DpuConfig CR should target. | ||
| // If empty, the DpuConfig will target all DPUs. | ||
| // +optional | ||
| //matchLabels: nodeName, pci-address | ||
| //pci-address is required. 1 node might have multiple DPUs of the same vendor. | ||
| //so the specify the target DPU, pci-address is necessary | ||
| DpuSelector *metav1.LabelSelector `json:"dpuSelector,omitempty"` | ||
|
|
||
| // Foo is an example field of DataProcessingUnitConfig. Edit dataprocessingunitconfig_types.go to remove/update | ||
| Foo string `json:"foo,omitempty"` | ||
| //each DPU has 1 specific CR | ||
| DpuManagement DataProcessingUnitManagement `json:"dpuManagement,omitempty"` | ||
| } | ||
|
|
||
| // DataProcessingUnitConfigStatus defines the observed state of DataProcessingUnitConfig. | ||
| type DataProcessingUnitConfigStatus struct { | ||
| // DpuNodeOperationStatus defines the observed state of DataProcessingUnitConfig. | ||
| type DpuNodeOperationStatus struct { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. please make sure that we can use
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean that the update logic for the Phase field must be highly reliable? In other words, it must accurately reflect the actual real-time status of the operation?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. We have the right infrastructure in place with APIs to ensure that we know when the card is up. We already ping the card from the host, and we track the state of the success of the ping. You could hook into that logic. |
||
| // INSERT ADDITIONAL STATUS FIELD - define observed state of cluster | ||
| // Important: Run "make" to regenerate code after modifying this file | ||
| NodeName string `json:"nodeName"` | ||
|
|
||
| // PciAddress of the target DPU device on this node. | ||
| // +optional | ||
| PciAddress string `json:"pciAddress,omitempty"` | ||
|
|
||
| // Sub-operation type: distinguish between FirmwareUpgrade and Restart | ||
| SubOperation DpuOperationType `json:"subOperation"` | ||
|
|
||
| // Firmware type (valid only when SubOperation is FirmwareUpgrade): OAM/SDK | ||
| FirmwareType DpuFirmwareType `json:"firmwareType,omitempty"` | ||
|
|
||
| // Operation execution status: Pending/Running/Succeeded/Failed | ||
| Phase cv `json:"phase"` | ||
|
|
||
| // Operation start time | ||
| StartTime *metav1.Time `json:"startTime,omitempty"` | ||
|
|
||
| // Operation completion time | ||
| CompletionTime *metav1.Time `json:"completionTime,omitempty"` | ||
|
|
||
| // Upgrade-related versions (valid only when SubOperation is FirmwareUpgrade) | ||
| OriginalVersion string `json:"originalVersion,omitempty"` // Version before upgrade | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why omitempty?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This DpuNodeOperationStatus field manages both 'Reboot' and 'Firmware Upgrade' status. Note that OriginVersion is not required for 'Reboot' tasks
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The DPU config CR is user facing, and I'm wondering what this field gives us.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you upgrade the firmware ,this field will tell you the firmware version before upgrade, so that the user can compare it with the targetVersion.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. previousVersion is indeed better. It seems to me that there is always some version running on the card. Can we report the previous version when we've detected it and remove the "omit empty" portion?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no problem |
||
| TargetVersion string `json:"targetVersion,omitempty"` // Target version for upgrade | ||
|
|
||
| // Error message (required when operation fails) | ||
| ErrorMessage string `json:"errorMessage,omitempty"` | ||
| } | ||
|
|
||
| type DataProcessingUnitConfigStatus struct { | ||
| // INSERT ADDITIONAL STATUS FIELD - define observed state of cluster | ||
| // Important: Run "make" to regenerate code after modifying this file | ||
| NodeStatus DpuNodeOperationStatus `json:"nodeStatuses,omitempty"` | ||
| } | ||
| // +kubebuilder:object:root=true | ||
| // +kubebuilder:subresource:status | ||
| // +kubebuilder:resource:shortName=dpuconfig | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having your firmware inside a container (using it as File System) works for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for the POC provided by you, it works now. We already verfied it in SynaXG-operator v1.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a default firmware version hard-coded in the code. Since we want to ship the whole stack, we need full control of all components including the firmware. Otherwise we can't ensure that it works e2e.