Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 102 additions & 5 deletions api/v1/dataprocessingunitconfig_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,27 +22,124 @@ import (

// EDIT THIS FILE! THIS IS SCAFFOLDING FOR YOU TO OWN!
// NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.
// ========== Add: Define supported DPU operation types ==========
type DpuOperationType string

const (
// DpuOpNone No operation (default)
DpuOpNone DpuOperationType = "None"
// DpuOpFirmwareUpgrade Firmware upgrade operation
DpuOpFirmwareUpgrade DpuOperationType = "FirmwareUpgrade"
// DpuOpRestart DPU restart operation (mandatory after firmware upgrade)
DpuOpRestart DpuOperationType = "Reboot"
)

// ========== Add: Define firmware types ==========
type DpuFirmwareType string

const (
// DpuFirmwareTypeOAM OAM type firmware
DpuFirmwareTypeOAM DpuFirmwareType = "OAM"
// DpuFirmwareTypeSDK SDK type firmware
DpuFirmwareTypeSDK DpuFirmwareType = "SDK"
)

type DpuOperationStatusPhase string

const (
// DpuPhasePending Operation pending execution (default)
DpuPhasePending DpuOperationStatusPhase = "Pending"
// DpuPhaseRunning Operation is in progress
DpuPhaseRunning DpuOperationStatusPhase = "Running"
// DpuPhaseSucceeded Operation completed successfully
DpuPhaseSucceeded DpuOperationStatusPhase = "Succeeded"
// DpuPhaseFailed Operation execution failed
DpuPhaseFailed DpuOperationStatusPhase = "Failed"
// DpuPhaseCancelled Operation was cancelled
DpuPhaseCancelled DpuOperationStatusPhase = "Cancelled"
)

// ========== Add: Define OAM firmware configuration ==========
type DpuFirmwareSpec struct {
// Firmware type (OAM/SDK)
// +kubebuilder:validation:Required
// +kubebuilder:validation:Enum=OAM;SDK
Type DpuFirmwareType `json:"type"`

// Target firmware version number, required
// +kubebuilder:validation:Required
TargetVersion string `json:"targetVersion"`

// Firmware image path/package path (e.g. /quay.io/openshift/firmware/dpu:v1.0.8)
// +kubebuilder:validation:Required
FirmwarePath string `json:"firmwarePath,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having your firmware inside a container (using it as File System) works for you?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for the POC provided by you, it works now. We already verfied it in SynaXG-operator v1.2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a default firmware version hard-coded in the code. Since we want to ship the whole stack, we need full control of all components including the firmware. Otherwise we can't ensure that it works e2e.


}

type DataProcessingUnitManagement struct{
// DPU operation type to execute: None/FirmwareUpgrade/Restart
// Modify this field to trigger the corresponding operation!!!
Operation DpuOperationType `json:"operation,omitempty"`

// Detailed configuration for firmware upgrade, required when Operation is upgrade type
// +kubebuilder:validation:RequiredWhen=Operation,FirmwareUpgrade
Firmware DpuFirmwareSpec `json:"firmware,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuration for fw upgrade? That's new to me. Please explain the situation where that would be useful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take the SynaXG card as an example: the card contains an internal SDK. To upgrade this SDK, we want to avoid manual intervention and instead automate the process via an Operator.

It is important to note that the SynaXG card does not run Kubernetes or OpenShift (OCP); instead, it runs OAM (Operations, Administration, and Maintenance) software, which communicates with the SynaXG VSP through gRPC. The SynaXG VSP downloads the SDK image and transmits it to the SynaXG card via gRPC, where the local OAM software then executes the SDK installation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host runs OCP, and that means that for it to be compatible with what's on the card even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.

Are we talking about custom firmware? Where is the firmware released?

Copy link
Author

@einsteinXue einsteinXue Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host runs OCP, and that means that for it to be compatible with what's on the card even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.

CustomSoftware on card is just a gRPC server. vsp has a gRPC client(connected with the server on card). This means host side and card side are compatible.
Could this implementation potentially deviate from the intended architecture of the dpu-operator?

Are we talking about custom firmware? Where is the firmware released?

Yes. we released the firmware as a docker image here:
https://quay.io/repository/synaxgcom/sdk-img?tab=tags

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are "supporting SynaXG" from the host side, it means we need to know what firmware is on the card (not what it does, only that it's a specific version). Can we extend the API to report the firmware version as a string?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on the specific concerns you have regarding this?
Do you mean we shouldn't have 3 fields in DpuFirmwareSpec , instead just a firmware version is enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we simply allow an opaque blob to configure the card, we don't have any way to have coverage for this. It opens the door to allow any firmware and any configuration to be programmed. At least, we should have some check somewhere that ties down what is allowed in OCP. This should hold even if it's on the card, since proper functioning of the OCP deployment depends on what's running on the card.

We want to manually control what is allowed. If you have a specific FW version (the ones you linked on quay), we need the code to check if it's a good firmware to use. This way, our QE knows what to test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear your concerns. Do you have a specific approach in mind for validating the firmware's reliability? This is particularly relevant for the SynaXG plugin, as we may need to adjust our code accordingly. Any proposals or advice you could share would be very helpful for our next steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an API from the agnostic part down that gets the firmware version through the GRPC API. And another call to the vsp that returns a list of supported fw versions. Then we can compare if the DPU returned a version that's in the list.


}
// DataProcessingUnitConfigSpec defines the desired state of DataProcessingUnitConfig.
type DataProcessingUnitConfigSpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file

// DpuSelector specifies which DPUs this DpuConfig CR should target.
// If empty, the DpuConfig will target all DPUs.
// +optional
//matchLabels: nodeName, pci-address
//pci-address is required. 1 node might have multiple DPUs of the same vendor.
//so the specify the target DPU, pci-address is necessary
DpuSelector *metav1.LabelSelector `json:"dpuSelector,omitempty"`

// Foo is an example field of DataProcessingUnitConfig. Edit dataprocessingunitconfig_types.go to remove/update
Foo string `json:"foo,omitempty"`
//each DPU has 1 specific CR
DpuManagement DataProcessingUnitManagement `json:"dpuManagement,omitempty"`
}

// DataProcessingUnitConfigStatus defines the observed state of DataProcessingUnitConfig.
type DataProcessingUnitConfigStatus struct {
// DpuNodeOperationStatus defines the observed state of DataProcessingUnitConfig.
type DpuNodeOperationStatus struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make sure that we can use kubectl wait ... to block on firmware upgrades.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that the update logic for the Phase field must be highly reliable? In other words, it must accurately reflect the actual real-time status of the operation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We have the right infrastructure in place with APIs to ensure that we know when the card is up. We already ping the card from the host, and we track the state of the success of the ping. You could hook into that logic.

// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
// Important: Run "make" to regenerate code after modifying this file
NodeName string `json:"nodeName"`

// PciAddress of the target DPU device on this node.
// +optional
PciAddress string `json:"pciAddress,omitempty"`

// Sub-operation type: distinguish between FirmwareUpgrade and Restart
SubOperation DpuOperationType `json:"subOperation"`

// Firmware type (valid only when SubOperation is FirmwareUpgrade): OAM/SDK
FirmwareType DpuFirmwareType `json:"firmwareType,omitempty"`

// Operation execution status: Pending/Running/Succeeded/Failed
Phase cv `json:"phase"`

// Operation start time
StartTime *metav1.Time `json:"startTime,omitempty"`

// Operation completion time
CompletionTime *metav1.Time `json:"completionTime,omitempty"`

// Upgrade-related versions (valid only when SubOperation is FirmwareUpgrade)
OriginalVersion string `json:"originalVersion,omitempty"` // Version before upgrade
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why omitempty?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This DpuNodeOperationStatus field manages both 'Reboot' and 'Firmware Upgrade' status. Note that OriginVersion is not required for 'Reboot' tasks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DPU config CR is user facing, and I'm wondering what this field gives us.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you upgrade the firmware ,this field will tell you the firmware version before upgrade, so that the user can compare it with the targetVersion.
But the name of this field is not accurate, maybe previousVersion is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previousVersion is indeed better. It seems to me that there is always some version running on the card. Can we report the previous version when we've detected it and remove the "omit empty" portion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem

TargetVersion string `json:"targetVersion,omitempty"` // Target version for upgrade

// Error message (required when operation fails)
ErrorMessage string `json:"errorMessage,omitempty"`
}

type DataProcessingUnitConfigStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
// Important: Run "make" to regenerate code after modifying this file
NodeStatus DpuNodeOperationStatus `json:"nodeStatuses,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:shortName=dpuconfig
Expand Down
14 changes: 7 additions & 7 deletions cmd/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -139,13 +139,13 @@ func main() {
os.Exit(1)
}
}
if err := (&controller.DataProcessingUnitConfigReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create controller", "controller", "DataProcessingUnitConfig")
os.Exit(1)
}
// if err := (&controller.DataProcessingUnitConfigReconciler{
// Client: mgr.GetClient(),
// Scheme: mgr.GetScheme(),
// }).SetupWithManager(mgr); err != nil {
// setupLog.Error(err, "unable to create controller", "controller", "DataProcessingUnitConfig")
// os.Exit(1)
// }
//+kubebuilder:scaffold:builder

if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
Expand Down
17 changes: 17 additions & 0 deletions dpu-api/api.proto
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,11 @@ service NetworkFunctionService {
rpc DeleteNetworkFunction(NFRequest) returns (Empty);
}

service DataProcessingUnitManagementService {
rpc DpuRebootFunction(DPUManagementRequest) returns (DPUManagementResponse);
rpc DpuFirmwareUpgradeFunction(DPUManagementRequest) returns (DPUManagementResponse);
}

message InitRequest {
bool dpu_mode = 1;
string dpu_identifier = 2;
Expand Down Expand Up @@ -67,3 +72,15 @@ message PingResponse {
string responder_id = 2;
bool healthy = 3;
}

message DPUManagementRequest{
string nodeName = 1;
string pciAddress = 2;
string firmwareType = 3;
string firmwareImagePath = 4;
}

message DPUManagementResponse{
string status = 1;
string message = 2;
}
Loading