
Conversation

@einsteinXue

Hi @bn222 @wizhaoredhat @thom311
I have submitted a pull request for review of the design scheme. Could you please help assess whether the approach is sound?

Certain implementation details — including the Dockerfile, Makefile, and SynaXG Plugin logic — have not yet been fully finalized. Please disregard these items for the time being and focus solely on the design scheme. Thank you.

First, I would like to provide some background:
The SynaXG card will not have Kubernetes or OpenShift Container Platform (OCP) installed. As such, the SynaXG VSP will run exclusively on the host side. The VSP’s core functions are twofold:
1) To perform DPU reboot by unbinding the target PCI address.
2) To implement firmware upgrade via gRPC: a gRPC server runs on the SynaXG card itself, while the corresponding gRPC client is deployed within the VSP pod. (This gRPC channel is used only for firmware upgrade and has nothing to do with the dpu-operator.)

Let me briefly introduce my idea:
A "DataProcessingUnitConfig" CRD will be created for each DPU.

Daemon pods will be deployed on all DPU nodes in the cluster, meaning "HostSideManagers" will run on every DPU node.

The "DataProcessingUnitConfigReconciler" is configured within the "HostSideManager", so each DPU node can monitor changes to the "DataProcessingUnitConfig" CRD.

When a user adds "nodeName" and specifies "reboot DPU" in the CRD, each "DataProcessingUnitConfigReconciler" will verify whether its node is the target node. If the node matches, the "DataProcessingUnitConfigReconciler" will call the gRPC method to execute the reboot operation.

From my understanding, the gRPC connection is already established by the "HostSideManager"; thus, when the "DataProcessingUnitConfigReconciler" is initialized, "vsp" is passed in as an input parameter.

  1. [api/v1/dataprocessingunitconfig_types.go]: Added CRD API definitions
  2. [cmd/main.go]: Commented out the setup of dataProcessingUnitConfigReconciler (to be initialized in HostSideManager instead)
  3. [dpu-api/api.proto]: Added DataProcessingUnitManagementService to the gRPC API to support DPU reboot and firmware upgrade operations
  4. [internal/controller/dataprocessingunitconfig_controller.go]: Implemented reconciliation logic for the DataProcessingUnitConfig CRD, and executes "reboot" and "firmware upgrade" operations by invoking gRPC methods from the VSP
  5. [internal/daemon/hostsidemanager.go]: Added setup logic for dataProcessingUnitConfigReconciler to monitor changes to the DataProcessingUnitConfig CRD
  6. [internal/daemon/plugin/vendorplugin.go]: Added interfaces for DpuReboot and FirmwareUpgrade operations
  7. [internal/platform/synaxg-dpu.go]: Added a detector for the SynaXG DPU platform
  8. [internal/platform/vendordetector.go]: Integrated the SynaXG platform detector

@openshift-ci

openshift-ci bot commented Jan 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: einsteinXue
Once this PR has been reviewed and has the lgtm label, please assign bn222 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 9, 2026
@openshift-ci

openshift-ci bot commented Jan 9, 2026

Hi @einsteinXue. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@einsteinXue einsteinXue marked this pull request as draft January 9, 2026 07:05
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2026
@bn222
Copy link
Contributor

bn222 commented Jan 14, 2026

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2026
@openshift-ci
Copy link

openshift-ci bot commented Jan 14, 2026

@einsteinXue: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/make-prow-ci-manifests-check | 3ef8145 | link | true | /test make-prow-ci-manifests-check |
| ci/prow/make-generate-check | 3ef8145 | link | true | /test make-generate-check |
| ci/prow/verify-deps | 3ef8145 | link | true | /test verify-deps |
| ci/prow/make-vendor-check | 3ef8145 | link | true | /test make-vendor-check |
| ci/prow/make-test | 3ef8145 | link | true | /test make-test |
| ci/prow/make-e2e-test | 3ef8145 | link | true | /test make-e2e-test |
| ci/prow/images | 3ef8145 | link | true | /test images |
| ci/prow/make-e2e-test-ptl | 3ef8145 | link | true | /test make-e2e-test-ptl |
| ci/prow/make-fmt-check | 3ef8145 | link | true | /test make-fmt-check |
| ci/prow/make-e2e-test-marvell | 3ef8145 | link | true | /test make-e2e-test-marvell |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@einsteinXue
Author

@bn222 some tests failed.
It appears the current failures are related to compilation, as I haven't performed a full build of the project yet. Before I proceed further with development, could you please review the proposed design scheme? If the approach seems sound to you, I will move forward with the implementation.


// Firmware image path/package path (e.g. /quay.io/openshift/firmware/dpu:v1.0.8)
// +kubebuilder:validation:Required
FirmwarePath string `json:"firmwarePath,omitempty"`
Contributor

Does shipping your firmware inside a container (using it as a file system) work for you?

Author

Yes, thanks for the POC you provided; it works now. We have already verified it in SynaXG-operator v1.2.

Contributor

We need a default firmware version hard-coded in the code. Since we want to ship the whole stack, we need full control of all components including the firmware. Otherwise we can't ensure that it works e2e.


// Detailed configuration for firmware upgrade, required when Operation is upgrade type
// +kubebuilder:validation:RequiredWhen=Operation,FirmwareUpgrade
Firmware DpuFirmwareSpec `json:"firmware,omitempty"`
Contributor

Configuration for fw upgrade? That's new to me. Please explain the situation where that would be useful.

Author

Take the SynaXG card as an example: the card contains an internal SDK. To upgrade this SDK, we want to avoid manual intervention and instead automate the process via an Operator.

It is important to note that the SynaXG card does not run Kubernetes or OpenShift (OCP); instead, it runs OAM (Operations, Administration, and Maintenance) software, which communicates with the SynaXG VSP through gRPC. The SynaXG VSP downloads the SDK image and transmits it to the SynaXG card via gRPC, where the local OAM software then executes the SDK installation.

Contributor

The host runs OCP, and that means it has to stay compatible with what's on the card, even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.

Are we talking about custom firmware? Where is the firmware released?

Author

@einsteinXue einsteinXue Jan 28, 2026

> The host runs OCP, and that means it has to stay compatible with what's on the card, even if custom things run on the card. Once the host is implicated in any way, there is at least some level of compatibility that needs to be taken into account.

The custom software on the card is just a gRPC server; the VSP has a gRPC client connected to that server. This means the host side and the card side are compatible.
Could this implementation potentially deviate from the intended architecture of the dpu-operator?

> Are we talking about custom firmware? Where is the firmware released?

Yes. We release the firmware as a Docker image here:
https://quay.io/repository/synaxgcom/sdk-img?tab=tags

Contributor

If we are "supporting SynaXG" from the host side, it means we need to know what firmware is on the card (not what it does, only that it's a specific version). Can we extend the API to report the firmware version as a string?

Author

Could you elaborate on the specific concerns you have here?
Do you mean we shouldn't have 3 fields in DpuFirmwareSpec, and that a single firmware version string is enough?

Contributor

If we simply allow an opaque blob to configure the card, we don't have any way to have test coverage for this. It opens the door to programming any firmware and any configuration onto the card. At a minimum, we should have a check somewhere that ties down what is allowed in OCP. This should hold even if it's on the card, since proper functioning of the OCP deployment depends on what's running on the card.

We want to manually control what is allowed. If you have a specific FW version (the ones you linked on quay), we need the code to check if it's a good firmware to use. This way, our QE knows what to test.
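A check of that sort could be as simple as a hard-coded allow-list of vetted versions. This is only a sketch; the version strings below are placeholders, not actual SynaXG releases.

```go
package main

import "fmt"

// allowedFirmware is a hypothetical hard-coded allow-list of vetted firmware
// versions; shipping it in the operator code gives QE a fixed set of builds
// to test against.
var allowedFirmware = map[string]bool{
	"1.0.0": true, // placeholder version, not a real SynaXG release
	"1.0.1": true, // placeholder version
}

// ValidateFirmwareVersion rejects any firmware version that has not been
// explicitly vetted for this operator release.
func ValidateFirmwareVersion(version string) error {
	if !allowedFirmware[version] {
		return fmt.Errorf("firmware version %q is not on the supported allow-list", version)
	}
	return nil
}

func main() {
	fmt.Println(ValidateFirmwareVersion("1.0.0")) // <nil>
	fmt.Println(ValidateFirmwareVersion("9.9.9")) // rejected with an error
}
```
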

CompletionTime *metav1.Time `json:"completionTime,omitempty"`

// Upgrade-related versions (valid only when SubOperation is FirmwareUpgrade)
OriginalVersion string `json:"originalVersion,omitempty"` // Version before upgrade
Contributor

Why omitempty?

Author

This DpuNodeOperationStatus field manages both 'Reboot' and 'Firmware Upgrade' status. Note that OriginalVersion is not required for 'Reboot' tasks.

Contributor

The DPU config CR is user facing, and I'm wondering what this field gives us.

Author

If you upgrade the firmware, this field tells you the firmware version before the upgrade, so the user can compare it with targetVersion.
But the name of this field is not accurate; maybe previousVersion is better?

Contributor

previousVersion is indeed better. It seems to me that there is always some version running on the card. Can we report the previous version when we've detected it and remove the "omit empty" portion?

Author

no problem

// DataProcessingUnitConfigStatus defines the observed state of DataProcessingUnitConfig.
type DataProcessingUnitConfigStatus struct {
// DpuNodeOperationStatus defines the observed state of DataProcessingUnitConfig.
type DpuNodeOperationStatus struct {
Contributor

please make sure that we can use kubectl wait ... to block on firmware upgrades.

Author

Do you mean that the update logic for the Phase field must be highly reliable? In other words, it must accurately reflect the actual real-time status of the operation?

Contributor

Yes. We have the right infrastructure in place with APIs to ensure that we know when the card is up. We already ping the card from the host, and we track the state of the success of the ping. You could hook into that logic.
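One way to hook into that ping tracking is to surface it as a status condition, so `kubectl wait --for=condition=FirmwareUpgraded ...` can block on the upgrade. A minimal sketch of condition upserting follows; `Condition` is a stand-in for metav1.Condition and the condition type name is illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a stand-in for metav1.Condition; the real status field would
// use metav1.Condition so `kubectl wait --for=condition=...` works directly.
type Condition struct {
	Type               string // e.g. "FirmwareUpgraded"
	Status             string // "True", "False", or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// SetCondition upserts a condition by Type, bumping LastTransitionTime only
// when the status actually changes (mirroring what meta.SetStatusCondition
// does in apimachinery).
func SetCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status == c.Status {
			c.LastTransitionTime = conds[i].LastTransitionTime
		} else {
			c.LastTransitionTime = time.Now()
		}
		conds[i] = c
		return conds
	}
	c.LastTransitionTime = time.Now()
	return append(conds, c)
}

func main() {
	conds := SetCondition(nil, Condition{Type: "FirmwareUpgraded", Status: "False", Reason: "UpgradeInProgress"})
	// The ping/health tracking would flip the condition once the card answers.
	conds = SetCondition(conds, Condition{Type: "FirmwareUpgraded", Status: "True", Reason: "CardHealthy"})
	fmt.Println(len(conds), conds[0].Status) // 1 True
}
```
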

@@ -517,6 +517,126 @@ func (x *PingResponse) GetHealthy() bool {
return false
Contributor

We're going to need 2 commits, one for the actual changes, and one for the generated code, to make review and history easier. That means one commit won't compile. I know that's painful. We've decided we favor review-ability over being able to compile each commit. Note, we should still be able to compile each PR in its entirety.

type DataProcessingUnitConfigReconciler struct {
client.Client
Scheme *runtime.Scheme
vsp *plugin.VendorPlugin
Contributor

Why do we need to pass in a vsp? We are going to use dpu selectors. Please refer to https://docs.google.com/document/d/1niEIreHlMsnFF4Ol2RidvHBuTpXcRphLKGY7Ebtzo-0/edit?tab=t.0 and implement that design.

} else {
//if label not matched, skip processing
log.Info("Labels are not matched, skip processing")
return ctrl.Result{}, nil
Contributor

Where is the err handling?

}

//if the CR is targeted at DPU rebooting
if dpuConfig.Spec.DpuManagement.Operation == "Reboot" {
Contributor

else if


//if the CR is targeted at DPU rebooting
if dpuConfig.Spec.DpuManagement.Operation == "Reboot" {
r.vsp.RebootDpu(r.pciAddr)
Contributor

Keep level of abstraction as you move deeper in the callstack

Author

Do you mean that pciAddr shouldn't be exposed at this level, and instead should be encapsulated within the vsp client?

Contributor

yes. We don't use pci address to identify cards. We use the DPU CRs (see the doc that I linked earlier) and DPU selectors.


// Extract nodeName and pci-address from selector
if dpuConfig.Spec.DpuSelector != nil {
nodeName, pciAddr, err = GetNodenameAndPCIAddressFromSelector(dpuConfig.Spec.DpuSelector)
Contributor

@bn222 bn222 Jan 22, 2026

If we already do this, why then do we need to pass in vsp in the CR?

General rule: We don't expect the user to repeat themselves. If we can derive fields from other fields, then they will only need to provide the information once.
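That rule can be reduced to a single derivation step: the user sets only the DPU selector, and node name and device identity are derived from the matching DPU CR. The types and lookup below are toy stand-ins for the DPU CRs and selectors from the linked design doc, not the project's actual API.

```go
package main

import "fmt"

// DpuSelector and Dpu are toy stand-ins; the map below plays the role of
// listing DPU CRs from the API server.
type DpuSelector struct {
	DpuName string
}

type Dpu struct {
	Name     string
	NodeName string
	PciAddr  string
}

var dpus = map[string]Dpu{
	"dpu-0": {Name: "dpu-0", NodeName: "worker-0", PciAddr: "0000:03:00.0"},
}

// ResolveSelector derives nodeName and pciAddr from the selector alone, so
// the user never has to repeat those fields elsewhere in the CR.
func ResolveSelector(sel DpuSelector) (nodeName, pciAddr string, err error) {
	dpu, ok := dpus[sel.DpuName]
	if !ok {
		return "", "", fmt.Errorf("no DPU matches selector %q", sel.DpuName)
	}
	return dpu.NodeName, dpu.PciAddr, nil
}

func main() {
	node, pci, _ := ResolveSelector(DpuSelector{DpuName: "dpu-0"})
	fmt.Println(node, pci) // worker-0 0000:03:00.0
}
```
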


func (r *DataProcessingUnitConfigReconciler) CheckPCIAddressExists(pciAddr string) (string, error) {
// check /sys/bus/pci/devices/pciAddr
cmd := exec.Command("ls", fmt.Sprintf("/sys/bus/pci/devices/%s", pciAddr))
Contributor

Abstract all file system access (see tests).

}

func (r *DataProcessingUnitConfigReconciler) getCurrentNodeName() string {
return os.Getenv("MY_NODE_NAME")
Contributor

You're increasing cohesion within the patch between pieces of code. We don't want some code to depend on the node name and then have another piece of code depend on the node name again. Can we please use the serial number concept that we're already using elsewhere?

Author

Will do.
BTW: currently the dpu-operator only supports 1 DPU per node, right?

Contributor

We have not yet tested multiple DPUs per node. However, in the latest refactor, we have laid the foundation for multiple DPUs.

Without testing, and looking at the code, it's more likely that a node with 2 DPUs from different vendors will work than a node with 2 DPUs from the same vendor.

Author

OK, I see the current status.
Testing nodes with up to four SynaXG DPUs is a potential requirement for our project.

Contributor

That sounds good and most of the groundwork to support that is already here.

return g.dsClient.SetNumVfs(context.Background(), c)
}

func (g *GrpcPlugin) RebootDpu(dmr pb.DPUManagementRequest) (*pb.DPUManagementResponse, error) {
Contributor

I'm missing where you're checking the health so that you know you can complete the reboot operation.

Author

Sorry, I did miss the health check.

Contributor

Are you saying you don't know where it is in the code? Or are you saying you will adjust the patch to add it?

Author

I mean I will adjust the patch.

}

func (g *GrpcPlugin) UpgradeFirmware(dmr pb.DPUManagementRequest) (*pb.DPUManagementResponse, error) {
err := g.ensureConnected()
Contributor

I'm missing where you're checking the health so that you know you can complete the update operation.

pathManager utils.PathManager
stopRequested bool
dpListener net.Listener
dpuMgrClient pb2.DataProcessingUnitManagementServiceClient
Contributor

dpuManagerClient.

// FIXME: Must be a unique value on the DPU that is non changing.
func (d *SynaXGDetector) DpuPlatformIdentifier(platform Platform) (plugin.DpuIdentifier, error) {
return plugin.DpuIdentifier("SynaXG-dpu"), nil
}
Contributor

@bn222 bn222 Jan 22, 2026

Add test. Patch one: add the SynaXG detector (if we really need it) + test, Patch 2: rest.

Contributor

@bn222 bn222 left a comment

@einsteinXue
Author

> General idea is ok. Make sure to align with https://docs.google.com/document/d/1niEIreHlMsnFF4Ol2RidvHBuTpXcRphLKGY7Ebtzo-0/edit?tab=t.0

Thank you so much! I will go through your comments one by one and address them accordingly.

@bn222
Contributor

bn222 commented Jan 23, 2026

Thank you. Please make the PR also pass most of the CI lanes.
