-
Notifications
You must be signed in to change notification settings - Fork 163
DO NOT MERGE/WIP: Retry all VFIO operations #2665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This PR modifies files containing For more on why we check whole files, instead of just diffs, check out the Rustonomicon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This WIP pull request refactors VFIO error handling by moving retry logic from the low-level vfio_sys crate to the higher-level user_driver crate. The changes introduce a VfioErrata enum to classify errors and a VfioDelayWorkaroundConfig helper to centralize retry logic for working around a kernel race condition.
Changes:
- Introduced VfioErrata enum to identify ENODEV errors requiring the delay workaround
- Removed driver parameter and built-in retry logic from vfio_sys::Group::open_device and set_keep_alive methods
- Added VfioDelayWorkaroundConfig helper with retry_on_enodev and retry_unconditional methods
- Applied retry logic to open_noiommu operation in VfioDevice::restore
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vm/devices/user_driver/vfio_sys/src/lib.rs | Adds VfioErrata enum for error classification; removes driver parameter and retry logic from open_device and set_keep_alive; removes unused async driver imports |
| vm/devices/user_driver/src/vfio.rs | Adds VfioDelayWorkaroundConfig helper struct with retry methods; applies retry_unconditional to open_noiommu; updates method signatures to match vfio_sys changes |
| // There is a small race window in the 6.1 kernel between when the | ||
| // vfio device is visible to userspace, and when it is added to its | ||
| // internal list. Try one more time on ENODEV failure after a brief | ||
| // sleep. | ||
| pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | ||
| where | ||
| F: AsyncFnMut(bool) -> anyhow::Result<T>, | ||
| { | ||
| match f(false).await { | ||
| Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => { | ||
| tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay"); | ||
| self.sleep().await; | ||
| f(true).await | ||
| } | ||
| r => r, | ||
| } | ||
| } | ||
|
|
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry_on_enodev method is defined but never used in the code. Either use it to wrap open_device and set_keep_alive calls (which previously had ENODEV retry logic), or remove it if retry_unconditional is the intended approach for all operations.
| // There is a small race window in the 6.1 kernel between when the | |
| // vfio device is visible to userspace, and when it is added to its | |
| // internal list. Try one more time on ENODEV failure after a brief | |
| // sleep. | |
| pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | |
| where | |
| F: AsyncFnMut(bool) -> anyhow::Result<T>, | |
| { | |
| match f(false).await { | |
| Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => { | |
| tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay"); | |
| self.sleep().await; | |
| f(true).await | |
| } | |
| r => r, | |
| } | |
| } |
| if observed.downcast_ref::<nix::errno::Errno>() == Some(&nix::errno::Errno::ENODEV) { | ||
| return VfioErrata::DelayWorkaround; | ||
| } | ||
| return VfioErrata::None; |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.
| return VfioErrata::None; | |
| VfioErrata::None |
| return VfioErrata::DelayWorkaround; | ||
| } | ||
| return VfioErrata::None; |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.
| return VfioErrata::DelayWorkaround; | |
| } | |
| return VfioErrata::None; | |
| VfioErrata::DelayWorkaround | |
| } else { | |
| VfioErrata::None | |
| } |
| pub async fn retry_unconditional<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T> | ||
| where | ||
| F: AsyncFnMut(bool) -> anyhow::Result<T>, |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AsyncFnMut closure accepts a boolean parameter 'is_retry' but it's not used by the calling code. In the usage at line 131-133, the closure ignores this parameter with '_'. Consider either removing this unused parameter from the trait bound, or documenting its intended purpose if it's meant for future use.
| group.set_keep_alive(pci_id).await?; | ||
| } | ||
| tracing::debug!(pci_id, "about to open device"); | ||
| let device = group.open_device(pci_id, &driver).await?; | ||
| let device = group.open_device(pci_id).await?; |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry logic has been removed but open_device and set_keep_alive are not being retried in the calling code. According to the PR title "Retry all VFIO operations", these operations should be wrapped with retry logic. Currently, only open_noiommu is being retried (line 130-134), but open_device (line 146) and set_keep_alive (line 143) are called without retry wrappers, even though they originally had ENODEV retry logic that was removed from vfio_sys.
| let group_id = vfio_sys::Group::find_group_for_device(&path)?; | ||
| let group = vfio_sys::Group::open_noiommu(group_id)?; | ||
| let group = workaround | ||
| .retry_unconditional("open_noiommu", async |_| { |
Copilot
AI
Jan 21, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry_unconditional method retries on any error, which is overly broad for the stated purpose of working around a specific kernel race condition (ENODEV). This could mask real errors and retry operations that should fail fast. Consider using retry_on_enodev instead, or document why unconditional retry is necessary for this specific operation.
| .retry_unconditional("open_noiommu", async |_| { | |
| .retry_on_enodev("open_noiommu", async |_| { |
No description provided.