Skip to content

Conversation

@mattkur
Copy link
Contributor

@mattkur mattkur commented Jan 21, 2026

No description provided.

Copilot AI review requested due to automatic review settings January 21, 2026 00:13
@mattkur mattkur requested a review from a team as a code owner January 21, 2026 00:13
@github-actions github-actions bot added the unsafe Related to unsafe code label Jan 21, 2026
@github-actions
Copy link

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This WIP pull request refactors VFIO error handling by moving retry logic from the low-level vfio_sys crate to the higher-level user_driver crate. The changes introduce a VfioErrata enum to classify errors and a VfioDelayWorkaroundConfig helper to centralize retry logic for working around a kernel race condition.

Changes:

  • Introduced VfioErrata enum to identify ENODEV errors requiring the delay workaround
  • Removed driver parameter and built-in retry logic from vfio_sys::Group::open_device and set_keep_alive methods
  • Added VfioDelayWorkaroundConfig helper with retry_on_enodev and retry_unconditional methods
  • Applied retry logic to open_noiommu operation in VfioDevice::restore

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
vm/devices/user_driver/vfio_sys/src/lib.rs Adds VfioErrata enum for error classification; removes driver parameter and retry logic from open_device and set_keep_alive; removes unused async driver imports
vm/devices/user_driver/src/vfio.rs Adds VfioDelayWorkaroundConfig helper struct with retry methods; applies retry_unconditional to open_noiommu; updates method signatures to match vfio_sys changes

Comment on lines +259 to +276
// There is a small race window in the 6.1 kernel between when the
// vfio device is visible to userspace, and when it is added to its
// internal list. Try one more time on ENODEV failure after a brief
// sleep.
pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T>
where
F: AsyncFnMut(bool) -> anyhow::Result<T>,
{
match f(false).await {
Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => {
tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay");
self.sleep().await;
f(true).await
}
r => r,
}
}

Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry_on_enodev method is defined but never used in the code. Either use it to wrap open_device and set_keep_alive calls (which previously had ENODEV retry logic), or remove it if retry_unconditional is the intended approach for all operations.

Suggested change
// There is a small race window in the 6.1 kernel between when the
// vfio device is visible to userspace, and when it is added to its
// internal list. Try one more time on ENODEV failure after a brief
// sleep.
pub async fn retry_on_enodev<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T>
where
F: AsyncFnMut(bool) -> anyhow::Result<T>,
{
match f(false).await {
Err(e) if VfioErrata::from_error(&e) == VfioErrata::DelayWorkaround => {
tracing::warn!("vfio operation {op_name} got ENODEV, retrying after delay");
self.sleep().await;
f(true).await
}
r => r,
}
}

Copilot uses AI. Check for mistakes.
if observed.downcast_ref::<nix::errno::Errno>() == Some(&nix::errno::Errno::ENODEV) {
return VfioErrata::DelayWorkaround;
}
return VfioErrata::None;
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.

Suggested change
return VfioErrata::None;
VfioErrata::None

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +97
return VfioErrata::DelayWorkaround;
}
return VfioErrata::None;
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary return statement. In Rust, the last expression in a function is automatically returned, so this explicit return is redundant and goes against idiomatic Rust style.

Suggested change
return VfioErrata::DelayWorkaround;
}
return VfioErrata::None;
VfioErrata::DelayWorkaround
} else {
VfioErrata::None
}

Copilot uses AI. Check for mistakes.
Comment on lines +277 to +279
pub async fn retry_unconditional<F, T>(&self, op_name: &str, mut f: F) -> anyhow::Result<T>
where
F: AsyncFnMut(bool) -> anyhow::Result<T>,
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AsyncFnMut closure accepts a boolean parameter 'is_retry' but it's not used by the calling code. In the usage at line 131-133, the closure ignores this parameter with '_'. Consider either removing this unused parameter from the trait bound, or documenting its intended purpose if it's meant for future use.

Copilot uses AI. Check for mistakes.
Comment on lines +143 to +146
group.set_keep_alive(pci_id).await?;
}
tracing::debug!(pci_id, "about to open device");
let device = group.open_device(pci_id, &driver).await?;
let device = group.open_device(pci_id).await?;
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry logic has been removed but open_device and set_keep_alive are not being retried in the calling code. According to the PR title "Retry all VFIO operations", these operations should be wrapped with retry logic. Currently, only open_noiommu is being retried (line 130-134), but open_device (line 146) and set_keep_alive (line 143) are called without retry wrappers, even though they originally had ENODEV retry logic that was removed from vfio_sys.

Copilot uses AI. Check for mistakes.
let group_id = vfio_sys::Group::find_group_for_device(&path)?;
let group = vfio_sys::Group::open_noiommu(group_id)?;
let group = workaround
.retry_unconditional("open_noiommu", async |_| {
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry_unconditional method retries on any error, which is overly broad for the stated purpose of working around a specific kernel race condition (ENODEV). This could mask real errors and retry operations that should fail fast. Consider using retry_on_enodev instead, or document why unconditional retry is necessary for this specific operation.

Suggested change
.retry_unconditional("open_noiommu", async |_| {
.retry_on_enodev("open_noiommu", async |_| {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

unsafe Related to unsafe code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant