Skip to content

Conversation

@gurasinghMS
Copy link
Contributor

Clean cherry pick of PR #2657

This PR fixes a couple of problems with the driver.

1. Namespace persisting and being reused after namespace removal from the controller

Figjamexport2.
It addresses the situation where a namespace is removed from the controller (and therefore from the disk exposed via VTL0), but the driver continues to hold a reference to that namespace. In this state, the driver would return a stale or invalid Namespace object—one that can no longer perform valid I/O.

The approach in this PR is to avoid keeping a strong Arc in the driver. Instead, the driver stores only a Weak for any namespace it hands out. When the disk is dropped, the strong reference disappears, causing the driver’s weak reference to naturally invalidate via normal Arc/Weak semantics. This eliminates the dangling reference issue without requiring additional bookkeeping.

As discussed during office hours, the window for failure with this approach is expected to be extremely narrow, as the driver relies on strict single-ownership of the Namespace object. To further enforce this single-ownership model, this PR introduces a non-cloneable wrapper around Arc. This wrapper exposes the same interface as Namespace and is consumed by the disk, but it prevents cloning of the underlying Arc anywhere along the creation path. This ensures the driver remains the sole authority over Namespace ownership and lifecycle.

2. Namespace being reused after disk removal from VTL2 settings.

image

This is the issue being described in #2656 . This PR brings back the check to verify single-ownership of the Namespace objects. This time around it is possible to verify using the above described Arc/Weak semantics.

… unique ownership for namespaces by nsid (microsoft#2657)

This PR fixes a couple of problems with the driver.
## 1. `Namespace` persisting and being reused after namespace removal
from the controller

![Figjamexport2](https://github.com/user-attachments/assets/3a098676-9c4d-428d-981c-6880ddaa7262).
It addresses the situation where a namespace is removed from the
controller (and therefore from the disk exposed via VTL0), but the
driver continues to hold a reference to that namespace. In this state,
the driver would return a stale or invalid Namespace object—one that can
no longer perform valid I/O.

The approach in this PR is to avoid keeping a strong Arc<Namespace> in
the driver. Instead, the driver stores only a Weak<Namespace> for any
namespace it hands out. When the disk is dropped, the strong reference
disappears, causing the driver’s weak reference to naturally invalidate
via normal Arc/Weak semantics. This eliminates the dangling reference
issue without requiring additional bookkeeping.

As discussed during office hours, the window for failure with this
approach is expected to be extremely narrow, as the driver relies on
strict single-ownership of the Namespace object. To further enforce this
single-ownership model, this PR introduces a non-cloneable wrapper
around Arc<Namespace>. This wrapper exposes the same interface as
Namespace and is consumed by the disk, but it prevents cloning of the
underlying Arc<Namespace> anywhere along the creation path. This ensures
the driver remains the sole authority over Namespace ownership and
lifecycle.

## 2. `Namespace` being reused after disk removal from VTL2 settings.
<img width="7096" height="2979" alt="image"
src="https://github.com/user-attachments/assets/f519db98-447c-4a95-a7a2-c70536e07ae0"
/>

This is the issue being described in microsoft#2656 . This PR brings back the
check to verify single-ownership of the `Namespace` objects. This time
around it is possible to verify using the above described `Arc`/`Weak`
semantics.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Matt LaFayette (Kurjanowicz) <mattkur@microsoft.com>
Co-authored-by: Alex Landau <alexlandau@microsoft.com>
(cherry picked from commit 34cc2bb)
@gurasinghMS gurasinghMS requested review from a team as code owners January 27, 2026 00:00
Copilot AI review requested due to automatic review settings January 27, 2026 00:00
@github-actions github-actions bot added the release_1.7.2511 Targets the release/1.7.2511 branch. label Jan 27, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes critical namespace lifecycle issues in the NVMe driver by implementing Arc/Weak reference semantics to enforce unique ownership guarantees for namespace objects.

Changes:

  • Introduces NamespaceHandle wrapper to prevent cloning of namespace Arc references, enforcing single-ownership semantics
  • Implements WeakOrStrong<T> enum in the driver to manage namespace references during normal operation (Weak) vs. after restore (Strong)
  • Updates duplicate detection logic to use weak reference upgradeability as the indicator of active use
  • Adds debug logging to improve traceability of namespace lifecycle events
  • Re-enables previously disabled storvsp_dynamic_add_disk test with adjusted parameters

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs Adds NamespaceHandle wrapper, makes check_active public, improves debug logging
vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs Implements WeakOrStrong enum, refactors namespace management to use weak references, updates duplicate detection logic
vm/devices/storage/disk_nvme/nvme_driver/src/lib.rs Updates public API to export NamespaceHandle instead of Namespace
vm/devices/storage/disk_nvme/src/lib.rs Updates NvmeDisk to use NamespaceHandle
vm/devices/storage/storvsp/src/lib.rs Adds debug logging for rescan notifications
openhcl/underhill_core/src/nvme_manager/*.rs Updates API signatures to use NamespaceHandle
vm/devices/storage/disk_nvme/nvme_driver/fuzz/fuzz_nvme_driver.rs Updates fuzz test to use NamespaceHandle
vmm_tests/vmm_tests/tests/tests/x86_64/storage.rs Re-enables dynamic disk test, adjusts retry parameters

Comment on lines +609 to +613
// If we have a weak ref, make sure it is not upgradeable (that means we have a duplicate somewhere).
let is_weak = namespace.is_weak(); // This value will change after invoking get_arc().
let namespace = namespace.get_arc();
if let Some(namespace) = namespace {
if is_weak && namespace.check_active().is_ok() {
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplicate detection logic has a flaw when a namespace has been hot-removed. If a namespace is marked as removed (check_active returns Err), but a NamespaceHandle still exists somewhere (weak.upgrade succeeds), this code will return the removed namespace instead of either creating a new one or returning an error. This violates the single-ownership guarantee and can lead to multiple NamespaceHandles referencing a removed namespace.

Consider checking for duplicates regardless of the removed status, or explicitly handling the removed case separately. For example:

if let Some(namespace) = namespace {
    if is_weak {
        // If we can upgrade the weak reference, someone still holds a NamespaceHandle
        return Err(NamespaceError::Duplicate(nsid));
    }
    // Strong reference from restore
    return Ok(NamespaceHandle::new(namespace));
}

Alternatively, if returning a removed namespace is intentional, this behavior should be documented and the implications should be clear.

Suggested change
// If we have a weak ref, make sure it is not upgradeable (that means we have a duplicate somewhere).
let is_weak = namespace.is_weak(); // This value will change after invoking get_arc().
let namespace = namespace.get_arc();
if let Some(namespace) = namespace {
if is_weak && namespace.check_active().is_ok() {
// If we have a weak ref and can still upgrade it, someone else holds a NamespaceHandle
// and we must treat this as a duplicate, regardless of the namespace's active/removed status.
let is_weak = namespace.is_weak(); // This value will change after invoking get_arc().
let namespace = namespace.get_arc();
if let Some(namespace) = namespace {
if is_weak {

Copilot uses AI. Check for mistakes.
Comment on lines +685 to +688
let is_weak = namespace.is_weak(); // This value will change after invoking get_arc().
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()
&& is_weak
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The save logic has a subtle issue: namespaces are only saved if is_weak is true before calling get_arc(). After restore, namespaces start as WeakOrStrong::Strong. If save() is called again before any namespace is retrieved (e.g., during multiple servicing operations in quick succession), none of the restored namespaces will be saved because they're still Strong references.

Consider either:

  1. Removing the && is_weak condition to save all active namespaces regardless of their Strong/Weak state
  2. Adding a comment explaining why namespaces that were restored but never retrieved don't need to be saved
  3. Eagerly downgrading all Strong references to Weak after restore is complete

The current logic could lead to namespace state loss in edge cases where servicing operations occur rapidly.

Suggested change
let is_weak = namespace.is_weak(); // This value will change after invoking get_arc().
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()
&& is_weak
if let Some(ns) = namespace.get_arc()
&& ns.check_active().is_ok()

Copilot uses AI. Check for mistakes.
io_issuers: Arc<IoIssuers>,
#[inspect(skip)]
rescan_notifiers: Arc<RwLock<HashMap<u32, mesh::Sender<()>>>>,
/// NVMe namespaces associated with this driver. Mapping nsid to NamespaceHandle.
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "Mapping nsid to NamespaceHandle" but the actual type is HashMap<u32, WeakOrStrong<Namespace>>. The comment should be updated to accurately reflect the implementation, e.g., "NVMe namespaces associated with this driver. Maps nsid to WeakOrStrong."

Suggested change
/// NVMe namespaces associated with this driver. Mapping nsid to NamespaceHandle.
/// NVMe namespaces associated with this driver. Maps nsid to WeakOrStrong<Namespace>.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release_1.7.2511 Targets the release/1.7.2511 branch.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants