Conversation

@chris-oo
Member

We have noticed what seems to be a hang in certain servicing operations, without any good way to debug what is hung. Spawn a timeout task in a new executor that will panic if we don't dismiss or finish the servicing operation within the time the host gave as a hint.

@chris-oo chris-oo requested a review from a team as a code owner January 22, 2026 19:41
Copilot AI review requested due to automatic review settings January 22, 2026 19:41
@chris-oo chris-oo force-pushed the livedump-servicing-timeout branch from 2129cd7 to fedcc52 Compare January 22, 2026 19:42
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a timeout watchdog mechanism to detect and panic on hung servicing operations in OpenHCL. The implementation spawns a dedicated thread with its own async executor to ensure the timeout fires even if the main threadpool is blocked.

Changes:

  • Added a timeout task that panics if servicing operations exceed the host-provided deadline
  • The timeout is set to 500ms before the deadline to allow time for crash dump transmission
  • The timeout thread and task are automatically cleaned up when servicing completes or returns
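
The drop-to-disarm pattern described above can be sketched with only the standard library (the PR itself uses pal_async's `DefaultPool` and its task handle; the guard struct, channel, and `arm_watchdog` name here are hypothetical stand-ins for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Guard whose drop disarms the watchdog (a hypothetical stand-in for
/// dropping the pal_async task handle in the PR).
struct WatchdogGuard {
    _disarm: mpsc::Sender<()>,
}

/// Spawn a watchdog on a dedicated thread so the timeout can fire even if
/// the main executor is blocked. The JoinHandle is returned only so the
/// behavior is observable in a test.
fn arm_watchdog(timeout: Duration) -> (WatchdogGuard, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        match rx.recv_timeout(timeout) {
            // The guard was dropped: the operation finished; disarm quietly.
            Err(mpsc::RecvTimeoutError::Disconnected) => {}
            // The timeout elapsed first: the operation is presumed hung.
            // (With panic=abort, this would take down the whole process,
            // allowing the existing crash-dump machinery to run.)
            _ => panic!("servicing operation timed out"),
        }
    });
    (WatchdogGuard { _disarm: tx }, handle)
}
```

Dropping the guard closes the channel, which wakes the watchdog thread with `Disconnected` and lets it exit cleanly; this mirrors how dropping the spawned task in the PR dismisses the timeout panic.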

Comment on lines 585 to 613
// Start a servicing timeout thread on its own executor in a new thread.
// Do this to avoid any tasks that may block the current threadpool
// executor, and drop the task when this function returns, which will
// dismiss the timeout panic.
//
// This helps catch issues where save may hang, and allows the existing
// machinery to send the dump to the host when this process crashes.
// Note that we choose not to use a livedump call like the firmware
// watchdog handlers, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
    pal_async::DefaultPool::spawn_on_thread("servicing-timeout-executor");
let _servicing_timeout = driver.clone().spawn("servicing-timeout-task", async move {
    let mut timer = pal_async::timer::PolledTimer::new(&driver);
    // Subtract 500ms from the host-provided timeout hint to allow
    // time for the dump to be sent to the host before termination.
    let duration = deadline
        .checked_duration_since(std::time::Instant::now())
        .map(|d| d.saturating_sub(Duration::from_millis(500)))
        .unwrap_or(Duration::from_secs(0));
    timer.sleep(duration).await;
    tracing::error!(
        CVM_ALLOWED,
        "servicing operation timed out, triggering panic"
    );
    panic!("servicing operation timed out");
});

Contributor


Thanks, this is very helpful!

I think it's worth a discussion of whether this is what we want to use in production. There have been arguments in the past that this should be entirely driven by the root.

As it happens, there is a tricky bug I'm chasing down. This will help me in my debugging, so it's a useful primitive.

Member Author


We could instead gate it behind an ENV config; should we go ahead and do that? FWIW, I don't think this is wrong to do all the time.

// watchdog handlers, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
Contributor


Dropping the thread handle won't kill the thread or anything, right?

Member Author

@chris-oo chris-oo Jan 26, 2026


We're going to drop the task as well (we want it to disarm, since this function finishing means we didn't hang), so it shouldn't matter.

// time for the dump to be sent to the host before termination.
let duration = deadline
    .checked_duration_since(std::time::Instant::now())
    .map(|d| d.saturating_sub(Duration::from_millis(500)))
Contributor


Are we sure that's long enough? Do we not want a longer timeout just to be safe?

Member Author


You mean to subtract additional time off the host-provided one?

Contributor


I'm just worried this is going to trip during successful but slow servicing operations.

Member Author


IMO that's still a bug that we should probably know about. If we're not setting the timeout hint larger on larger SKUs, that's a bug.

@chris-oo
Member Author

Discussed with bhargav offline: we should use 500ms as the default, but allow setting the time (or disabling it) via an ENV arg. Will do that in the next revision.
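
A sketch of what such a knob might look like (the variable name `OPENHCL_SERVICING_TIMEOUT_MS` and its semantics are hypothetical; nothing in this PR defines them). The parsed value overrides the 500ms margin, and "0" disables the watchdog:

```rust
use std::time::Duration;

/// Interpret a hypothetical OPENHCL_SERVICING_TIMEOUT_MS override. `None`
/// input means the variable is unset, so the default 500ms margin applies;
/// "0" disables the watchdog entirely; any other integer sets the margin,
/// and unparseable values fall back to the default.
fn timeout_margin(env_value: Option<&str>) -> Option<Duration> {
    match env_value {
        Some("0") => None, // watchdog explicitly disabled
        Some(v) => match v.parse::<u64>() {
            Ok(ms) => Some(Duration::from_millis(ms)),
            Err(_) => Some(Duration::from_millis(500)), // garbage: keep default
        },
        None => Some(Duration::from_millis(500)), // default margin
    }
}
```

A caller would feed it `std::env::var("OPENHCL_SERVICING_TIMEOUT_MS").ok().as_deref()` and skip spawning the watchdog task entirely when it returns `None`.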
