Conversation

@chris-oo
Member

We have noticed what seems to be a hang in certain servicing operations, without any good way to debug what is hung. Spawn a timeout task in a new executor that will panic if we don't dismiss or finish the servicing operation within the time the host gave as a hint.

@chris-oo chris-oo requested a review from a team as a code owner January 22, 2026 19:41
Copilot AI review requested due to automatic review settings January 22, 2026 19:41
@chris-oo chris-oo force-pushed the livedump-servicing-timeout branch from 2129cd7 to fedcc52 Compare January 22, 2026 19:42
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a timeout watchdog mechanism to detect and panic on hung servicing operations in OpenHCL. The implementation spawns a dedicated thread with its own async executor to ensure the timeout fires even if the main threadpool is blocked.

Changes:

  • Added a timeout task that panics if servicing operations exceed the host-provided deadline
  • The timeout is set to 500ms before the deadline to allow time for crash dump transmission
  • The timeout thread and task are automatically cleaned up when servicing completes or returns
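
The drop-to-disarm pattern described above can be sketched with only the standard library (the PR itself uses pal_async's `DefaultPool` and its task handle; the guard struct, channel, and `arm_watchdog` name here are hypothetical stand-ins for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Guard whose drop disarms the watchdog (a hypothetical stand-in for
/// dropping the pal_async task handle in the PR).
struct WatchdogGuard {
    _disarm: mpsc::Sender<()>,
}

/// Spawn a watchdog on a dedicated thread so the timeout can fire even if
/// the main executor is blocked. The JoinHandle is returned only so the
/// behavior is observable in a test.
fn arm_watchdog(timeout: Duration) -> (WatchdogGuard, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        match rx.recv_timeout(timeout) {
            // The guard was dropped: the operation finished; disarm quietly.
            Err(mpsc::RecvTimeoutError::Disconnected) => {}
            // The timeout elapsed first: the operation is presumed hung.
            // (With panic=abort, this would take down the whole process,
            // allowing the existing crash-dump machinery to run.)
            _ => panic!("servicing operation timed out"),
        }
    });
    (WatchdogGuard { _disarm: tx }, handle)
}
```

Dropping the guard closes the channel, which wakes the watchdog thread with `Disconnected` and lets it exit cleanly; this mirrors how dropping the spawned task in the PR dismisses the timeout panic.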

Comment on lines 585 to 613
// Start a servicing timeout thread on its own executor in a new thread.
// Do this to avoid any tasks that may block the current threadpool
// executor, and drop the task when this function returns, which will
// dismiss the timeout panic.
//
// This helps catch issues where save may hang, and allows the existing
// machinery to send the dump to the host when this process crashes.
// Note that we choose not to use a livedump call like the firmware
// watchdog handlers, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
    pal_async::DefaultPool::spawn_on_thread("servicing-timeout-executor");
let _servicing_timeout = driver.clone().spawn("servicing-timeout-task", async move {
    let mut timer = pal_async::timer::PolledTimer::new(&driver);
    // Subtract 500ms from the host-provided timeout hint to allow
    // time for the dump to be sent to the host before termination.
    let duration = deadline
        .checked_duration_since(std::time::Instant::now())
        .map(|d| d.saturating_sub(Duration::from_millis(500)))
        .unwrap_or(Duration::from_secs(0));
    timer.sleep(duration).await;
    tracing::error!(
        CVM_ALLOWED,
        "servicing operation timed out, triggering panic"
    );
    panic!("servicing operation timed out");
});

Contributor


Thanks, this is very helpful!

I think it's worth a discussion of whether this is what we want to use in production. There have been arguments in the past that this should be entirely driven by the root.

As it happens, there is a tricky bug I'm chasing down. This will help me in my debugging, so it's a useful primitive.

Member Author


We could instead gate it behind an ENV config; should we go ahead and do that? FWIW, I don't think this is wrong to do all the time.

// watchdog handlers, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
Contributor


Dropping the thread handle won't kill the thread or anything, right?

Member Author

@chris-oo chris-oo Jan 26, 2026


We're going to drop the task as well (we want it to disarm, since this function finishing means we didn't hang), so it shouldn't matter.

// time for the dump to be sent to the host before termination.
let duration = deadline
    .checked_duration_since(std::time::Instant::now())
    .map(|d| d.saturating_sub(Duration::from_millis(500)))
Contributor


Are we sure that's long enough? Do we not want a longer timeout just to be safe?

Member Author


You mean to subtract additional time off the host-provided one?

Contributor


I'm just worried this is going to trip during successful but slow servicing operations.

Member Author


IMO that's still a bug that we should probably know about. If we're not setting the timeout hint larger on larger SKUs, that's a bug.

@chris-oo
Member Author

Discussed with bhargav offline: we should use 500ms as the default, but allow setting the time (or disabling it) via an ENV arg. Will do that in the next revision.
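
A sketch of what such a knob might look like (the variable name `OPENHCL_SERVICING_TIMEOUT_MS` and its semantics are hypothetical; nothing in this PR defines them). The parsed value overrides the 500ms margin, and "0" disables the watchdog:

```rust
use std::time::Duration;

/// Interpret a hypothetical OPENHCL_SERVICING_TIMEOUT_MS override. `None`
/// input means the variable is unset, so the default 500ms margin applies;
/// "0" disables the watchdog entirely; any other integer sets the margin,
/// and unparseable values fall back to the default.
fn timeout_margin(env_value: Option<&str>) -> Option<Duration> {
    match env_value {
        Some("0") => None, // watchdog explicitly disabled
        Some(v) => match v.parse::<u64>() {
            Ok(ms) => Some(Duration::from_millis(ms)),
            Err(_) => Some(Duration::from_millis(500)), // garbage: keep default
        },
        None => Some(Duration::from_millis(500)), // default margin
    }
}
```

A caller would feed it `std::env::var("OPENHCL_SERVICING_TIMEOUT_MS").ok().as_deref()` and skip spawning the watchdog task entirely when it returns `None`.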
