openhcl: trigger panic on servicing timeout #2672
Conversation
Force-pushed from 2129cd7 to fedcc52
Pull request overview
This PR adds a timeout watchdog mechanism to detect and panic on hung servicing operations in OpenHCL. The implementation spawns a dedicated thread with its own async executor to ensure the timeout fires even if the main threadpool is blocked.
Changes:
- Added a timeout task that panics if servicing operations exceed the host-provided deadline
- The timeout is set to 500ms before the deadline to allow time for crash dump transmission
- The timeout thread and task are automatically cleaned up when servicing completes or returns
Force-pushed from fedcc52 to 4635b2c
```rust
// Start a servicing timeout thread on its own executor in a new thread.
// Do this to avoid any tasks that may block the current threadpool
// executor, and drop the task when this function returns, which will
// dismiss the timeout panic.
//
// This helps catch issues where save may hang, and allows the existing
// machinery to send the dump to the host when this process crashes.
// Note that we choose not to use a livedump call like the firmware
// watchdog handlers do, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
    pal_async::DefaultPool::spawn_on_thread("servicing-timeout-executor");
let _servicing_timeout = driver.clone().spawn("servicing-timeout-task", async move {
    let mut timer = pal_async::timer::PolledTimer::new(&driver);
    // Subtract 500ms from the host-provided timeout hint to allow
    // time for the dump to be sent to the host before termination.
    let duration = deadline
        .checked_duration_since(std::time::Instant::now())
        .map(|d| d.saturating_sub(Duration::from_millis(500)))
        .unwrap_or(Duration::from_secs(0));
    timer.sleep(duration).await;
    tracing::error!(
        CVM_ALLOWED,
        "servicing operation timed out, triggering panic"
    );
    panic!("servicing operation timed out");
});
```
Thanks, this is very helpful!
I think it's worth a discussion of whether this is what we want to use in production. There have been arguments in the past that this should be entirely driven by the root.
Of course, there is a tricky bug I'm chasing down. This will help me in my debugging, so it's a useful primitive.
We could instead gate it behind an ENV config; should we go ahead and do that? FWIW, I don't think this is wrong to do all the time.
```rust
// watchdog handlers do, as we expect any hang to be a fatal error for
// OpenHCL, whereas the firmware watchdog is a failure inside the guest,
// not necessarily inside OpenHCL.
let (_servicing_timeout_thread, driver) =
```
Dropping the thread handle won't kill the thread or something, right?
we're going to drop the task as well (we want it to disarm since this function finishing means we didn't hang), so it shouldn't matter.
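For context, a minimal sketch of the disarm-on-drop pattern being described here (this assumes, as with async-task-style executors, that dropping the `Task` handle cancels the pending future; `do_servicing` is a hypothetical stand-in for the real save path):

```rust
use std::time::Instant;

fn service_with_watchdog(deadline: Instant) {
    // Watchdog gets its own thread + executor so a blocked threadpool
    // can't starve it.
    let (_watchdog_thread, driver) =
        pal_async::DefaultPool::spawn_on_thread("watchdog");
    let _watchdog = driver.clone().spawn("watchdog-task", async move {
        let mut timer = pal_async::timer::PolledTimer::new(&driver);
        timer
            .sleep(deadline.saturating_duration_since(Instant::now()))
            .await;
        panic!("servicing operation timed out");
    });

    do_servicing(); // hypothetical: the actual save/servicing work

    // Both handles drop here. Dropping `_watchdog` cancels the pending
    // timeout future (disarming the panic); dropping `_watchdog_thread`
    // only releases the handle, it does not kill the thread.
}
```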
```rust
// time for the dump to be sent to the host before termination.
let duration = deadline
    .checked_duration_since(std::time::Instant::now())
    .map(|d| d.saturating_sub(Duration::from_millis(500)))
```
Are we sure that's long enough? Do we not want a longer timeout just to be safe?
You mean to subtract additional time off the host-provided one?
I'm just worried this is going to trip during successful but slow servicing operations.
IMO that's still a bug that we should probably know about. If we're not setting the timeout hint larger on larger SKUs, that's a bug.
Discussed with bhargav offline: we should use 500ms as the default, but allow setting the time (or disabling it) via an ENV arg; will do that in the next revision.
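A minimal sketch of what that gating might look like (the variable name `OPENHCL_SERVICING_TIMEOUT_GRACE_MS` and its "off"/0 semantics are illustrative assumptions, not something this PR defines):

```rust
use std::time::Duration;

/// Grace period to subtract from the host-provided deadline, or `None`
/// to disable the watchdog entirely. Unset => the 500ms default.
/// The env var name and "off"/0 conventions here are hypothetical.
fn servicing_timeout_grace() -> Option<Duration> {
    const DEFAULT: Duration = Duration::from_millis(500);
    match std::env::var("OPENHCL_SERVICING_TIMEOUT_GRACE_MS") {
        Err(_) => Some(DEFAULT),
        Ok(v) if v.eq_ignore_ascii_case("off") => None,
        Ok(v) => match v.parse::<u64>() {
            Ok(0) => None,
            Ok(ms) => Some(Duration::from_millis(ms)),
            Err(_) => Some(DEFAULT),
        },
    }
}
```

The spawn site would then arm the watchdog only when this returns `Some(grace)`, subtracting `grace` from the deadline instead of the hard-coded 500ms.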
We have noticed what seems to be a hang in certain servicing operations, without any good way to debug what is hung. Spawn a timeout task on a new executor that will panic if we don't dismiss or finish the servicing operation within the time the host gave as a hint.