
Conversation

@SimonRichardson (Member) commented Jan 16, 2025

Allow the runner to back off workers depending on the number of restarts. If a given worker in the runner is starting to thrash, we want to be able to back that worker off. Because the runner and its workers are long-lived (or at least expected to be), the attempt count could increase over time, leading to extended back-off times purely because of the high attempt count. To help solve that, a RestartDelayPeriod has been added, which resets the attempt count to zero if the last attempt happened more than that period before the current time. A restarting worker therefore gets a timed window of 10 restarts over a given time frame.

Additionally, the last error the worker died with is now passed along, which will be useful if certain errors need to be handled differently in terms of restarts.
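As a rough sketch of the intended reset-plus-back-off behaviour (the function signature, parameter names, and the exponential shape of the delay below are illustrative assumptions, not the final API):

```go
package runner

import (
	"math"
	"time"
)

// nextDelay sketches the behaviour described above: if the previous restart
// is older than restartDelayPeriod, the attempt count resets to zero before
// the delay is computed. All names and the doubling policy are assumptions.
func nextDelay(attempt int, lastRestart, now time.Time,
	baseDelay, maxDelay, restartDelayPeriod time.Duration) (time.Duration, int) {

	// Restarts that happened long ago should not inflate the back-off.
	if !lastRestart.IsZero() && now.Sub(lastRestart) > restartDelayPeriod {
		attempt = 0
	}

	// Exponential back-off capped at maxDelay.
	delay := time.Duration(float64(baseDelay) * math.Pow(2, float64(attempt)))
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay, attempt + 1
}
```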

A follow-on PR will expose metrics for the restarts of a given worker, by name, to help identify potential issues.


This will require a v5 of the worker package; that will be done if this is approved.

SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Apr 4, 2025
Reworks the model worker manager so that it doesn't request the
services for a model on every restart of the worker. If the worker
is thrashing and restarting, we don't want to keep fetching the
domain services, as that causes other parts of the system to
thrash. Let the model worker thrash on its own.

It would help if the following worker patch [1] landed, as we
could then tell the model worker to back off when this happens.

 1. juju/worker#41
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Apr 4, 2025
Reworks the model worker manager so that it doesn't request the
services for a model on every restart of the worker. If the worker
is thrashing and restarting, we don't want to keep fetching the
domain services, as that causes other parts of the system to
thrash. Let the model worker thrash on its own.

It would help if the following worker patch [1] landed, as we
could then tell the model worker to back off when this happens.

 1. juju/worker#41
Allow the runner to back off workers depending on the number of
restarts. If a given worker in the runner is starting to thrash,
we want to be able to back that worker off. Because the runner and
its workers are long-lived (or at least expected to be), the
attempts could increase over time, which could lead to extended
back-off times. To help solve that, a restarted time is passed,
which will be a zero time on first start, but allows the method to
account for the time elapsed between restart attempts.
Additionally, the last error the worker died with will be useful
if there are certain errors that need to be handled differently in
terms of restarts.

A follow-on PR will be added to expose the metrics of the restarts
of a given worker, by name, to identify potential issues.

The restart delay period should allow the attempt count to be
reset if the worker is restarted after that time. This fixes the
issue where we might see the attempt count climbing even though
the restarts are spread over a long period of time.

This is backwards compatible with the v4 version, as the restart
delay is a constant, so the attempt is discarded.
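For the backwards-compatibility point above, a minimal sketch of how a constant v4-style delay can be adapted to the new shape; the DelayFunc signature is inferred from the snippets quoted in the review below and may not match the final API:

```go
package runner

import "time"

// DelayFunc is an assumed shape for the restart-delay callback: the current
// attempt count and the error the worker last died with go in, and the time
// to wait before restarting comes out.
type DelayFunc func(attempt int, lastErr error) time.Duration

// constDelay adapts a fixed, v4-style restart delay to that shape. It
// discards the attempt count and error, so the old constant back-off
// behaviour is preserved.
func constDelay(d time.Duration) DelayFunc {
	return func(int, error) time.Duration { return d }
}
```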
@jameinel (Member) left a comment

I really like what this is doing, but it would be good to see the parameter names, etc., match between the DependencyEngine code and the Runner code.

workerInfo.attempts++
}
delay := workerInfo.restartDelay(workerInfo.attempts, info.err)
workerInfo.restarted = now
Is this what we are doing in the dependency engine?

if info.startedTime.IsZero() || timeSinceStarted < engine.config.BackoffResetTime {

That says we do have a window within which we treat failures as active restarts rather than as a new error.

If we are going with your DelayFunc, maybe we should align it with the DependencyEngine, since it is in this same package:

if delay > time.Duration(0) {

We could factor that code out into our preferred DelayFunc and then pass the function to the DependencyEngine as well as the Runner.
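The window check quoted from the engine could be expressed as a small helper along these lines; BackoffResetTime and startedTime come from the snippet above, the rest is an illustrative sketch:

```go
package dependency

import "time"

// withinActiveRestartWindow mirrors the engine check quoted above: a zero
// started time (never ran) or a restart within backoffResetTime is treated
// as part of an active run of restarts rather than a fresh failure.
func withinActiveRestartWindow(startedTime, now time.Time, backoffResetTime time.Duration) bool {
	return startedTime.IsZero() || now.Sub(startedTime) < backoffResetTime
}
```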

 Name: "test",
 IsFatal: noneFatal,
-RestartDelay: time.Second,
+RestartDelay: constDelay(time.Second),
Do we want to reuse the RestartDelay parameter and have to bump the version, or would we want to add a RestartDelayFunc parameter so that we don't break it in place?
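A hedged sketch of the non-breaking option, where an optional RestartDelayFunc takes precedence over the existing RestartDelay constant; the struct and method names here are illustrative, only the two field names come from this thread:

```go
package runner

import "time"

// runnerConfig is an illustrative stand-in for the runner's configuration.
type runnerConfig struct {
	// RestartDelay is the existing v4-style constant delay.
	RestartDelay time.Duration
	// RestartDelayFunc, if set, computes the delay from the attempt count
	// and the last error, as proposed in this thread.
	RestartDelayFunc func(attempt int, lastErr error) time.Duration
}

// restartDelay prefers the function when it is provided, so existing
// callers that only set RestartDelay keep their current behaviour.
func (c runnerConfig) restartDelay(attempt int, lastErr error) time.Duration {
	if c.RestartDelayFunc != nil {
		return c.RestartDelayFunc(attempt, lastErr)
	}
	return c.RestartDelay
}
```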

