
Conversation

@SimonRichardson (Member) commented Jan 16, 2025

Allow the runner to back off workers depending on the number of restarts. If a given worker in the runner is starting to thrash, we want to be able to back that worker off. Because the runner and its workers are long-lived (or at least expected to be), the attempt count could increase over time, leading to extended back-off times purely because of the high attempt count. To help solve that, a RestartDelayPeriod has been added, which resets the attempt count to zero if the last attempt happened more than that period before the current time. A restarting worker therefore gets a timed window of 10 restarts over a given time frame.

Additionally, the last error the worker died with is now passed along, which will be useful if certain errors need to be handled differently in terms of restarts.
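As a rough sketch of the intended reset-plus-back-off behaviour (the function signature, parameter names, and the exponential shape of the delay below are illustrative assumptions, not the final API):

```go
package runner

import (
	"math"
	"time"
)

// nextDelay sketches the behaviour described above: if the previous restart
// is older than restartDelayPeriod, the attempt count resets to zero before
// the delay is computed. All names and the doubling policy are assumptions.
func nextDelay(attempt int, lastRestart, now time.Time,
	baseDelay, maxDelay, restartDelayPeriod time.Duration) (time.Duration, int) {

	// Restarts that happened long ago should not inflate the back-off.
	if !lastRestart.IsZero() && now.Sub(lastRestart) > restartDelayPeriod {
		attempt = 0
	}

	// Exponential back-off capped at maxDelay.
	delay := time.Duration(float64(baseDelay) * math.Pow(2, float64(attempt)))
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay, attempt + 1
}
```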

A follow-on PR will expose metrics for the restarts of a given worker, by name, to help identify potential issues.


This will require a v5 of the worker package; that will be done if this is approved.

SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Apr 4, 2025
Reworks the model worker manager so that it doesn't request the
services for a model on every restart of the worker. If the worker
is thrashing and restarting, we don't want to keep fetching the
domain services, as that causes other parts of the system to
thrash. Let the model worker thrash on its own.

It would help if the following worker patch [1] landed, as we
could then tell the model worker to back off when this happens.

 1. juju/worker#41
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Apr 4, 2025
Reworks the model worker manager so that it doesn't request the
services for a model on every restart of the worker. If the worker
is thrashing and restarting, we don't want to keep fetching the
domain services, as that causes other parts of the system to
thrash. Let the model worker thrash on its own.

It would help if the following worker patch [1] landed, as we
could then tell the model worker to back off when this happens.

 1. juju/worker#41
Allow the runner to back off workers depending on the number of
restarts. If a given worker in the runner is starting to thrash,
we want to be able to back that worker off. Because the runner and
its workers are long-lived (or at least expected to be), the
attempts could increase over time, which could lead to extended
back-off times. To help solve that, a restarted time is passed,
which will be a zero time on first start, but allows the method to
account for the time elapsed between restart attempts.
Additionally, the last error the worker died with will be useful
if there are certain errors that need to be handled differently in
terms of restarts.

A follow-on PR will be added to expose the metrics of the restarts
of a given worker, by name, to identify potential issues.

The restart delay period should allow the attempt count to be
reset if the worker is restarted after that time. This fixes the
issue where we might see the attempt count climbing even though
the restarts are spread over a long period of time.

This is backwards compatible with the v4 version, as the restart
delay is a constant, so the attempt is discarded.
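For the backwards-compatibility point above, a minimal sketch of how a constant v4-style delay can be adapted to the new shape; the DelayFunc signature is inferred from the snippets quoted in the review below and may not match the final API:

```go
package runner

import "time"

// DelayFunc is an assumed shape for the restart-delay callback: the current
// attempt count and the error the worker last died with go in, and the time
// to wait before restarting comes out.
type DelayFunc func(attempt int, lastErr error) time.Duration

// constDelay adapts a fixed, v4-style restart delay to that shape. It
// discards the attempt count and error, so the old constant back-off
// behaviour is preserved.
func constDelay(d time.Duration) DelayFunc {
	return func(int, error) time.Duration { return d }
}
```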
@jameinel (Member) left a comment

I really like what this is doing, but it would be good to see the parameter names, etc., match between the DependencyEngine code and the Runner code.

workerInfo.attempts++
}
delay := workerInfo.restartDelay(workerInfo.attempts, info.err)
workerInfo.restarted = now
Is this what we are doing in the dependency engine?

if info.startedTime.IsZero() || timeSinceStarted < engine.config.BackoffResetTime {

That says we do have a window within which we treat failures as active restarts rather than as a new error.

If we are going with your DelayFunc, maybe we should align it with the DependencyEngine, since it is in this same package:

if delay > time.Duration(0) {

We could factor that code out into our preferred DelayFunc and then pass the function to the DependencyEngine as well as the Runner.
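The window check quoted from the engine could be expressed as a small helper along these lines; BackoffResetTime and startedTime come from the snippet above, the rest is an illustrative sketch:

```go
package dependency

import "time"

// withinActiveRestartWindow mirrors the engine check quoted above: a zero
// started time (never ran) or a restart within backoffResetTime is treated
// as part of an active run of restarts rather than a fresh failure.
func withinActiveRestartWindow(startedTime, now time.Time, backoffResetTime time.Duration) bool {
	return startedTime.IsZero() || now.Sub(startedTime) < backoffResetTime
}
```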

 Name: "test",
 IsFatal: noneFatal,
-RestartDelay: time.Second,
+RestartDelay: constDelay(time.Second),
Do we want to reuse the RestartDelay parameter and have to bump the version, or would we want to add a RestartDelayFunc parameter so that we don't break it in place?
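A hedged sketch of the non-breaking option, where an optional RestartDelayFunc takes precedence over the existing RestartDelay constant; the struct and method names here are illustrative, only the two field names come from this thread:

```go
package runner

import "time"

// runnerConfig is an illustrative stand-in for the runner's configuration.
type runnerConfig struct {
	// RestartDelay is the existing v4-style constant delay.
	RestartDelay time.Duration
	// RestartDelayFunc, if set, computes the delay from the attempt count
	// and the last error, as proposed in this thread.
	RestartDelayFunc func(attempt int, lastErr error) time.Duration
}

// restartDelay prefers the function when it is provided, so existing
// callers that only set RestartDelay keep their current behaviour.
func (c runnerConfig) restartDelay(attempt int, lastErr error) time.Duration {
	if c.RestartDelayFunc != nil {
		return c.RestartDelayFunc(attempt, lastErr)
	}
	return c.RestartDelay
}
```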

