feat: restart delay function #41
base: master
Conversation
Reworks the model worker manager so that it doesn't request the services for a model on every restart of the worker. If the worker is thrashing and restarting, we don't want to keep fetching the domain services, as that causes other parts of the system to thrash; let the model worker thrash on its own. It would help if the following worker patch[1] were landed, as we could then tell the model worker to back off when this happens.

1. juju/worker#41
Allow the runner to back off workers depending on the number of restarts. If a given worker in the runner is starting to thrash, we want to be able to back that worker off. Because the runner and its workers are long lived (or at least expected to be), the attempts could increase over time, which could lead to extended back-off times. To help solve that, a restarted time is passed, which will be the zero time on first start, but allows the method to account for the time elapsed between restart attempts. Additionally, the last error the worker died with will be useful if there are certain errors that need to be handled differently in terms of restarts. A follow-on PR will be added to expose metrics for the restarts of a given worker, by name, to identify potential issues.
The restart delay period should allow the attempt count to be reset if the worker is restarted after that time. This fixes the issue where the attempt count might keep climbing even though the restarts are spread over a long period of time. This is backwards compatible with the v4 version: there the restart delay is a constant, so the attempt count is discarded.
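For context, a minimal sketch of what such a delay function could look like. Only the (attempts, last error) shape is taken from the code in the review below; the DelayFunc name, the package name, and the exponential implementation are assumptions for illustration, not the PR's final API.

```go
package runner

import "time"

// DelayFunc decides how long the runner waits before restarting a dead
// worker, given the number of consecutive restart attempts and the error
// the worker last died with. (Hypothetical name; signature mirrors the
// call shown in the diff further down.)
type DelayFunc func(attempt int, lastErr error) time.Duration

// exponentialDelay doubles a base delay per attempt, capped at max.
// Purely an example implementation.
func exponentialDelay(base, max time.Duration) DelayFunc {
	return func(attempt int, lastErr error) time.Duration {
		delay := base
		for i := 1; i < attempt; i++ {
			delay *= 2
			if delay >= max {
				return max
			}
		}
		return delay
	}
}
```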
jameinel left a comment
I really like what this is doing, but it would be good to see the parameter names, etc, match between the DependencyEngine code and the Runner code.
        workerInfo.attempts++
    }
    delay := workerInfo.restartDelay(workerInfo.attempts, info.err)
    workerInfo.restarted = now
Is this what we are doing in the dependency engine?
Line 642 in 26162d7:

    if info.startedTime.IsZero() || timeSinceStarted < engine.config.BackoffResetTime {
This says that we do have a window within which we consider these to be active restarts rather than a new error.
If we are going with your DelayFunc, maybe we should align that with the DependencyEngine since it is in this same package:
Line 415 in 26162d7:

    if delay > time.Duration(0) {
We could factor out that code into our preferred DelayFunc and then pass the function to the DependencyEngine as well as the Runner.
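A rough sketch of what that factored-out step might look like. The helper name and the abort-channel parameter are hypothetical; only the "wait only when the delay is positive" guard mirrors the engine line quoted above.

```go
package runner

import (
	"time"

	"github.com/juju/clock"
)

// waitForRestart honours a computed restart delay, returning false if the
// abort channel fires before the delay elapses. Both the engine and the
// Runner could share this step once the delay itself comes from a DelayFunc.
func waitForRestart(clk clock.Clock, abort <-chan struct{}, delay time.Duration) bool {
	if delay <= 0 {
		return true
	}
	select {
	case <-abort:
		return false
	case <-clk.After(delay):
		return true
	}
}
```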
    	Name:         "test",
    	IsFatal:      noneFatal,
    -	RestartDelay: time.Second,
    +	RestartDelay: constDelay(time.Second),
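Presumably constDelay is a small adapter that wraps the old fixed duration in the new delay-function signature, preserving the v4 behaviour of ignoring the attempt count and last error. A sketch, reusing the DelayFunc type outlined earlier:

```go
// constDelay always returns the same delay, regardless of attempts or
// the error the worker died with. (Illustrative; not the PR's actual code.)
func constDelay(d time.Duration) DelayFunc {
	return func(attempt int, lastErr error) time.Duration {
		return d
	}
}
```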
Do we want to reuse the parameter RestartDelay and have to bump the version, or would we want to add a RestartDelayFunc parameter and then we don't break it in-place?
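One way the non-breaking option could look, assuming the constDelay adapter and DelayFunc type sketched above; the extra parameter and helper names here are hypothetical, not anything proposed in the PR itself.

```go
// effectiveDelayFunc prefers an optional RestartDelayFunc parameter when it
// is set and otherwise falls back to the existing constant RestartDelay,
// so the runner's config stays compatible in-place.
func effectiveDelayFunc(restartDelay time.Duration, restartDelayFunc DelayFunc) DelayFunc {
	if restartDelayFunc != nil {
		return restartDelayFunc
	}
	return constDelay(restartDelay)
}
```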
Allow the runner to back off workers depending on the number of restarts. If a given worker in the runner is starting to thrash, we want to be able to back that worker off. Because the runner and its workers are long lived (or at least expected to be), the attempts could increase over time, leading to extended back-off times because of the high attempt count. To help solve that, a RestartDelayPeriod has been added, which resets the attempts to zero if the last attempt is more than that period before the current time. A restarting worker will thus get a timed window of 10 restarts over a given time frame.

Additionally, the last error the worker died with will be useful if there are certain errors that need to be handled differently in terms of restarts.
A follow-on PR will be added to expose metrics for the restarts of a given worker, by name, to identify potential issues.
This will require a v5 of the worker package; that will be done if this is approved.
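A sketch of the RestartDelayPeriod behaviour described above, assuming the runner tracks the last restart time per worker; all names here are illustrative rather than the PR's actual code.

```go
// nextDelay resets the attempt counter when the previous restart happened
// more than `period` ago, then consults the DelayFunc with the updated
// attempt count and the error the worker last died with.
func nextDelay(attempts *int, lastRestart *time.Time, period time.Duration,
	delayFn DelayFunc, lastErr error, now time.Time) time.Duration {

	if !lastRestart.IsZero() && now.Sub(*lastRestart) > period {
		*attempts = 0
	}
	*attempts++
	*lastRestart = now
	return delayFn(*attempts, lastErr)
}
```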