Clear past errors from workflow state #4624
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
This reverts commit dab428d. Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
…rrors Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4624       +/-   ##
==========================================
- Coverage   58.20%   58.07%    -0.14%
==========================================
  Files         626      476      -150
  Lines       53800    38119    -15681
==========================================
- Hits        31316    22138     -9178
+ Misses      19976    14056     -5920
+ Partials     2508     1925      -583
It looks like I'm missing a little bit of code coverage. I'll try to find some time to fix that.
hamersaw
left a comment
If I understand this logic, the goal is to delete all the errors but the last one, right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering? Or is there a scenario here where we delete all of the error messages? For example, say we have two nodes (n0 and n1). If the first time we iterate over these the order is n0, n1, then we clear the error from n0. If the second time we iterate n1, n0, then we clear the error from n1 and have just cleared all of our errors.
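To make the ordering scenario concrete, here is a minimal sketch of it. It is not the flytepropeller code: `nodeStatus`, `keepLastError`, and `remainingErrors` are hypothetical names, and the sketch assumes a node stays marked as failed after its error has been cleared. It only shows what happens if two passes visit the same failed nodes in a different order; whether the real traversal order can change between passes is exactly the open question.

```go
package main

import "fmt"

// nodeStatus is a hypothetical stand-in for a node's status in the workflow state.
type nodeStatus struct {
	failed bool
	err    string // empty once the error has been cleared
}

// keepLastError visits nodes in the given order. Each time it reaches a failed
// node it clears the error of the previously tracked failed node, so only the
// last failed node visited keeps its error. It returns that node's ID.
func keepLastError(order []string, nodes map[string]*nodeStatus) string {
	last := ""
	for _, id := range order {
		n := nodes[id]
		if !n.failed {
			continue
		}
		if last != "" {
			nodes[last].err = ""
		}
		last = id
	}
	return last
}

// remainingErrors counts the nodes that still carry an error message.
func remainingErrors(nodes map[string]*nodeStatus) int {
	count := 0
	for _, n := range nodes {
		if n.err != "" {
			count++
		}
	}
	return count
}

func main() {
	nodes := map[string]*nodeStatus{
		"n0": {failed: true, err: "n0 failed"},
		"n1": {failed: true, err: "n1 failed"},
	}

	// First pass in the order n0, n1: n0 is cleared, n1 keeps its error.
	keepLastError([]string{"n0", "n1"}, nodes)
	fmt.Println(remainingErrors(nodes)) // prints 1

	// Second pass in the order n1, n0: n1 is cleared, and n0 was already empty.
	keepLastError([]string{"n1", "n0"}, nodes)
	fmt.Println(remainingErrors(nodes)) // prints 0
}
```

If the order is the same on every pass, the same node is tracked last each time and one error always survives, so a stable traversal order would be enough to rule this scenario out in the sketch.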
// Keep track of the last failed state in the loop since it'll be the one to return.
// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
if executableNodeStatusOnComplete != nil {
	c.nodeExecutor.Clear(executableNodeStatusOnComplete)
I think it makes sense to add the enableCRDebugMetadata bool argument here to the ClearExecutionError function on the MutableNodeStatus interface. Then this call more closely reflects the UpdatePhase call above and we can remove adding a Clear function to the nodeExecutor struct and similarly the NodeExecutor interface. Thoughts?
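As a rough illustration of the shape this suggestion could take, here is a sketch using simplified stand-in types. The `mutableNodeStatus` interface and `nodeStatus` struct below are hypothetical and far smaller than the real ones; only the method name `ClearExecutionError` and the `enableCRDebugMetadata` argument come from the comment above, and the behaviour of keeping the error when the flag is true is an assumption based on the PR description.

```go
package main

import "fmt"

// mutableNodeStatus is a trimmed-down, hypothetical stand-in for the
// MutableNodeStatus interface; the real interface has many more methods.
type mutableNodeStatus interface {
	ClearExecutionError(enableCRDebugMetadata bool)
}

// nodeStatus is a hypothetical status record holding only what the sketch needs.
type nodeStatus struct {
	Phase string
	Error string // empty means no error is stored
}

// ClearExecutionError drops the stored error unless debug metadata is enabled,
// in which case the error is kept (assumed to match the previous behaviour).
func (s *nodeStatus) ClearExecutionError(enableCRDebugMetadata bool) {
	if enableCRDebugMetadata {
		return
	}
	s.Error = ""
}

// Compile-time check that *nodeStatus satisfies the sketched interface.
var _ mutableNodeStatus = (*nodeStatus)(nil)

func main() {
	kept := &nodeStatus{Phase: "Failed", Error: "task failed"}
	kept.ClearExecutionError(true) // debug metadata on: error is kept
	fmt.Printf("%q\n", kept.Error)

	cleared := &nodeStatus{Phase: "Failed", Error: "task failed"}
	cleared.ClearExecutionError(false) // debug metadata off: error is dropped
	fmt.Printf("%q %s\n", cleared.Error, cleared.Phase)
}
```

With the flag passed at the call site, the caller would invoke the method directly on the node status, and no extra Clear method would be needed on the executor in this sketch.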
I've never used golang before I started using flyte, so I have little opinion on how the interfaces are organised. I will try to implement it as you suggested.
That's a good question. I guess I assumed that
Unfortunately I've been very distracted from this recently but I do plan to come back to it.
Cleaning stale PRs. Please reopen if you want to discuss this further.
Tracking issue
#4569
Why are the changes needed?
Reduce un-needed information stored in etcd when using failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. This allows flyte to scale to larger workflows before hitting etcd size limits.
What changes were proposed in this pull request?
- node-config.enable-cr-debug-metadata config option. Set this to true to restore the previous behaviour.
- FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. Without this the workflow will fail as soon as there is one failure so there can never be more than one error regardless of this PR.
- enable-cr-debug-metadata config option.
- TestWorkflowExecutor_HandleFlyteWorkflow_Failing to have sub test cases for combinations of FAIL_AFTER_EXECUTABLE_NODES_COMPLETE and enable-cr-debug-metadata
How was this patch tested?
Updated unittests
I have been running something very similar to this in our prod deployment for some time.
Related PRs
Follow up to #4596