-
Notifications
You must be signed in to change notification settings - Fork 777
Add visited nodes tracking and parallelism bypass for subworkflow nodes #6875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
sshardool
wants to merge
7
commits into
flyteorg:master
Choose a base branch
from
sshardool:fork-multifix
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…elism limits Adjust task resources to work with local k3d
…rkflow nodes (flyteorg#421) * Add visited nodes tracking and parallelism bypass for workflow nodes in propeller This commit introduces two key changes to FlytePropeller's node execution: 1. Visited Nodes Tracking: - Add VisitedNodes interface and implementation to ExecutionContext - Track workflow nodes (subworkflows/launchplans) using fully-qualified unique IDs to prevent redundant evaluations in nested workflows - Shared across all nested subworkflows within an execution tree 2. Configurable Parallelism Bypass: - Add `bypass-parallelism-check-for-workflow-nodes` config option - Allow already-running workflow nodes to bypass max parallelism limits - Only applies to workflow nodes in non-NotYetStarted phases - Defaults to false for backward compatibility Changes: - flytepropeller/pkg/controller/executors/execution_context.go - flytepropeller/pkg/controller/config/config.go - flytepropeller/pkg/controller/config/config_flags.go - flytepropeller/pkg/controller/nodes/executor.go - flytepropeller/pkg/controller/executors/mocks/execution_context.go Tests: - Add comprehensive unit tests for visited nodes tracking - Add unit tests for parallelism bypass behavior - Add unit tests for ControlFlow and VisitedNodes components - All existing tests passing * Add a comment and fix type in log message Fix test compatibility issues after cherry-picking fix - Update mock method syntax to current codebase conventions - Remove incompatible TestNodeExecutor_RetriesExhaustedErrorKindPreservation test - All controller tests now passing Fix comments and formatting
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6875 +/- ##
==========================================
+ Coverage 56.96% 56.99% +0.03%
==========================================
Files 929 929
Lines 58152 58195 +43
==========================================
+ Hits 33125 33170 +45
+ Misses 21985 21981 -4
- Partials 3042 3044 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Fix spelling: inidicator -> indicator - Fix spelling: potentally -> potentially - Regenerate config_flags_test.go for new bypass-parallelism-check-for-workflow-nodes flag
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Tracking issue
Fixes #6873
Why are the changes needed?
Problem: In deeply nested subworkflow executions, workflow nodes can get stuck or experience significant slowdowns due to:
Redundant evaluations: The same workflow node being evaluated multiple times within a single execution round when it's already running
Parallelism bottlenecks: Already-running workflow nodes (subworkflows/launchplans) being blocked by max parallelism limits, preventing progress on child nodes even though the parent workflow has already consumed its parallelism slot.
Impact:
What changes were proposed in this pull request?
1. Visited Nodes Tracking:
VisitedNodesinterface toControlFlowinExecutionContextcommon.GenerateUniqueID)ControlFlow)NodeStatusRunningwithout re-evaluation2. Configurable Parallelism Bypass:
node-config.bypass-parallelism-check-for-workflow-nodes(default:false)Running,Queued,Succeeding, orFailingphases bypass the checkNotYetStartedphase are still blocked (must wait for parallelism slot)3. Additional Fix:
RecursiveNodeHandlerwherereturn status, nilshould have beenreturn status, errHow was this patch tested?
Multiple reproduction scenarios are included in the draft PR, all of which resulted in stuck workflow node. These test workflows also expose the issues with the max-parallelism handling overall.
Stuck workflows recovered immediately after deployment and newly launched flows are not stuck any more since the fix has been deployed.
Several unit tests were added as well.
Labels
Please add one or more of the following labels to categorize your PR:
bypass-parallelism-check-for-workflow-nodesThis is important to improve the readability of release notes.
Setup process
Example workflows that be be registered and executed with pyflyte:
https://github.com/sshardool/flyte/blob/4feb41fd0036dde78009bc9962ef269546e54f08/workflows/max-parallel/max_parallelism_workflows.py
Screenshots
Screenshots included in #6873
Check all the applicable boxes