@sshardool
Contributor

Tracking issue

Fixes #6873

Why are the changes needed?

Problem: In deeply nested subworkflow executions, workflow nodes can get stuck or slow down significantly because of:

  • Redundant evaluations: the same workflow node is evaluated multiple times within a single execution round even though it is already running.
  • Parallelism bottlenecks: already-running workflow nodes (subworkflows/launchplans) are blocked by the max parallelism limit, which prevents progress on child nodes even though the parent workflow has already consumed its parallelism slot.

Impact:

  • Prevents workflows from getting stuck in nested subworkflow and launchplan scenarios
  • Improves execution throughput for workflows with many nested reference launchplans

What changes were proposed in this pull request?

1. Visited Nodes Tracking:

  • Added VisitedNodes interface to ControlFlow in ExecutionContext
  • Tracks workflow nodes using fully-qualified unique IDs (generated via common.GenerateUniqueID)
  • Shared across entire execution tree (all nested subworkflows share the same ControlFlow)
  • Only workflow nodes (NodeKindWorkflow) are tracked; task nodes are excluded
  • Nodes are added to visited set after successful handling
  • Already-visited nodes return NodeStatusRunning without re-evaluation
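The tracking rules above can be sketched as a small, concurrency-safe set keyed by fully-qualified node ID. This is a minimal illustration, not the code from this PR: the `visitedNodes` type, the `evaluate` helper, the string statuses, and the example node ID are all hypothetical names chosen for the sketch, and the real `VisitedNodes` interface may expose different methods.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedNodes is a hypothetical, simplified stand-in for the VisitedNodes
// tracker described above. The actual interface in the PR may differ.
type visitedNodes struct {
	mu  sync.RWMutex
	ids map[string]struct{}
}

func newVisitedNodes() *visitedNodes {
	return &visitedNodes{ids: make(map[string]struct{})}
}

// Add records a fully-qualified node ID as visited.
func (v *visitedNodes) Add(id string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.ids[id] = struct{}{}
}

// Contains reports whether the node ID has already been visited.
func (v *visitedNodes) Contains(id string) bool {
	v.mu.RLock()
	defer v.mu.RUnlock()
	_, ok := v.ids[id]
	return ok
}

// evaluate mimics the described flow: an already-visited workflow node is
// reported as running without re-evaluation; otherwise the handler runs and
// the node is marked visited only after the handler succeeds.
// Non-workflow nodes (for example task nodes) are never tracked.
func evaluate(v *visitedNodes, id string, isWorkflowNode bool, handle func() error) (string, error) {
	if isWorkflowNode && v.Contains(id) {
		return "Running", nil
	}
	if err := handle(); err != nil {
		return "Failed", err
	}
	if isWorkflowNode {
		v.Add(id)
	}
	return "Handled", nil
}

func main() {
	v := newVisitedNodes()
	calls := 0
	handler := func() error { calls++; return nil }

	// The same workflow node is encountered three times in one round.
	for i := 0; i < 3; i++ {
		status, _ := evaluate(v, "parent-n0-child-n1", true, handler)
		fmt.Println(status)
	}
	fmt.Println("handler calls:", calls)
}
```

In this sketch the handler runs once: the first pass reports `Handled`, and the two later passes short-circuit with `Running`.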

2. Configurable Parallelism Bypass:

  • New config: node-config.bypass-parallelism-check-for-workflow-nodes (default: false)
  • When enabled and max parallelism is reached:
    • Workflow nodes in Running, Queued, Succeeding, or Failing phases bypass the check
    • Nodes in NotYetStarted phase are still blocked (must wait for parallelism slot)
    • Task and array nodes remain subject to parallelism limits
  • Config is logged once at executor initialization for visibility
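The bypass rules listed above reduce to a small decision function. The sketch below models only those rules. The `nodeKind` and `nodePhase` types and the `blockedByParallelism` function are hypothetical names for illustration, and the executor's actual parallelism accounting is more involved than a single boolean.

```go
package main

import "fmt"

// Hypothetical, simplified node kinds and phases for illustration only.
type nodeKind int

const (
	kindTask nodeKind = iota
	kindWorkflow
	kindArray
)

type nodePhase int

const (
	phaseNotYetStarted nodePhase = iota
	phaseQueued
	phaseRunning
	phaseSucceeding
	phaseFailing
)

// blockedByParallelism reports whether a node must wait for a parallelism
// slot. With the bypass enabled, workflow nodes that have already started
// are let through even when the limit is reached. Everything else,
// including workflow nodes that have not yet started, stays blocked.
func blockedByParallelism(bypassEnabled, maxReached bool, kind nodeKind, phase nodePhase) bool {
	if !maxReached {
		return false
	}
	if bypassEnabled && kind == kindWorkflow {
		switch phase {
		case phaseQueued, phaseRunning, phaseSucceeding, phaseFailing:
			return false
		}
	}
	return true
}

func main() {
	// Limit reached, bypass enabled.
	fmt.Println(blockedByParallelism(true, true, kindWorkflow, phaseRunning))
	fmt.Println(blockedByParallelism(true, true, kindWorkflow, phaseNotYetStarted))
	fmt.Println(blockedByParallelism(true, true, kindTask, phaseRunning))
	// Limit reached, bypass disabled (the default).
	fmt.Println(blockedByParallelism(false, true, kindWorkflow, phaseRunning))
}
```

With the default `false` setting the function blocks every node once the limit is reached, which matches the backward-compatible behavior described above.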

3. Additional Fix:

  • Fixed error swallowing in RecursiveNodeHandler where return status, nil should have been return status, err

How was this patch tested?

Multiple reproduction scenarios are included in the draft PR, all of which resulted in a stuck workflow node. These test workflows also expose broader issues with max-parallelism handling.

Stuck workflows recovered immediately after the fix was deployed, and newly launched workflows no longer get stuck.
Several unit tests were added as well.

Setup process

Example workflows that can be registered and executed with pyflyte:
https://github.com/sshardool/flyte/blob/4feb41fd0036dde78009bc9962ef269546e54f08/workflows/max-parallel/max_parallelism_workflows.py

Screenshots

Screenshots included in #6873

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

…elism limits

Adjust task resources to work with local k3d
…rkflow nodes (flyteorg#421)

* Add visited nodes tracking and parallelism bypass for workflow nodes in propeller

This commit introduces two key changes to FlytePropeller's node execution:

1. Visited Nodes Tracking:
   - Add VisitedNodes interface and implementation to ExecutionContext
   - Track workflow nodes (subworkflows/launchplans) using fully-qualified
     unique IDs to prevent redundant evaluations in nested workflows
   - Shared across all nested subworkflows within an execution tree

2. Configurable Parallelism Bypass:
   - Add `bypass-parallelism-check-for-workflow-nodes` config option
   - Allow already-running workflow nodes to bypass max parallelism limits
   - Only applies to workflow nodes in non-NotYetStarted phases
   - Defaults to false for backward compatibility

Changes:
- flytepropeller/pkg/controller/executors/execution_context.go
- flytepropeller/pkg/controller/config/config.go
- flytepropeller/pkg/controller/config/config_flags.go
- flytepropeller/pkg/controller/nodes/executor.go
- flytepropeller/pkg/controller/executors/mocks/execution_context.go

Tests:
- Add comprehensive unit tests for visited nodes tracking
- Add unit tests for parallelism bypass behavior
- Add unit tests for ControlFlow and VisitedNodes components
- All existing tests passing

* Add a comment and fix typo in log message

Fix test compatibility issues after cherry-picking fix

- Update mock method syntax to current codebase conventions
- Remove incompatible TestNodeExecutor_RetriesExhaustedErrorKindPreservation test
- All controller tests now passing

Fix comments and formatting
@sshardool changed the title from "Fork multifix" to "Add visited nodes tracking and parallelism bypass for subworkflow nodes" on Jan 28, 2026
@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 82.69231% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.99%. Comparing base (575c0af) to head (249f0f8).
⚠️ Report is 2 commits behind head on master.

Files with missing lines:
  • flytepropeller/pkg/controller/nodes/executor.go: patch coverage 78.04%, 4 lines missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6875      +/-   ##
==========================================
+ Coverage   56.96%   56.99%   +0.03%     
==========================================
  Files         929      929              
  Lines       58152    58195      +43     
==========================================
+ Hits        33125    33170      +45     
+ Misses      21985    21981       -4     
- Partials     3042     3044       +2     
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.10% <ø> (-0.04%) ⬇️
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.02% <ø> (+0.01%) ⬆️
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.13% <ø> (ø)
unittests-flytepropeller 53.79% <82.69%> (+0.16%) ⬆️
unittests-flytestdlib 63.29% <ø> (+0.02%) ⬆️

- Fix spelling: inidicator -> indicator
- Fix spelling: potentally -> potentially
- Regenerate config_flags_test.go for new bypass-parallelism-check-for-workflow-nodes flag

Development

Successfully merging this pull request may close these issues.

[BUG] Flyte subworkflow nodes can be stuck even after referenced launchplans executions are complete