⚡️ Speed up function _lis_outer_body_tf by 1,249%
#1082
+20
−6
📄 1,249% (12.49x) speedup for `_lis_outer_body_tf` in `code_to_optimize/sample_code.py`
⏱️ Runtime: 8.17 seconds → 606 milliseconds (best of 6 runs)
📝 Explanation and details
The optimized code achieves a 12.5x speedup (from 8.17s to 606ms) by eliminating a nested `tf.while_loop` and replacing it with vectorized TensorFlow operations.
Key Optimization
What changed: The original implementation used a nested loop structure where `_lis_outer_body_tf` called an inner `tf.while_loop` that iterated through all previous elements (0 to i-1) one at a time, performing tensor scatter updates on each iteration. The optimized version replaces this entire inner loop with a single vectorized computation using TensorFlow's array slicing and reduction operations.
Why it's faster:
- Eliminates O(i) graph operations per iteration: The original inner loop created `i` separate graph nodes for each outer iteration (visible in line profiler: 1618 total hits for inner body with ~5-7ms per operation). The optimized version executes just one `tf.cond` with vectorized operations inside, processing all comparisons in parallel.
- Vectorized comparisons instead of scalar iterations: Instead of comparing `arr[j] < arr[i]` sequentially for j = 0, 1, 2, ..., i-1, the optimized code computes, in order:
  - `arr_prefix = tf.slice(arr, [0], [i])`
  - `mask = tf.less(arr_prefix, arr_i)`
  - `candidates = tf.where(mask, prefix + 1, fill_val)`
  - `tf.reduce_max(candidates)`
- Reduces memory traffic: The original code performed `i` separate `tf.tensor_scatter_nd_update` operations (one per inner loop iteration). The optimized version performs just one final scatter update per outer iteration.
Performance characteristics by test case:
- `test_large_scale_performance_boundary` shows 6305% speedup (1.31s → 20.5ms).
Impact on workloads: This optimization is particularly valuable when computing longest increasing subsequences (LIS) or similar dynamic programming algorithms where the outer loop processes many elements sequentially. The number of inner-loop iterations eliminated grows quadratically with input size (0 + 1 + ... + (n-1) across the outer loop), so the benefit is most pronounced for arrays larger than ~50 elements. If this function is called in a hot path (e.g., within training loops or repeated graph executions), the cumulative savings would be substantial.
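The vectorized step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `lis_outer_step`, `lis_length`, and `dp_prefix` are hypothetical, the loop state is assumed to be `(i, arr, dp)`, and `fill_val` is assumed to be 1 (a subsequence consisting of `arr[i]` alone).

```python
import tensorflow as tf


def lis_outer_step(i, arr, dp):
    """One outer-loop step of an O(n^2) LIS DP with the inner loop vectorized.

    Hypothetical stand-in for the optimized body; the real signature may differ.
    """

    def update():
        arr_i = arr[i]
        size = tf.reshape(i, [1])                  # slice length as a 1-D tensor
        arr_prefix = tf.slice(arr, [0], size)      # arr[0:i]
        dp_prefix = tf.slice(dp, [0], size)        # dp[0:i]
        mask = tf.less(arr_prefix, arr_i)          # arr[j] < arr[i] for every j < i
        fill_val = tf.ones_like(dp_prefix)         # assumed fill: length-1 subsequence
        candidates = tf.where(mask, dp_prefix + 1, fill_val)
        best = tf.reduce_max(candidates)
        # A single scatter update per outer iteration.
        return tf.tensor_scatter_nd_update(
            dp, tf.reshape(i, [1, 1]), tf.reshape(best, [1])
        )

    # For i == 0 the prefix is empty, so the reduction is skipped.
    new_dp = tf.cond(i > 0, update, lambda: dp)
    return i + 1, arr, new_dp


def lis_length(values):
    """Length of the longest strictly increasing subsequence (non-empty input)."""
    arr = tf.constant(values, dtype=tf.int32)
    n = arr.shape[0]
    dp = tf.ones([n], dtype=tf.int32)
    _, _, dp = tf.while_loop(
        lambda i, arr, dp: i < n,
        lis_outer_step,
        (tf.constant(0), arr, dp),
    )
    return int(tf.reduce_max(dp))
```

Under these assumptions, `lis_length([10, 9, 2, 5, 3, 7, 101, 18])` returns 4 (for example via 2, 3, 7, 18). Only the outer `tf.while_loop` remains; each step does one slice, one comparison, one reduction, and one scatter update.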
✅ Correctness verification report:
To edit these changes, run `git checkout codeflash/optimize-_lis_outer_body_tf-mkgq816p` and push.