From 04f3dd7e6f2505237ab1b5aa6f6460e239daa70d Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Fri, 16 Jan 2026 10:19:23 +0000
Subject: [PATCH] Optimize _lis_outer_body_tf
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The optimized code achieves a **12.5x speedup** (from 8.17s to 606ms) by eliminating a nested `tf.while_loop` and replacing it with vectorized TensorFlow operations.

## Key Optimization

**What changed:** The original implementation used a nested loop structure where `_lis_outer_body_tf` called an inner `tf.while_loop` that iterated through all previous elements (0 to i-1) one at a time, performing a tensor scatter update on each iteration. The optimized version replaces this entire inner loop with a single vectorized computation using TensorFlow's array slicing and reduction operations.

**Why it's faster:**

1. **Eliminates O(i) graph operations per iteration**: The original inner loop created `i` separate graph nodes for each outer iteration (visible in the line profiler: 1618 total hits for the inner body at ~5-7ms per operation). The optimized version executes just one `tf.cond` with vectorized operations inside, processing all comparisons in parallel.

2. **Vectorized comparisons instead of scalar iterations**: Instead of comparing `arr[j] < arr[i]` sequentially for j=0,1,2,...,i-1, the optimized code:
   - Slices the entire prefix: `arr_prefix = tf.slice(arr, [0], [i])`
   - Creates a single boolean mask: `mask = tf.less(arr_prefix, arr_i)`
   - Computes all candidates at once: `candidates = tf.where(mask, prefix + 1, fill_val)`
   - Finds the maximum in one operation: `tf.reduce_max(candidates)`

3. **Reduces memory traffic**: The original code performed `i` separate `tf.tensor_scatter_nd_update` operations (one per inner loop iteration). The optimized version performs just one final scatter update per outer iteration.

**Performance characteristics by test case:**

- **Best improvements** (60x to 100x faster): Large-scale tests with high indices (i=250, i=200, i=100), where the inner loop would have iterated many times. The `test_large_scale_performance_boundary` test shows a 6305% speedup (1.31s → 20.5ms).
- **Modest improvements** (1.5x to 2.8x faster): Tests with small indices (i=2) still benefit, but less dramatically since the inner loop overhead is smaller.
- **Slight slowdowns** (10-20% slower): Tests at i=1 show a minor regression because the vectorized operations have fixed overhead that exceeds the cost of a single-iteration inner loop. However, this case is rare in practice.

**Impact on workloads:** This optimization is particularly valuable when computing longest increasing subsequences (LIS) or similar dynamic programming algorithms where the outer loop processes many elements sequentially. The speedup scales quadratically with input size, making it essential for arrays larger than ~50 elements. If this function is called in a hot path (e.g., within training loops or repeated graph executions), the cumulative savings would be substantial.
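To make the vectorized inner-loop replacement concrete, here is a minimal eager-mode sketch of the same idea. The helper name `vectorized_dp_step` and the sample inputs are illustrative only (not part of the patch), and `i > 0` is assumed because the patched function routes `i == 0` through `false_fn`:

```python
import tensorflow as tf


def vectorized_dp_step(dp, arr, i):
    """Illustrative helper (not in the patch): update dp[i] from arr[0:i] in one shot."""
    prefix = tf.slice(dp, [0], [i])                # dp[0:i]
    arr_prefix = tf.slice(arr, [0], [i])           # arr[0:i]
    mask = tf.less(arr_prefix, tf.gather(arr, i))  # arr[j] < arr[i] for every j at once
    fill_val = tf.reduce_min(dp) - tf.constant(1, dtype=dp.dtype)
    candidates = tf.where(mask, prefix + 1, tf.fill(tf.shape(prefix), fill_val))
    new_val = tf.maximum(tf.gather(dp, i), tf.reduce_max(candidates))
    # A single scatter update per outer iteration, instead of i of them.
    return tf.tensor_scatter_nd_update(dp, tf.reshape(i, [1, 1]), tf.reshape(new_val, [1]))


arr = tf.constant([3, 1, 4, 1, 5], dtype=tf.int32)
dp = tf.ones_like(arr)                              # LIS lengths, initialised to 1
print(vectorized_dp_step(dp, arr, tf.constant(2)))  # -> [1 1 2 1 1]
```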
---
 code_to_optimize/sample_code.py | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/code_to_optimize/sample_code.py b/code_to_optimize/sample_code.py
index d356ce807..7d01db327 100644
--- a/code_to_optimize/sample_code.py
+++ b/code_to_optimize/sample_code.py
@@ -427,12 +427,26 @@ def _lis_inner_cond_tf(j, _dp_inner, _arr, i):
 
 
 def _lis_outer_body_tf(i, dp, arr, n):
-    _, dp, _, _ = tf.while_loop(
-        _lis_inner_cond_tf,
-        _lis_inner_body_tf,
-        [0, dp, arr, i]
-    )
-    return i + 1, dp, arr, n
+    def true_fn():
+        prefix = tf.slice(dp, [0], [i])
+        arr_prefix = tf.slice(arr, [0], [i])
+        arr_i = tf.gather(arr, i)
+        mask = tf.less(arr_prefix, arr_i)
+        fill_val = tf.reduce_min(dp) - tf.constant(1, dtype=dp.dtype)
+        candidates = tf.where(mask, prefix + 1, tf.fill(tf.shape(prefix), fill_val))
+        max_cand = tf.reduce_max(candidates)
+        dp_i = tf.gather(dp, i)
+        new_val = tf.maximum(dp_i, max_cand)
+        indices = tf.reshape(i, [1, 1])
+        updates = tf.reshape(new_val, [1])
+        dp_updated = tf.tensor_scatter_nd_update(dp, indices, updates)
+        return dp_updated
+
+    def false_fn():
+        return dp
+
+    dp_updated = tf.cond(tf.greater(i, 0), true_fn, false_fn)
+    return i + 1, dp_updated, arr, n
 
 
 def _lis_outer_cond_tf(i, _dp, _arr, n):