Conversation

@codeflash-ai codeflash-ai bot commented Jan 16, 2026

📄 6,122% (61.22x) speedup for _lis_outer_body_jax in code_to_optimize/sample_code.py

⏱️ Runtime : 4.09 seconds → 65.8 milliseconds (best of 14 runs)

📝 Explanation and details

The optimized code achieves a 61.22x speedup (6,122%) by adding @jit decorators to both _lis_inner_body_jax and _lis_outer_body_jax. This simple change enables JAX's just-in-time compilation, which fundamentally transforms how the code executes.

What changed:

  • Added the @jit decorator to both functions (a hedged sketch of the decorated code follows this list)
  • Added jit to the imports from jax
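
For reference, here is a minimal sketch of what the decorated functions might look like. The inner body is taken from the copy in the generated tests below; the outer body's partial/fori_loop structure is inferred from the profiler notes that follow, so treat its exact shape as an assumption rather than the actual source.

from functools import partial

import jax.numpy as jnp
from jax import jit, lax


@jit
def _lis_inner_body_jax(j, dp_inner, arr, i):
    # Raise dp[i] to dp[j] + 1 when arr[j] < arr[i] extends a longer increasing run.
    condition = (arr[j] < arr[i]) & (dp_inner[j] + 1 > dp_inner[i])
    new_val = jnp.where(condition, dp_inner[j] + 1, dp_inner[i])
    return dp_inner.at[i].set(new_val)


@jit
def _lis_outer_body_jax(i, dp, arr):
    # Sweep j = 0..i-1, threading dp through the inner body at each step.
    # The partial + fori_loop shape here is an assumption based on the
    # profiler output quoted in this comment, not the exact source.
    body = partial(_lis_inner_body_jax, arr=arr, i=i)
    return lax.fori_loop(0, i, body, dp)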

Why this makes the code faster:

  1. Eliminates Python interpreter overhead: Without JIT, each array operation (arr[j], dp_inner[j], comparisons, etc.) triggers Python function calls and type checks. The line profiler shows the original _lis_inner_body_jax spent 0.537s on just 42 iterations. With JIT, these operations are compiled once into optimized machine code.

  2. Enables operation fusion: JAX's compiler can fuse the sequence of operations in _lis_inner_body_jax (comparison → logical AND → jnp.where → array update) into a single optimized kernel, eliminating intermediate array allocations and memory transfers (a jaxpr illustration appears after this list).

  3. Optimizes the hot loop: The original line profiler shows lax.fori_loop taking 5.52s (100% of _lis_outer_body_jax time). With JIT, JAX optimizes the entire loop body, including the partial function application, into efficient compiled code that runs directly on the accelerator (GPU/TPU) or CPU without Python overhead.

  4. Amortizes compilation cost: The first call compiles the function (visible in the ~20-110ms first-call times in the tests), but subsequent calls with same-shaped inputs reuse the compiled version. This is why the tests show speedups ranging from 1,034% (large arrays) to 48,000% (small arrays): smaller inputs benefit more from eliminating per-call overhead. A short timing illustration follows.
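
A quick, illustrative way to observe this amortization (assuming the import path used by the generated tests below; not part of the PR):

import time

import jax.numpy as jnp

from code_to_optimize.sample_code import _lis_outer_body_jax

arr = jnp.arange(500, dtype=jnp.int32)
dp = jnp.ones_like(arr)

# The first call with this shape/dtype pays the one-time compilation cost.
t0 = time.perf_counter()
_lis_outer_body_jax(499, dp, arr).block_until_ready()
print(f"first call:  {time.perf_counter() - t0:.4f} s")

# Later calls with same-shaped inputs reuse the cached executable.
t0 = time.perf_counter()
_lis_outer_body_jax(499, dp, arr).block_until_ready()
print(f"second call: {time.perf_counter() - t0:.4f} s")

To see the operations XLA gets to fuse (point 2 above), one can print the traced program for the inner body. This is purely illustrative and assumes _lis_inner_body_jax is importable from the same module:

import jax
import jax.numpy as jnp

from code_to_optimize.sample_code import _lis_inner_body_jax

arr = jnp.array([1, 2, 3, 4], dtype=jnp.int32)
dp = jnp.ones_like(arr)

# The jaxpr shows the comparison, logical AND, select (jnp.where), and
# scatter (at[i].set) as one traced program that XLA can compile together
# instead of dispatching four separate Python-level operations.
print(jax.make_jaxpr(_lis_inner_body_jax)(1, dp, arr, 3))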

Performance characteristics based on test results:

  • Small arrays (2-10 elements): 40,000-48,000% speedup; compilation overhead is tiny compared to the per-call Python overhead that is eliminated
  • Medium arrays (100-200 elements): 2,500-4,800% speedup; a good balance between compilation benefit and workload
  • Large arrays (500 elements): 1,034-2,562% speedup; computation time dominates, but fused operations still yield significant gains

Impact on workloads:
Since this appears to implement a longest increasing subsequence (LIS) dynamic-programming algorithm, the optimization would be particularly beneficial for the cases below (a minimal end-to-end sketch follows the list):

  • Repeated LIS computations on similar-sized arrays (compilation happens once)
  • Batch processing scenarios where the function is called many times
  • Real-time applications where sub-millisecond latency matters
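
For context, here is a minimal, hypothetical sketch of how the outer body could be driven to compute a full LIS length. The lis_length driver is not part of the PR and assumes the import path used by the generated tests:

import jax.numpy as jnp
from jax import jit, lax

from code_to_optimize.sample_code import _lis_outer_body_jax


@jit
def lis_length(arr):
    # Hypothetical driver: fold the outer body over every index to run the
    # quadratic DP, then take the maximum dp value as the LIS length.
    dp = jnp.ones_like(arr)
    dp = lax.fori_loop(0, arr.shape[0], lambda i, d: _lis_outer_body_jax(i, d, arr), dp)
    return jnp.max(dp)


# Example: the LIS of [3, 1, 4, 1, 5, 9] is 4 (e.g. 1, 4, 5, 9).
print(lis_length(jnp.array([3, 1, 4, 1, 5, 9], dtype=jnp.int32)))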

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 42 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Click to see Generated Regression Tests
from typing import List

import jax
import jax.numpy as jnp

# imports
from code_to_optimize.sample_code import _lis_outer_body_jax


# function to test
# (copied exactly so tests exercise the real implementation)
def _lis_inner_body_jax(j, dp_inner, arr, i):
    condition = (arr[j] < arr[i]) & (dp_inner[j] + 1 > dp_inner[i])
    new_val = jnp.where(condition, dp_inner[j] + 1, dp_inner[i])
    return dp_inner.at[i].set(new_val)


# Helper used by tests: pure-Python computation of expected dp after processing index i.
def compute_expected_dp_for_i(i: int, dp_list: List[int], arr_list: List[int]) -> List[int]:
    # Make a copy so we don't mutate caller's input
    expected = list(dp_list)
    # For each j in 0..i-1, update expected[i] if arr[j] < arr[i] and dp[j]+1 > expected[i]
    for j in range(i):
        if arr_list[j] < arr_list[i]:
            candidate = expected[j] + 1
            expected[i] = max(expected[i], candidate)
    return expected


# -------------------------
# Basic functionality tests
# -------------------------


def test_basic_increasing_sequence_updates_dp_correctly():
    # Arrange: strictly increasing array; initialize dp with ones (typical LIS base)
    arr = jnp.array([1, 2, 3, 4], dtype=jnp.int32)
    dp = jnp.ones_like(arr)  # [1,1,1,1]
    i = 3  # compute dp for the last index
    # Act: call jitted function under test
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 103ms -> 255μs (40658% faster)
    # Convert to Python list to assert using plain assert
    result_list = result.tolist()
    # Expected: for index 3, LIS length should be 4 (1 + number of previous increasing elements)
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


def test_basic_decreasing_sequence_leaves_dp_unchanged():
    # Arrange: strictly decreasing array; dp initialized to ones
    arr = jnp.array([4, 3, 2, 1], dtype=jnp.int32)
    dp = jnp.ones_like(arr)
    i = 3
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 105ms -> 249μs (42044% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


def test_equal_elements_do_not_increase_dp():
    # Arrange: all equal elements; dp initialized to ones
    arr = jnp.array([2, 2, 2], dtype=jnp.int32)
    dp = jnp.ones_like(arr)
    i = 2
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 104ms -> 235μs (44515% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


def test_preserves_higher_initial_dp_values():
    # Arrange: dp has a higher initial value at index i than any candidate dp[j]+1
    arr = jnp.array([1, 3, 2], dtype=jnp.int32)
    # dp[2] is artificially large (5); function should not lower it
    dp = jnp.array([1, 2, 5], dtype=jnp.int32)
    i = 2
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 104ms -> 227μs (45611% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


def test_i_zero_returns_dp_unchanged():
    # Arrange: any dp and arr; i = 0 should result in no iterations
    arr = jnp.array([10, 20, 30], dtype=jnp.int32)
    dp = jnp.array([7, 8, 9], dtype=jnp.int32)  # arbitrary values
    i = 0
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 19.6ms -> 172μs (11285% faster)
    result_list = result.tolist()


# -------------------------
# Edge case tests
# -------------------------


def test_negative_values_handled_correctly():
    # Arrange: negative and mixed values in array
    arr = jnp.array([-5, -2, 0, 3], dtype=jnp.int32)
    dp = jnp.ones_like(arr)
    i = 3  # last index should see all previous as less -> dp[3] == 4
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 104ms -> 244μs (42575% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


def test_nontrivial_dp_values_with_duplicates_and_mixed_order():
    # Arrange: duplicates and mixed order, dp has varied initial values
    arr = jnp.array([1, 2, 2, 3, 2], dtype=jnp.int32)
    dp = jnp.array([1, 2, 2, 1, 3], dtype=jnp.int32)
    i = 4
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 105ms -> 274μs (38237% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


# -------------------------
# Randomized small tests for consistency
# -------------------------


def test_random_small_arrays_consistency():
    # Use JAX PRNG for deterministic random tests
    key = jax.random.PRNGKey(0)
    # Generate a small array of integers (size 10) with values in a small range to force duplicates
    arr = jax.random.randint(key, shape=(10,), minval=-3, maxval=6, dtype=jnp.int32)
    # Use dp initialized to ones
    dp = jnp.ones_like(arr)
    # Test multiple i values inside a single test to keep test suite compact
    for i in [1, 3, 5, 9]:
        # Act
        codeflash_output = _lis_outer_body_jax(i, dp, arr)
        result = codeflash_output  # 405ms -> 1.14ms (35518% faster)
        result_list = result.tolist()
        expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


# -------------------------
# Large scale tests (within constraints)
# -------------------------


def test_large_scale_increasing_sequence_performance_and_correctness():
    # Arrange: sizable but under 1000 elements as required (choose 500)
    n = 500
    # Strictly increasing sequence - worst-case for LIS computation because dp[i] will grow to i+1
    arr = jnp.arange(n, dtype=jnp.int32)
    dp = jnp.ones_like(arr)
    i = n - 1  # compute for the last index; loop will run i iterations (499) which is under 1000
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 110ms -> 9.73ms (1034% faster)
    result_list = result.tolist()
    # Expected: dp[last] == n and earlier entries remain as initialized (ones)
    expected = dp.tolist()
    expected[i] = n  # since dp indices are 1-based lengths for LIS when initialized with ones


def test_large_scale_mixed_pattern_correctness():
    # Arrange: create a repeating up-down pattern that is non-trivial but still <1000 length
    n = 300
    # Pattern: 0..149, 149..0 repeated - this produces mixed local increasing regions
    first = jnp.arange(150, dtype=jnp.int32)
    second = jnp.arange(150 - 1, -1, -1, dtype=jnp.int32)
    arr = jnp.concatenate([first, second])  # total length 300
    dp = jnp.ones_like(arr)
    # Choose a middle index to exercise varied comparisons
    i = 200
    # Act
    codeflash_output = _lis_outer_body_jax(i, dp, arr)
    result = codeflash_output  # 107ms -> 4.04ms (2562% faster)
    result_list = result.tolist()
    expected = compute_expected_dp_for_i(i, dp.tolist(), arr.tolist())


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import jax
import jax.numpy as jnp

from code_to_optimize.sample_code import _lis_outer_body_jax


def _lis_inner_body_jax(j, dp_inner, arr, i):
    condition = (arr[j] < arr[i]) & (dp_inner[j] + 1 > dp_inner[i])
    new_val = jnp.where(condition, dp_inner[j] + 1, dp_inner[i])
    return dp_inner.at[i].set(new_val)


class TestLisOuterBodyJaxBasic:
    """Basic test cases for _lis_outer_body_jax function."""

    def test_single_element_array(self):
        """Test with a single element array - outer loop should not update."""
        arr = jnp.array([5.0], dtype=jnp.float32)
        dp = jnp.array([1.0], dtype=jnp.float32)
        # When i=0, fori_loop runs from 0 to 0 (no iterations), so dp should remain unchanged
        codeflash_output = _lis_outer_body_jax(0, dp, arr)
        result = codeflash_output  # 19.3ms -> 175μs (10898% faster)

    def test_two_element_increasing_sequence(self):
        """Test with two elements in increasing order."""
        arr = jnp.array([1.0, 2.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0], dtype=jnp.float32)
        # i=1, arr[0]=1.0 < arr[1]=2.0, so dp[1] should become max(1, dp[0]+1) = 2
        codeflash_output = _lis_outer_body_jax(1, dp, arr)
        result = codeflash_output  # 88.2ms -> 208μs (42200% faster)
        expected = jnp.array([1.0, 2.0], dtype=jnp.float32)

    def test_two_element_decreasing_sequence(self):
        """Test with two elements in decreasing order."""
        arr = jnp.array([2.0, 1.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0], dtype=jnp.float32)
        # i=1, arr[0]=2.0 > arr[1]=1.0, so condition is false, dp[1] stays 1
        codeflash_output = _lis_outer_body_jax(1, dp, arr)
        result = codeflash_output  # 88.5ms -> 205μs (43040% faster)
        expected = jnp.array([1.0, 1.0], dtype=jnp.float32)

    def test_three_element_increasing_sequence(self):
        """Test with three elements in fully increasing order."""
        arr = jnp.array([1.0, 2.0, 3.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)
        # After processing i=1: dp becomes [1.0, 2.0, 1.0]
        # Then process i=2: j loops 0,1
        #   j=0: arr[0]=1 < arr[2]=3, dp[0]+1=2 > dp[2]=1, so dp[2] becomes 2
        #   j=1: arr[1]=2 < arr[2]=3, dp[1]+1=3 > dp[2]=2, so dp[2] becomes 3
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 104ms -> 234μs (44626% faster)
        expected = jnp.array([1.0, 1.0, 3.0], dtype=jnp.float32)

    def test_identical_elements(self):
        """Test with all identical elements."""
        arr = jnp.array([5.0, 5.0, 5.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)
        # arr[j] < arr[i] is never true for identical elements, so no updates
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 105ms -> 222μs (47250% faster)
        expected = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)

    def test_negative_numbers(self):
        """Test with negative numbers in the array."""
        arr = jnp.array([-3.0, -1.0, -2.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)
        # At i=2: j=0 (arr[0]=-3 < arr[2]=-2), dp[0]+1=2 > dp[2]=1, so dp[2] becomes 2
        #         j=1 (arr[1]=-1 NOT < arr[2]=-2), no change
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 105ms -> 220μs (47658% faster)
        expected = jnp.array([1.0, 1.0, 2.0], dtype=jnp.float32)

    def test_zero_values(self):
        """Test with arrays containing zero values."""
        arr = jnp.array([0.0, 1.0, 0.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)
        # At i=2: j=0 (arr[0]=0 NOT < arr[2]=0), no change
        #         j=1 (arr[1]=1 > arr[2]=0), no change
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 104ms -> 224μs (46497% faster)
        expected = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)


class TestLisOuterBodyJaxEdgeCases:
    """Edge case tests for _lis_outer_body_jax function."""

    def test_very_large_dp_values(self):
        """Test with very large DP values that could overflow in non-JAX implementations."""
        arr = jnp.array([1.0, 2.0], dtype=jnp.float32)
        dp = jnp.array([1e6, 1e6], dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(1, dp, arr)
        result = codeflash_output  # 87.8ms -> 195μs (44799% faster)

    def test_very_small_dp_values(self):
        """Test with very small DP values."""
        arr = jnp.array([1.0, 2.0], dtype=jnp.float32)
        dp = jnp.array([1e-6, 1e-6], dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(1, dp, arr)
        result = codeflash_output  # 87.8ms -> 200μs (43720% faster)

    def test_descending_sequence(self):
        """Test with a strictly descending sequence."""
        arr = jnp.array([5.0, 4.0, 3.0, 2.0, 1.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0, 1.0, 1.0], dtype=jnp.float32)
        # No element to the left is smaller, so all dp values remain 1
        codeflash_output = _lis_outer_body_jax(4, dp, arr)
        result = codeflash_output  # 103ms -> 270μs (38380% faster)
        expected = jnp.array([1.0, 1.0, 1.0, 1.0, 1.0], dtype=jnp.float32)

    def test_i_equals_zero(self):
        """Test when i=0 (first index - no previous elements to compare)."""
        arr = jnp.array([10.0, 20.0, 30.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float32)
        # fori_loop(0, 0, ...) executes 0 iterations, so dp unchanged
        codeflash_output = _lis_outer_body_jax(0, dp, arr)
        result = codeflash_output  # 18.8ms -> 168μs (11014% faster)

    def test_mixed_positive_negative(self):
        """Test with mixed positive and negative values."""
        arr = jnp.array([-10.0, 5.0, -3.0, 8.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0, 1.0], dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(3, dp, arr)
        result = codeflash_output  # 103ms -> 252μs (40987% faster)

    def test_float64_dtype(self):
        """Test with float64 dtype for higher precision."""
        arr = jnp.array([1.0, 2.0, 3.0], dtype=jnp.float64)
        dp = jnp.array([1.0, 1.0, 1.0], dtype=jnp.float64)
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 104ms -> 215μs (48168% faster)

    def test_int32_dtype(self):
        """Test with int32 dtype for integer values."""
        arr = jnp.array([1, 2, 3], dtype=jnp.int32)
        dp = jnp.array([1, 1, 1], dtype=jnp.int32)
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 105ms -> 232μs (45186% faster)
        expected = jnp.array([1, 1, 3], dtype=jnp.int32)

    def test_int64_dtype(self):
        """Test with int64 dtype."""
        arr = jnp.array([1, 2, 3], dtype=jnp.int64)
        dp = jnp.array([1, 1, 1], dtype=jnp.int64)
        codeflash_output = _lis_outer_body_jax(2, dp, arr)
        result = codeflash_output  # 103ms -> 222μs (46385% faster)

    def test_alternating_sequence(self):
        """Test with alternating up-down pattern."""
        arr = jnp.array([1.0, 3.0, 2.0, 4.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0, 1.0], dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(3, dp, arr)
        result = codeflash_output  # 103ms -> 248μs (41576% faster)


class TestLisOuterBodyJaxComplex:
    """Complex and integration test cases."""

    def test_longer_increasing_sequence(self):
        """Test with a longer strictly increasing sequence."""
        arr = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=jnp.float32)
        dp = jnp.array([1.0] * 8, dtype=jnp.float32)
        # After processing all elements, dp should reflect correct LIS lengths
        codeflash_output = _lis_outer_body_jax(7, dp, arr)
        result = codeflash_output  # 104ms -> 330μs (31404% faster)

    def test_complex_lis_pattern(self):
        """Test with a complex pattern that forms multiple LIS."""
        arr = jnp.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0], dtype=jnp.float32)
        dp = jnp.array([1.0] * 6, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(5, dp, arr)
        result = codeflash_output  # 105ms -> 289μs (36157% faster)

    def test_repeated_updates_at_same_index(self):
        """Test that multiple predecessors can update the same DP value."""
        arr = jnp.array([1.0, 2.0, 3.0, 2.5], dtype=jnp.float32)
        dp = jnp.array([1.0, 1.0, 1.0, 1.0], dtype=jnp.float32)
        # At i=3: arr[3]=2.5
        #   j=0: arr[0]=1 < 2.5, dp[0]+1=2 > dp[3]=1, dp[3] becomes 2
        #   j=1: arr[1]=2 < 2.5, dp[1]+1=2 NOT > dp[3]=2, no change
        #   j=2: arr[2]=3 NOT < 2.5, no change
        codeflash_output = _lis_outer_body_jax(3, dp, arr)
        result = codeflash_output  # 104ms -> 243μs (42880% faster)

    def test_maintains_previous_dp_values(self):
        """Test that processing at index i doesn't affect previous DP values."""
        arr = jnp.array([1.0, 2.0, 3.0, 4.0], dtype=jnp.float32)
        dp = jnp.array([1.0, 2.0, 3.0, 1.0], dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(3, dp, arr)
        result = codeflash_output  # 104ms -> 239μs (43422% faster)


class TestLisOuterBodyJaxLargeScale:
    """Large scale test cases for performance and scalability."""

    def test_large_increasing_sequence(self):
        """Test with a large strictly increasing sequence (100 elements)."""
        size = 100
        arr = jnp.arange(1.0, size + 1.0, dtype=jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 106ms -> 2.16ms (4814% faster)

    def test_large_array_performance(self):
        """Test performance with a large array (500 elements)."""
        size = 500
        arr = jnp.arange(1.0, size + 1.0, dtype=jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        # This should complete without error and with correct semantics
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 111ms -> 9.83ms (1039% faster)

    def test_large_random_sequence(self):
        """Test with large random sequence (200 elements)."""
        size = 200
        key = jax.random.PRNGKey(42)
        arr = jax.random.uniform(key, (size,), minval=0.0, maxval=1000.0, dtype=jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 106ms -> 4.03ms (2546% faster)

    def test_large_descending_sequence(self):
        """Test with large strictly descending sequence (300 elements)."""
        size = 300
        arr = jnp.arange(size, 0.0, -1.0, dtype=jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 108ms -> 5.94ms (1720% faster)

    def test_large_sawtooth_pattern(self):
        """Test with large sawtooth pattern (repeating up-down 100 times)."""
        size = 200
        pattern = jnp.array([1.0, 2.0] * (size // 2), dtype=jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, pattern)
        result = codeflash_output  # 106ms -> 4.04ms (2531% faster)

    def test_large_array_with_duplicates(self):
        """Test with large array containing many duplicates (250 elements)."""
        size = 250
        # Create array with many duplicates: [1, 1, 1, ..., 2, 2, 2, ..., 3, 3, 3, ...]
        arr = jnp.repeat(jnp.arange(1.0, 11.0, dtype=jnp.float32), 25)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 108ms -> 5.05ms (2046% faster)

    def test_large_pyramid_pattern(self):
        """Test with pyramid pattern: increases then decreases (180 elements)."""
        size = 180
        # Create pyramid: [1,2,3,...,90,...,3,2,1]
        up = jnp.arange(1.0, size // 2 + 1.0, dtype=jnp.float32)
        down = jnp.arange(size // 2 - 1.0, 0.0, -1.0, dtype=jnp.float32)
        arr = jnp.concatenate([up, down])
        dp = jnp.ones(len(arr), dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(len(arr) - 1, dp, arr)
        result = codeflash_output  # 106ms -> 3.62ms (2851% faster)

    def test_large_alternating_min_max(self):
        """Test with large alternating min/max pattern (220 elements)."""
        size = 220
        # Create alternating pattern: [0, 1000, 0, 1000, ...]
        arr = jnp.where(jnp.arange(size) % 2 == 0, 0.0, 1000.0).astype(jnp.float32)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 107ms -> 4.39ms (2348% faster)

    def test_large_nearly_sorted_sequence(self):
        """Test with nearly sorted sequence with few anomalies (280 elements)."""
        size = 280
        arr = jnp.arange(1.0, size + 1.0, dtype=jnp.float32)
        # Introduce a few inversions
        arr = arr.at[50].set(arr[51] - 1)
        arr = arr.at[150].set(arr[151] - 1)
        dp = jnp.ones(size, dtype=jnp.float32)
        codeflash_output = _lis_outer_body_jax(size - 1, dp, arr)
        result = codeflash_output  # 108ms -> 5.56ms (1851% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_lis_outer_body_jax-mkgj0d5t` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 on January 16, 2026 at 06:57
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels on January 16, 2026