From aa09b8ee97a97a2be455a27dda2614c8293efd88 Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Mon, 12 Jan 2026 00:18:26 +0000
Subject: [PATCH] Optimize image_rotation

Brief explanation of why and how the optimized code is faster than the original.

What changed (key optimizations)
- Removed the nested Python loops over every output pixel and replaced them with NumPy vectorized operations.
- Constructed coordinate grids once with ys = np.arange(...)[:, None] and xs = np.arange(...)[None, :].
- Computed the mapped coordinates original_yf and original_xf as whole 2D arrays using broadcasted arithmetic.
- Converted to integer indices (astype(int)) and built a boolean valid_mask for bounds checking in one step.
- Performed a single advanced-indexed assignment: rotated[valid_mask] = image[original_y[valid_mask], original_x[valid_mask]].
- Added an early exit for new_height == 0 or new_width == 0 to avoid unnecessary array work.

Why this is faster (technical reasoning)
- Eliminates per-pixel Python overhead: the original code spent almost all of its time in Python-level loops doing arithmetic, int conversions, and bounds checks for each (y, x) pair, incurring interpreter overhead on every iteration. The line profiler shows those inner loops dominated runtime.
- Moves the heavy work into NumPy's C-implemented loops: computing ys/xs, the mapping arithmetic, astype, mask creation, and the indexed copy all execute in compiled code (tight C loops, often vectorized). This reduces thousands of Python-level operations to a few array ops.
- Fewer conditional checks and function calls per pixel: bounds checks are done with a single vectorized boolean mask rather than an `if` per pixel.
- The final assignment touches only valid pixels (via the mask), so there are fewer memory writes than assigning every output pixel in Python.

Measured results that support this
- Wall-clock: original 2.70 ms -> optimized 0.728 ms (~3.7x speedup).
- Line profiler: the original run is dominated by the nested loops and per-pixel arithmetic; the optimized version's cost is concentrated in a few NumPy lines (array creation, arithmetic, mask, and one bulk assignment), which are much cheaper overall.

Behavioral/compatibility notes
- Behavior is preserved: integer truncation semantics are the same (astype(int) truncates floats like the original int()). Channel handling is also preserved: the single advanced-indexed assignment works for both grayscale (2D) and multi-channel images, because NumPy advanced indexing returns the appropriate shape to assign per-pixel vectors.
- The early return for zero-dimension outputs avoids creating unnecessary arrays (safe and slightly faster for empty inputs).
- Memory tradeoff: the optimized code allocates several temporaries (ys, xs, original_*f, the integer index arrays, and the mask), so peak memory grows in proportion to the output image size, whereas the original per-pixel loop used little temporary memory. In practice, for typical small-to-moderate images the CPU cost saved overwhelms the extra temporaries; for extremely large images, memory pressure could be a concern.

When this optimization helps most / tradeoffs
- Big wins for medium-to-large images where Python loop overhead dominates (tests with 31x31 and similar sizes show large speedups).
- For extremely small images (a single pixel or tiny shapes), the cost of allocating intermediate NumPy arrays can dominate; the microbenchmarks in annotated_tests show some tiny-case timings where the optimized version is not faster or is slightly slower. If the function is frequently called with very small images in a tight loop, consider:
  - keeping the original loop-based path below a size threshold, or
  - reusing preallocated arrays, or
  - a JIT approach (numba) if both memory and per-call overhead must be minimized.
- For typical usage (a hot path with many pixels), the vectorized approach is preferable.
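The vectorized pattern described above can be sketched as a small standalone function. This is a minimal illustration of the technique, not the patched image_rotation itself: the name rotate_sketch is hypothetical, and for brevity it keeps the output the same size as the input instead of computing new dimensions.

```python
import numpy as np

def rotate_sketch(image: np.ndarray, angle_degrees: float) -> np.ndarray:
    """Minimal sketch of the vectorized mapping; same-size output for brevity."""
    theta = np.deg2rad(angle_degrees)
    cos_theta, sin_theta = np.cos(theta), np.sin(theta)
    height, width = image.shape[:2]
    center_y, center_x = height // 2, width // 2
    rotated = np.zeros_like(image)

    if height == 0 or width == 0:
        return rotated

    # Broadcasting: a (H, 1) array combined with a (1, W) array yields
    # full (H, W) coordinate grids without any Python-level loop.
    ys = np.arange(height, dtype=float)[:, None] - center_y
    xs = np.arange(width, dtype=float)[None, :] - center_x

    original_y = (ys * cos_theta - xs * sin_theta + center_y).astype(int)
    original_x = (ys * sin_theta + xs * cos_theta + center_x).astype(int)

    # One vectorized bounds check replaces an `if` per pixel.
    valid = (
        (original_y >= 0) & (original_y < height)
        & (original_x >= 0) & (original_x < width)
    )

    # Works for 2D (grayscale) and 3D (H, W, C) images alike: the boolean
    # mask indexes the leading axes, selecting whole per-pixel channel vectors.
    rotated[valid] = image[original_y[valid], original_x[valid]]
    return rotated

gray = np.arange(9.0).reshape(3, 3)
color = np.arange(27).reshape(3, 3, 3)
out_gray = rotate_sketch(gray, 0.0)    # 0-degree rotation maps every pixel to itself
out_color = rotate_sketch(color, 0.0)
```

The same masked assignment handles both test images because the (H, W) boolean mask selects pixels along the leading axes, so any trailing channel axis is copied as a unit.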
Summary
- The optimized version converts thousands of Python-level per-pixel operations into a handful of NumPy array operations executed in C, eliminating interpreter overhead and per-pixel branching. This change is why we see the measured ~3.7x runtime improvement on representative tests, with large wins on larger images and acceptable tradeoffs (extra temporaries) for most workloads.
---
 src/signal/image.py | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/src/signal/image.py b/src/signal/image.py
index 68f01a4..64fbcb5 100644
--- a/src/signal/image.py
+++ b/src/signal/image.py
@@ -15,14 +15,29 @@ def image_rotation(image: np.ndarray, angle_degrees: float) -> np.ndarray:
         else (new_height, new_width)
     )
     new_center_y, new_center_x = new_height // 2, new_width // 2
-    for y in range(new_height):
-        for x in range(new_width):
-            offset_y = y - new_center_y
-            offset_x = x - new_center_x
-            original_y = int(offset_y * cos_theta - offset_x * sin_theta + center_y)
-            original_x = int(offset_y * sin_theta + offset_x * cos_theta + center_x)
-            if 0 <= original_y < height and 0 <= original_x < width:
-                rotated[y, x] = image[original_y, original_x]
+
+    if new_height == 0 or new_width == 0:
+        return rotated
+
+    ys = np.arange(new_height, dtype=float)[:, None] - new_center_y
+    xs = np.arange(new_width, dtype=float)[None, :] - new_center_x
+
+    original_yf = ys * cos_theta - xs * sin_theta + center_y
+    original_xf = ys * sin_theta + xs * cos_theta + center_x
+
+    original_y = original_yf.astype(int)
+    original_x = original_xf.astype(int)
+
+    valid_mask = (
+        (original_y >= 0)
+        & (original_y < height)
+        & (original_x >= 0)
+        & (original_x < width)
+    )
+
+    # Single assignment works for both grayscale and color images:
+    rotated[valid_mask] = image[original_y[valid_mask], original_x[valid_mask]]
+
     return rotated