From aa09b8ee97a97a2be455a27dda2614c8293efd88 Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Mon, 12 Jan 2026 00:18:26 +0000
Subject: [PATCH] Optimize image_rotation

Brief explanation of why and how the optimized code is faster than the original.

What changed (key optimizations)
- Removed the nested Python loops over every output pixel and replaced them with NumPy vectorized operations.
- Constructed coordinate grids once with ys = np.arange(...)[:, None] and xs = np.arange(...)[None, :].
- Computed the mapped coordinates original_yf and original_xf as whole 2D arrays using broadcasted arithmetic.
- Converted to integer indices (astype(int)) and built a boolean valid_mask for bounds checking in one step.
- Performed a single advanced-indexed assignment: rotated[valid_mask] = image[original_y[valid_mask], original_x[valid_mask]].
- Added an early exit for new_height == 0 or new_width == 0 to avoid unnecessary array work.

Why this is faster (technical reasoning)
- Eliminates per-pixel Python overhead: the original code spent almost all of its time in Python-level loops doing arithmetic, int conversions, and bounds checks for each (y, x) pair, incurring interpreter overhead on every iteration. The line profiler shows those inner loops dominated runtime.
- Moves the heavy work into NumPy's C-implemented loops: computing ys/xs, the mapping arithmetic, astype, mask creation, and the indexed copy all execute in compiled code (tight C loops, often vectorized). This reduces thousands of Python-level operations to a few array ops.
- Fewer conditional checks and function calls per pixel: bounds checks are done with a single vectorized boolean mask rather than an `if` per pixel.
- The final assignment touches only valid pixels (via the mask), so there are fewer memory writes than assigning every output pixel in Python.

Measured results that support this
- Wall-clock: original 2.70 ms -> optimized 0.728 ms (~3.7x speedup).
- Line profiler: the original run is dominated by the nested loops and per-pixel arithmetic; the optimized version's cost is concentrated in a few NumPy lines (array creation, arithmetic, mask, and one bulk assignment), which are much cheaper overall.

Behavioral/compatibility notes
- Behavior is preserved: integer truncation semantics are the same (astype(int) truncates floats like the original int()). Channel handling is also preserved: the single advanced-indexed assignment works for both grayscale (2D) and multi-channel images, because NumPy advanced indexing returns the appropriate shape to assign per-pixel vectors.
- The early return for zero-dimension outputs avoids creating unnecessary arrays (safe and slightly faster for empty inputs).
- Memory tradeoff: the optimized code allocates several temporaries (ys, xs, original_*f, the integer index arrays, and the mask), so peak memory grows in proportion to the output image size, whereas the original per-pixel loop used little temporary memory. In practice, for typical small-to-moderate images the CPU cost saved overwhelms the extra temporaries; for extremely large images, memory pressure could be a concern.

When this optimization helps most / tradeoffs
- Big wins for medium-to-large images where Python loop overhead dominates (tests with 31x31 and similar sizes show large speedups).
- For extremely small images (a single pixel or tiny shapes), the cost of allocating intermediate NumPy arrays can dominate; the microbenchmarks in annotated_tests show some tiny-case timings where the optimized version is not faster or is slightly slower. If the function is frequently called with very small images in a tight loop, consider:
  - keeping the original loop-based path below a size threshold, or
  - reusing preallocated arrays, or
  - a JIT approach (numba) if both memory and per-call overhead must be minimized.
- For typical usage (a hot path with many pixels), the vectorized approach is preferable.
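The vectorized pattern described above can be sketched as a small standalone function. This is a minimal illustration of the technique, not the patched image_rotation itself: the name rotate_sketch is hypothetical, and for brevity it keeps the output the same size as the input instead of computing new dimensions.

```python
import numpy as np

def rotate_sketch(image: np.ndarray, angle_degrees: float) -> np.ndarray:
    """Minimal sketch of the vectorized mapping; same-size output for brevity."""
    theta = np.deg2rad(angle_degrees)
    cos_theta, sin_theta = np.cos(theta), np.sin(theta)
    height, width = image.shape[:2]
    center_y, center_x = height // 2, width // 2
    rotated = np.zeros_like(image)

    if height == 0 or width == 0:
        return rotated

    # Broadcasting: a (H, 1) array combined with a (1, W) array yields
    # full (H, W) coordinate grids without any Python-level loop.
    ys = np.arange(height, dtype=float)[:, None] - center_y
    xs = np.arange(width, dtype=float)[None, :] - center_x

    original_y = (ys * cos_theta - xs * sin_theta + center_y).astype(int)
    original_x = (ys * sin_theta + xs * cos_theta + center_x).astype(int)

    # One vectorized bounds check replaces an `if` per pixel.
    valid = (
        (original_y >= 0) & (original_y < height)
        & (original_x >= 0) & (original_x < width)
    )

    # Works for 2D (grayscale) and 3D (H, W, C) images alike: the boolean
    # mask indexes the leading axes, selecting whole per-pixel channel vectors.
    rotated[valid] = image[original_y[valid], original_x[valid]]
    return rotated

gray = np.arange(9.0).reshape(3, 3)
color = np.arange(27).reshape(3, 3, 3)
out_gray = rotate_sketch(gray, 0.0)    # 0-degree rotation maps every pixel to itself
out_color = rotate_sketch(color, 0.0)
```

The same masked assignment handles both test images because the (H, W) boolean mask selects pixels along the leading axes, so any trailing channel axis is copied as a unit.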
Summary
- The optimized version converts thousands of Python-level per-pixel operations into a handful of NumPy array operations executed in C, eliminating interpreter overhead and per-pixel branching. This change is why we see the measured ~3.7x runtime improvement on representative tests, with large wins on larger images and acceptable tradeoffs (extra temporaries) for most workloads.
---
 src/signal/image.py | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/src/signal/image.py b/src/signal/image.py
index 68f01a4..64fbcb5 100644
--- a/src/signal/image.py
+++ b/src/signal/image.py
@@ -15,14 +15,29 @@ def image_rotation(image: np.ndarray, angle_degrees: float) -> np.ndarray:
         else (new_height, new_width)
     )
     new_center_y, new_center_x = new_height // 2, new_width // 2
-    for y in range(new_height):
-        for x in range(new_width):
-            offset_y = y - new_center_y
-            offset_x = x - new_center_x
-            original_y = int(offset_y * cos_theta - offset_x * sin_theta + center_y)
-            original_x = int(offset_y * sin_theta + offset_x * cos_theta + center_x)
-            if 0 <= original_y < height and 0 <= original_x < width:
-                rotated[y, x] = image[original_y, original_x]
+
+    if new_height == 0 or new_width == 0:
+        return rotated
+
+    ys = np.arange(new_height, dtype=float)[:, None] - new_center_y
+    xs = np.arange(new_width, dtype=float)[None, :] - new_center_x
+
+    original_yf = ys * cos_theta - xs * sin_theta + center_y
+    original_xf = ys * sin_theta + xs * cos_theta + center_x
+
+    original_y = original_yf.astype(int)
+    original_x = original_xf.astype(int)
+
+    valid_mask = (
+        (original_y >= 0)
+        & (original_y < height)
+        & (original_x >= 0)
+        & (original_x < width)
+    )
+
+    # Single assignment works for both grayscale and color images:
+    rotated[valid_mask] = image[original_y[valid_mask], original_x[valid_mask]]
+
     return rotated