From 4a40369f26154bc0c00c6b74029ac34ceeccef38 Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Mon, 12 Jan 2026 00:34:30 +0000
Subject: [PATCH] Optimize kmeans_clustering

The optimized code achieves a **28x speedup** (2821%) by replacing nested Python loops with vectorized NumPy operations. Here's why it's faster:

## Key Optimizations

**1. Vectorized Distance Calculation**
- **Original**: Triple-nested loops compute distances one feature at a time for each sample-centroid pair
  - Inner loop iterating over features: 23.7% of runtime
  - Distance accumulation: 36% of runtime
  - Square root calls: 10.6% of runtime
- **Optimized**: Broadcasting computes all distances at once (a standalone toy example appears at the end of this message)

```python
differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.linalg.norm(differences, axis=2)
```

This creates a 3D array of differences and computes all norms in a single vectorized operation, eliminating the ~70% of the original runtime spent in the nested distance loops.

**2. Vectorized Label Assignment**
- **Original**: Loop through samples to find each minimum distance (5.1% + 4.4% overhead)
- **Optimized**: `labels = np.argmin(distances, axis=1)` finds the nearest centroid for every sample in one operation

**3. Vectorized Centroid Updates**
- **Original**: Nested loops accumulate per-cluster sums and manually divide by counts (9.7% of runtime)
- **Optimized**: Boolean masking and a single `mean()` call per cluster

```python
mask = labels == j
new_centroids[j] = X[mask].mean(axis=0)
```

NumPy's optimized C implementations handle this aggregation much faster than Python loops.

## Performance Impact

The optimization excels with larger datasets:
- **Small data** (single points, k=1): 19-29% slower due to NumPy overhead
- **Medium data** (50-100 samples): 300-1000% faster
- **Large data** (500+ samples, high dimensions): 2800-9700% faster

The line profiler shows the optimized version spends most of its time on the `X[mask].mean()` operation (38.5%) and the distance calculation (23.2%), both of which run as highly optimized C operations. The original spent 60%+ of its time in raw Python loops computing distances.

This optimization is particularly valuable in hot paths where k-means runs repeatedly (hyperparameter tuning, batch processing) or with high-dimensional data, as evidenced by the massive speedups in the large-scale test cases.
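To make the broadcasting step concrete, here is a minimal standalone sketch of the distance and labeling logic described above. The toy arrays, shapes, and printed output are illustrative assumptions, not values from the repository or its tests:

```python
import numpy as np

# Toy inputs (shapes chosen only for illustration): 4 samples, 2 features, 3 centroids.
X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.5], [9.5, 9.5], [5.0, 5.0]])

# Broadcasting (4, 1, 2) against (1, 3, 2) yields a (4, 3, 2) difference tensor.
differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]

# Collapsing the feature axis gives a (4, 3) matrix of sample-to-centroid distances.
distances = np.linalg.norm(differences, axis=2)

# argmin over the centroid axis assigns each sample to its nearest centroid.
labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```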
---
 src/statistics/clustering.py | 26 ++++++-------------------
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/src/statistics/clustering.py b/src/statistics/clustering.py
index 9b28592..9218f4a 100644
--- a/src/statistics/clustering.py
+++ b/src/statistics/clustering.py
@@ -8,28 +8,14 @@ def kmeans_clustering(
     centroid_indices = np.random.choice(n_samples, k, replace=False)
     centroids = X[centroid_indices]
     for _ in range(max_iter):
-        labels = np.zeros(n_samples, dtype=int)
-        for i in range(n_samples):
-            min_dist = float("inf")
-            for j in range(k):
-                dist = 0
-                for feat in range(X.shape[1]):
-                    dist += (X[i, feat] - centroids[j, feat]) ** 2
-                dist = np.sqrt(dist)
-                if dist < min_dist:
-                    min_dist = dist
-                    labels[i] = j
+        differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
+        distances = np.linalg.norm(differences, axis=2)
+        labels = np.argmin(distances, axis=1)
         new_centroids = np.zeros_like(centroids)
-        counts = np.zeros(k)
-        for i in range(n_samples):
-            cluster = labels[i]
-            counts[cluster] += 1
-            for feat in range(X.shape[1]):
-                new_centroids[cluster, feat] += X[i, feat]
         for j in range(k):
-            if counts[j] > 0:
-                for feat in range(X.shape[1]):
-                    new_centroids[j, feat] /= counts[j]
+            mask = labels == j
+            if np.any(mask):
+                new_centroids[j] = X[mask].mean(axis=0)
         if np.array_equal(centroids, new_centroids):
             break
         centroids = new_centroids
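For reference, here is the optimized function assembled into a self-contained, runnable form. The diff only shows the body from line 8 onward, so the signature, parameter defaults, and return value below are assumptions for illustration rather than the repository's exact interface:

```python
import numpy as np

# A sketch of the optimized kmeans_clustering; signature and return value are assumed.
def kmeans_clustering(X: np.ndarray, k: int, max_iter: int = 100):
    n_samples = X.shape[0]
    # Initialize centroids from k distinct random samples.
    centroid_indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[centroid_indices]
    for _ in range(max_iter):
        # (n, 1, d) - (1, k, d) -> (n, k, d); norms give an (n, k) distance matrix.
        differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        distances = np.linalg.norm(differences, axis=2)
        labels = np.argmin(distances, axis=1)
        new_centroids = np.zeros_like(centroids)
        for j in range(k):
            mask = labels == j
            if np.any(mask):
                # Mean of cluster j's members; an empty cluster keeps a zero
                # centroid, matching the original code when counts[j] == 0.
                new_centroids[j] = X[mask].mean(axis=0)
        # Converged once the centroids stop moving.
        if np.array_equal(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Note that, as in both the original and optimized versions of the diff, a cluster that ends up empty keeps a zero centroid for that iteration; production k-means implementations typically re-seed such clusters instead.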