Conversation

@codeflash-ai codeflash-ai bot commented Jan 12, 2026

📄 2,821% (28.21x) speedup for `kmeans_clustering` in `src/statistics/clustering.py`

⏱️ Runtime: 475 milliseconds → 16.3 milliseconds (best of 145 runs)

📝 Explanation and details

The optimized code achieves a **28x speedup** (2,821%) by replacing nested Python loops with vectorized NumPy operations. Here's why it's faster:

## Key Optimizations

**1. Vectorized Distance Calculation**
- **Original**: Triple-nested loops compute distances one feature at a time for each sample-centroid pair
  - Inner loop iterates over features: 23.7% of runtime
  - Distance accumulation: 36% of runtime
  - Square root calls: 10.6% of runtime
- **Optimized**: Broadcasting computes all distances at once
  ```python
  differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
  distances = np.linalg.norm(differences, axis=2)
  ```
  This creates a 3D array of differences and computes norms in one vectorized operation, eliminating the ~70% of the original runtime spent in nested loops.

**2. Vectorized Label Assignment**
- **Original**: Loop through samples finding the minimum distance (5.1% + 4.4% overhead)
- **Optimized**: `labels = np.argmin(distances, axis=1)` finds every sample's nearest centroid in one operation

**3. Vectorized Centroid Updates**
- **Original**: Nested loops accumulate sums and manually divide by counts (9.7% of runtime)
- **Optimized**: Boolean masking and a `mean()` operation
  ```python
  mask = labels == j
  new_centroids[j] = X[mask].mean(axis=0)
  ```
  NumPy's optimized C implementations handle the aggregation much faster than Python loops; the three pieces are combined in the sketch below.
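
Putting the three snippets together, here is a minimal sketch of what the vectorized loop plausibly looks like. This is a reconstruction from the fragments above, not the exact PR diff; in particular, the empty-cluster guard and the convergence check are assumptions.

```python
import numpy as np


def kmeans_clustering_sketch(X, k, max_iter=100):
    """Sketch of the vectorized k-means described above (reconstruction)."""
    n_samples = X.shape[0]
    # Initialize centroids from k distinct samples; raises ValueError when
    # k > n_samples, matching the regression test below.
    indices = np.random.choice(n_samples, size=k, replace=False)
    centroids = X[indices].astype(float)

    for _ in range(max_iter):
        # (n_samples, k, n_features) array of differences via broadcasting
        differences = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        distances = np.linalg.norm(differences, axis=2)
        labels = np.argmin(distances, axis=1)  # nearest centroid per sample

        new_centroids = centroids.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():  # assumption: keep the old centroid for empty clusters
                new_centroids[j] = X[mask].mean(axis=0)

        if np.allclose(new_centroids, centroids):  # assumption: early stopping
            break
        centroids = new_centroids

    # Note: with max_iter=0 the loop never runs and `labels` is unbound,
    # consistent with the UnboundLocalError regression test below.
    return centroids, labels
```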

## Performance Impact

The optimization excels with larger datasets:

- **Small data** (single points, k=1): 19-29% slower due to NumPy overhead
- **Medium data** (50-100 samples): 300-1000% faster
- **Large data** (500+ samples, high dimensions): 2800-9700% faster (a rough harness for reproducing this pattern is sketched below)
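
For context, a harness along these lines could reproduce that scaling pattern. The helper name, the size/k choices, and the repeat count are illustrative assumptions, not the codeflash measurement setup; the import path is taken from the generated tests.

```python
import time

import numpy as np

from src.statistics.clustering import kmeans_clustering


def best_time(X, k, repeats=5, max_iter=50):
    """Best-of-N wall-clock time, loosely mirroring the 'best of 145 runs' above."""
    best = float("inf")
    for _ in range(repeats):
        np.random.seed(0)  # fix initialization so every run does identical work
        start = time.perf_counter()
        kmeans_clustering(X, k, max_iter=max_iter)
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.RandomState(0)
for n, d, k in [(1, 2, 1), (100, 4, 3), (500, 50, 5)]:  # small / medium / large
    X = rng.randn(n, d)
    print(f"n={n:4d}, d={d:3d}, k={k}: {best_time(X, k) * 1e3:.2f} ms")
```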

The line profiler shows the optimized version spends most of its time (38.5%) on the `X[mask].mean()` operation and distance calculations (23.2%), both of which are highly optimized C operations. The original spent 60%+ of its time in raw Python loops for distance calculations.

This optimization is particularly valuable in hot paths where k-means runs repeatedly (hyperparameter tuning, batch processing) or with high-dimensional data, as evidenced by the massive speedups in large-scale test cases.
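
For contrast, the "triple-nested loops" those percentages describe presumably had roughly this shape. This is a reconstruction from the profile breakdown, not the actual pre-optimization source (which is in the PR diff):

```python
import math

import numpy as np


def kmeans_loops(X, k, max_iter=100):
    """Reconstruction of the loop-based version described above (illustrative)."""
    n_samples, n_features = X.shape
    indices = np.random.choice(n_samples, size=k, replace=False)
    centroids = X[indices].astype(float)

    for _ in range(max_iter):
        labels = np.zeros(n_samples, dtype=int)
        for i in range(n_samples):
            best_dist, best_j = float("inf"), 0
            for j in range(k):
                dist = 0.0
                for f in range(n_features):  # inner feature loop (~24% of runtime)
                    diff = X[i, f] - centroids[j, f]
                    dist += diff * diff      # distance accumulation (~36%)
                dist = math.sqrt(dist)       # per-pair sqrt calls (~11%)
                if dist < best_dist:
                    best_dist, best_j = dist, j
            labels[i] = best_j

        # Manual sum-and-divide centroid update (~10% of runtime)
        sums = np.zeros((k, n_features))
        counts = np.zeros(k)
        for i in range(n_samples):
            sums[labels[i]] += X[i]
            counts[labels[i]] += 1
        for j in range(k):
            if counts[j] > 0:
                centroids[j] = sums[j] / counts[j]

    return centroids, labels
```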

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests

```python
import numpy as np

# imports
import pytest  # used for our unit tests
from src.statistics.clustering import kmeans_clustering

# unit tests

# Basic functionality tests


def test_basic_two_clusters_separable():
    """
    Basic case: two well-separated 2D clusters.
    Expectation: kmeans finds centroids near the true centers and assigns labels accordingly.
    """
    # Create two clusters centered at (0,0) and (10,10)
    rng = np.random.RandomState(42)
    cluster1 = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
    cluster2 = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
    X = np.vstack([cluster1, cluster2])

    # Set global numpy seed to make centroid initialization deterministic for the algorithm
    np.random.seed(0)
    centroids, labels = kmeans_clustering(
        X, k=2, max_iter=100
    )  # 800μs -> 86.5μs (825% faster)

    # Determine which centroid corresponds to which true center by distance
    true_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
    # For each found centroid, find distance to nearest true center
    dists_to_true = np.array(
        [np.min(np.sqrt(np.sum((c - true_centers) ** 2, axis=1))) for c in centroids]
    )
    # Each found centroid should land near one of the true centers
    assert np.all(dists_to_true < 1.0)

    # Labels should partition samples roughly into two groups near 50 each
    unique, counts = np.unique(labels, return_counts=True)
    assert len(unique) == 2
    assert np.all(counts >= 30)


def test_single_point_k_equals_one():
    """
    Edge/basic case: single sample with k=1.
    Expectation: centroid equals the point and label is 0.
    """
    X = np.array([[3.14, -2.72]])
    np.random.seed(1)
    centroids, labels = kmeans_clustering(
        X, k=1, max_iter=10
    )  # 25.4μs -> 34.4μs (26.1% slower)
    assert np.allclose(centroids[0], X[0])
    assert labels[0] == 0


def test_deterministic_with_fixed_seed():
    """
    Determinism: with a fixed random seed, multiple runs should return identical results.
    """
    rng = np.random.RandomState(0)
    X = rng.normal(size=(60, 3))
    np.random.seed(123)  # seed for the algorithm's initialization
    cent1, lab1 = kmeans_clustering(
        X, k=3, max_iter=50
    )  # 3.23ms -> 318μs (915% faster)
    np.random.seed(123)  # reset seed to same value
    cent2, lab2 = kmeans_clustering(
        X, k=3, max_iter=50
    )  # 3.22ms -> 299μs (975% faster)
    assert np.array_equal(cent1, cent2)
    assert np.array_equal(lab1, lab2)


# Edge cases


def test_k_greater_than_number_of_samples_raises_value_error():
    """
    Edge case: k > n_samples should raise an error from np.random.choice.
    Expectation: ValueError is raised because replace=False cannot pick more distinct indices than samples.
    """
    X = np.zeros((3, 2))
    np.random.seed(0)
    with pytest.raises(ValueError):
        # k=5 > 3 should cause np.random.choice to raise ValueError
        kmeans_clustering(X, k=5)  # 8.96μs -> 7.75μs (15.6% faster)


def test_max_iter_zero_raises_unboundlocalerror():
    """
    Edge case: max_iter = 0 will cause the local variable 'labels' to be referenced before assignment.
    Expectation: UnboundLocalError is raised.
    """
    X = np.array([[0.0, 0.0], [1.0, 1.0]])
    np.random.seed(0)
    # When max_iter is 0 the function's inner loop never executes and 'labels' is not assigned.
    with pytest.raises(UnboundLocalError):
        kmeans_clustering(X, k=2, max_iter=0)  # 14.2μs -> 12.2μs (16.4% faster)


def test_all_identical_points_with_k_more_than_one():
    """
    Edge case: all data points identical, but k > 1.
    Expectation: function still returns centroids and labels; at least one centroid equals the data point,
                 all samples are assigned to some valid label (often 0).
    """
    X = np.tile(np.array([[1.0, 1.0]]), (10, 1))  # 10 identical points
    np.random.seed(2)
    centroids, labels = kmeans_clustering(
        X, k=3, max_iter=50
    )  # 132μs -> 68.0μs (95.3% faster)

    # At least one centroid should equal the identical point
    any_equal = any(np.allclose(row, np.array([1.0, 1.0])) for row in centroids)
    assert any_equal
    # Every sample must receive a valid label
    assert np.all((labels >= 0) & (labels < 3))


def test_zero_feature_dimension():
    """
    Edge case: n_features == 0 (X has shape (n_samples, 0)).
    Expectation: algorithm should not crash; centroids have shape (k, 0) and labels length n_samples.
    """
    X = np.empty((5, 0))  # five samples, zero features
    np.random.seed(5)
    centroids, labels = kmeans_clustering(
        X, k=2, max_iter=10
    )  # 35.2μs -> 38.8μs (9.14% slower)
    assert centroids.shape == (2, 0)
    assert labels.shape == (5,)


# Large-scale tests (kept under 1000 elements)


def test_large_scale_cluster_recovery():
    """
    Large-scale test: generate many samples around multiple centers and confirm kmeans recovers centers roughly.
    - Use 500 samples and 5 clusters (keeps data under 1000 elements).
    - Use a controlled random seed for determinism.
    - Accept moderate tolerance since clustering may not be exact.
    """
    rng = np.random.RandomState(12345)
    true_centers = np.array(
        [
            [0.0, 0.0],
            [5.0, 0.0],
            [0.0, 5.0],
            [5.0, 5.0],
            [2.5, 2.5],
        ]
    )
    points_per_cluster = 100  # 5 * 100 = 500 total samples
    X_parts = []
    for c in true_centers:
        # small Gaussian spread around each center
        X_parts.append(rng.normal(loc=c, scale=0.4, size=(points_per_cluster, 2)))
    X = np.vstack(X_parts)

    np.random.seed(42)
    centroids, labels = kmeans_clustering(
        X, k=5, max_iter=200
    )  # 19.9ms -> 672μs (2854% faster)

    # For each found centroid, measure distance to nearest true center
    nearest_true_dists = []
    for c in centroids:
        dists = np.sqrt(np.sum((true_centers - c) ** 2, axis=1))
        nearest_true_dists.append(np.min(dists))
    nearest_true_dists = np.array(nearest_true_dists)

    # On average the centroids should be close to the true centers (mean distance < 1.0)
    mean_dist = float(np.mean(nearest_true_dists))
    assert mean_dist < 1.0


def test_output_shapes_and_label_range_for_random_data():
    """
    Sanity check on random data to ensure consistent output shapes and label ranges.
    """
    rng = np.random.RandomState(0)
    X = rng.uniform(-10, 10, size=(200, 4))
    np.random.seed(7)
    centroids, labels = kmeans_clustering(
        X, k=4, max_iter=100
    )  # 35.7ms -> 1.26ms (2729% faster)
    assert centroids.shape == (4, 4)
    assert labels.shape == (200,)
    assert np.all((labels >= 0) & (labels < 4))


# Extra: ensure algorithm handles float and integer input types consistently


def test_integer_input_casting_behavior():
    """
    The algorithm should work when X is of integer dtype. It should perform computations in float via numpy promotion.
    """
    X_int = np.array([[0, 0], [10, 10], [10, 11], [9, 9]], dtype=int)
    np.random.seed(10)
    centroids, labels = kmeans_clustering(
        X_int, k=2, max_iter=50
    )  # 69.6μs -> 74.3μs (6.33% slower)
    assert centroids.shape == (2, 2)
    assert np.all((labels >= 0) & (labels < 2))


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np
import pytest
from src.statistics.clustering import kmeans_clustering


# Basic Test Cases
def test_basic_clustering_with_simple_data():
    """Test basic clustering with simple, well-separated data points."""
    # Create simple data with 2 clear clusters
    X = np.array([[1.0, 1.0], [1.5, 1.5], [8.0, 8.0], [8.5, 8.5]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 61.0μs -> 68.0μs (10.4% slower)


def test_clustering_with_single_cluster():
    """Test clustering when k=1 (all points in one cluster)."""
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    k = 1
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 43.4μs -> 54.0μs (19.6% slower)


def test_clustering_with_identical_points():
    """Test clustering when all points are identical."""
    X = np.array([[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 50.7μs -> 58.6μs (13.6% slower)


def test_clustering_with_k_equals_n_samples():
    """Test clustering when k equals the number of samples."""
    X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    k = 3
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 39.2μs -> 50.5μs (22.3% slower)
    # All clusters should be represented
    unique_labels = np.unique(labels)
    assert len(unique_labels) == 3


def test_clustering_returns_valid_labels():
    """Test that labels are within valid range."""
    X = np.random.RandomState(42).randn(20, 3)
    k = 4
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 889μs -> 222μs (300% faster)


def test_clustering_converges():
    """Test that clustering converges with sufficient iterations."""
    X = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.1, 7.9]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 60.0μs -> 67.5μs (11.2% slower)

    # Run again with same data to check consistency
    centroids2, labels2 = kmeans_clustering(
        X, k, max_iter=100
    )  # 53.8μs -> 56.4μs (4.66% slower)


# Edge Test Cases
def test_clustering_with_single_sample():
    """Test clustering with only one data point."""
    X = np.array([[1.0, 2.0, 3.0]])
    k = 1
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 23.9μs -> 33.8μs (29.3% slower)


def test_clustering_with_one_iteration():
    """Test clustering with max_iter=1."""
    X = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=1
    )  # 39.5μs -> 48.5μs (18.6% slower)


def test_clustering_with_high_dimensional_data():
    """Test clustering with high-dimensional data (many features)."""
    X = np.random.RandomState(42).randn(10, 50)
    k = 3
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 1.47ms -> 95.5μs (1438% faster)


def test_clustering_with_negative_values():
    """Test clustering with negative coordinate values."""
    X = np.array([[-5.0, -3.0], [-4.0, -2.0], [5.0, 3.0], [4.0, 2.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 61.8μs -> 69.0μs (10.5% slower)


def test_clustering_with_very_small_values():
    """Test clustering with very small coordinate values."""
    X = np.array([[1e-6, 1e-6], [2e-6, 2e-6], [1e-5, 1e-5]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 70.2μs -> 91.6μs (23.3% slower)


def test_clustering_with_very_large_values():
    """Test clustering with very large coordinate values."""
    X = np.array([[1e6, 1e6], [1.1e6, 1.1e6], [2e6, 2e6]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 69.5μs -> 90.4μs (23.1% slower)


def test_clustering_with_mixed_magnitude_values():
    """Test clustering with mixed magnitude values."""
    X = np.array([[0.001, 1000.0], [0.002, 1001.0], [1000.0, 0.001]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 69.6μs -> 90.7μs (23.3% slower)


def test_clustering_with_repeated_identical_rows():
    """Test clustering with many repeated identical rows."""
    X = np.array([[1.0, 2.0]] * 5 + [[9.0, 10.0]] * 5)
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 61.5μs -> 43.9μs (40.1% faster)


def test_clustering_with_outliers():
    """Test clustering with outliers far from main clusters."""
    X = np.array([[1.0, 1.0], [1.5, 1.5], [2.0, 2.0], [100.0, 100.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 37.8μs -> 42.3μs (10.7% slower)


def test_clustering_deterministic_with_same_seed():
    """Test that clustering gives same results with same random seed."""
    np.random.seed(123)
    X = np.random.randn(15, 3)
    k = 3

    np.random.seed(123)
    centroids1, labels1 = kmeans_clustering(
        X, k, max_iter=100
    )  # 231μs -> 86.7μs (167% faster)

    np.random.seed(123)
    centroids2, labels2 = kmeans_clustering(
        X, k, max_iter=100
    )  # 225μs -> 73.9μs (205% faster)
    assert np.array_equal(centroids1, centroids2)
    assert np.array_equal(labels1, labels2)


# Large Scale Test Cases
def test_clustering_with_large_dataset():
    """Test clustering with a large number of samples."""
    # Create dataset with 500 samples and 5 features
    np.random.seed(42)
    X = np.random.randn(500, 5)
    k = 5
    centroids, labels = kmeans_clustering(
        X, k, max_iter=50
    )  # 124ms -> 2.70ms (4510% faster)


def test_clustering_with_many_clusters():
    """Test clustering with many clusters."""
    # Create dataset with 300 samples and request 20 clusters
    np.random.seed(42)
    X = np.random.randn(300, 4)
    k = 20
    centroids, labels = kmeans_clustering(
        X, k, max_iter=50
    )  # 150ms -> 3.94ms (3735% faster)


def test_clustering_performance_with_iterations():
    """Test that clustering completes within reasonable time with many iterations."""
    # Create moderate-sized dataset
    np.random.seed(42)
    X = np.random.randn(200, 5)
    k = 8

    # Should complete with 100 iterations
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 33.6ms -> 1.00ms (3254% faster)


def test_clustering_with_many_features():
    """Test clustering with many features (high dimensionality)."""
    # Create dataset with many features
    np.random.seed(42)
    X = np.random.randn(100, 50)
    k = 5
    centroids, labels = kmeans_clustering(
        X, k, max_iter=50
    )  # 41.5ms -> 423μs (9698% faster)


def test_clustering_stability_across_runs():
    """Test that clustering produces stable results across multiple runs."""
    np.random.seed(42)
    X = np.random.randn(100, 4)
    k = 5

    # Run clustering multiple times with different seeds
    all_centroid_shapes = []
    all_label_shapes = []

    for seed in range(5):
        np.random.seed(seed)
        centroids, labels = kmeans_clustering(
            X, k, max_iter=100
        )  # 47.8ms -> 2.53ms (1790% faster)
        all_centroid_shapes.append(centroids.shape)
        all_label_shapes.append(labels.shape)

    # All runs should produce the same shapes
    assert all(shape == (k, 4) for shape in all_centroid_shapes)
    assert all(shape == (100,) for shape in all_label_shapes)


def test_clustering_output_types():
    """Test that clustering returns correct output types."""
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 53.9μs -> 68.1μs (20.9% slower)


def test_clustering_centroid_values_in_data_range():
    """Test that centroids fall within or near the data range."""
    X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 38.9μs -> 42.9μs (9.33% slower)

    # All centroid coordinates should be within the data range, allowing some
    # margin for convergence effects
    X_min = np.min(X)
    X_max = np.max(X)
    margin = (X_max - X_min) * 0.5
    assert np.all(centroids >= X_min - margin)
    assert np.all(centroids <= X_max + margin)


def test_clustering_with_float32_data():
    """Test clustering with float32 data type."""
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 63.2μs -> 76.0μs (16.9% slower)


def test_clustering_with_float64_data():
    """Test clustering with float64 data type."""
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float64)
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 53.5μs -> 67.5μs (20.7% slower)


def test_clustering_labels_cover_expected_clusters():
    """Test that clustering uses labels up to k when possible."""
    X = np.array([[1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [11.0, 11.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 60.3μs -> 68.0μs (11.3% slower)

    # Number of unique labels should be between 1 and k
    unique_labels = np.unique(labels)
    assert 1 <= len(unique_labels) <= k


def test_clustering_all_samples_assigned():
    """Test that all samples are assigned to a cluster."""
    X = np.random.RandomState(42).randn(50, 3)
    k = 5
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 2.55ms -> 287μs (787% faster)


def test_clustering_with_very_few_features():
    """Test clustering with single feature (1D data)."""
    X = np.array([[1.0], [2.0], [3.0], [10.0]])
    k = 2
    centroids, labels = kmeans_clustering(
        X, k, max_iter=100
    )  # 35.2μs -> 46.5μs (24.3% slower)


def test_clustering_reproducibility_with_numpy_seed():
    """Test reproducibility by setting numpy random seed."""
    np.random.seed(999)
    X = np.random.randn(50, 3)

    np.random.seed(999)
    centroids1, labels1 = kmeans_clustering(
        X, k=3, max_iter=50
    )  # 3.71ms -> 414μs (795% faster)

    np.random.seed(999)
    centroids2, labels2 = kmeans_clustering(
        X, k=3, max_iter=50
    )  # 3.70ms -> 392μs (844% faster)
    assert np.array_equal(centroids1, centroids2)
    assert np.array_equal(labels1, labels2)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from src.statistics.clustering import kmeans_clustering
```

To edit these changes, run `git checkout codeflash/optimize-kmeans_clustering-mkafkgr5` and push.


@codeflash-ai codeflash-ai bot requested a review from KRRT7 January 12, 2026 00:34
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 12, 2026