@codeflash-ai codeflash-ai bot commented Oct 30, 2025

📄 16% (0.16x) speedup for get_pandas_reader in nvflare/app_opt/sklearn/data_loader.py

⏱️ Runtime : 29.6 milliseconds → 25.6 milliseconds (best of 35 runs)

📝 Explanation and details

The optimization achieves a roughly 16% speedup (29.6 ms down to 25.6 ms) by eliminating function call chains and import overhead.

Key optimizations:

  1. Removed function call indirection: The original code had a 3-function chain (get_pandas_reader → get_file_format → get_file_ext + get_ext_format). The optimized version inlines all file extension parsing logic directly in get_pandas_reader, eliminating two function calls per invocation.

  2. Direct pathlib usage: Instead of calling helper functions that wrap pathlib.Path(input_path).suffix, the optimized code calls this directly, avoiding function call overhead. The line profiler shows the original get_file_format took 90.7% of its time in get_file_ext().

  3. Localized pd_readers dictionary: Moving the pd_readers dictionary inside the function eliminates global variable lookups and allows the Python interpreter to optimize local variable access.

  4. Reduced import overhead: The original code imported from nvflare.app_common.utils.file_utils on every call. The optimized version uses direct imports of pathlib and pandas, which are more efficient.
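The inlining described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual nvflare code: the helper bodies, the `_chained` and `_inlined` function names, the error message, and the stand-in readers are all hypothetical, reconstructed only from the behavior the tests in this report describe (a missing or blank extension falls back to csv, and an unknown extension raises `ValueError`).

```python
import pathlib


# Stand-in readers (assumption) so the sketch runs without pandas installed.
def fake_read_csv(*args, **kwargs):
    return "csv_reader"


def fake_read_excel(*args, **kwargs):
    return "excel_reader"


PD_READERS = {"csv": fake_read_csv, "xls": fake_read_excel, "xlsx": fake_read_excel}


# --- Chained shape: hypothetical reconstruction of the helper chain ---
def get_file_ext(input_path: str) -> str:
    return pathlib.Path(input_path).suffix


def get_ext_format(ext: str) -> str:
    # Drop the leading dot and surrounding whitespace; fall back to csv.
    return ext[1:].strip() or "csv"


def get_file_format(input_path: str) -> str:
    return get_ext_format(get_file_ext(input_path))


def get_pandas_reader_chained(data_path: str):
    file_format = get_file_format(data_path)
    reader = PD_READERS.get(file_format)
    if reader is None:
        raise ValueError(f"no pandas reader for file format {file_format!r}")
    return reader


# --- Inlined shape: same logic, no helper calls, local lookup table ---
def get_pandas_reader_inlined(data_path: str):
    pd_readers = {"csv": fake_read_csv, "xls": fake_read_excel, "xlsx": fake_read_excel}
    file_format = pathlib.Path(data_path).suffix[1:].strip() or "csv"
    reader = pd_readers.get(file_format)
    if reader is None:
        raise ValueError(f"no pandas reader for file format {file_format!r}")
    return reader
```

Both variants return the same reader for the same path; the inlined one trades two extra Python-level calls for a slightly denser function body.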

Performance benefits by test case type:

  • Single file operations: 7-30% faster due to eliminated function calls
  • Batch operations: 15-16% faster, showing consistent overhead reduction
  • Edge cases (empty strings, no extensions): 10-19% faster, particularly benefiting from streamlined default handling

The optimization is most effective for workloads with frequent file format detection, where the eliminated function call overhead compounds significantly.
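For readers who want to sanity-check figures like these, the sketch below shows one way to time two variants with the standard-library `timeit` module and to compute a percentage speedup. It assumes the speedup is taken relative to the optimized runtime, which is consistent with the headline numbers in this report; the helper names and the two toy functions are hypothetical and are not part of nvflare or Codeflash.

```python
import pathlib
import timeit


def speedup_pct(original: float, optimized: float) -> float:
    """Percent speedup, measured relative to the optimized runtime."""
    return (original - optimized) / optimized * 100.0


def best_of(func, arg, repeat: int = 5, number: int = 10_000) -> float:
    """Best per-call time in seconds over `repeat` timing rounds."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(timings) / number


# Two toy variants of the same suffix lookup (hypothetical, for timing only).
def _get_ext(path: str) -> str:
    return pathlib.Path(path).suffix


def ext_via_helper(path: str) -> str:
    return _get_ext(path)


def ext_inlined(path: str) -> str:
    return pathlib.Path(path).suffix


if __name__ == "__main__":
    # Headline runtimes from this report: 29.6 ms before, 25.6 ms after.
    print(f"headline speedup: {speedup_pct(29.6, 25.6):.1f}%")
    slow = best_of(ext_via_helper, "data.csv")
    fast = best_of(ext_inlined, "data.csv")
    print(f"helper: {slow * 1e6:.2f} us  inlined: {fast * 1e6:.2f} us")
```

Taking the minimum over several rounds reduces noise from other processes; differences of a microsecond or less per call can still vary noticeably between machines and Python versions.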

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 8885 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pathlib

import pandas as pd
import pytest  # used for our unit tests
from nvflare.app_opt.sklearn.data_loader import get_pandas_reader

# Extension-to-reader mapping the tests expect (copied here for self-containment)
pd_readers = {
    "csv": pd.read_csv,
    "xls": pd.read_excel,
    "xlsx": pd.read_excel,
}

# unit tests

# ----------- Basic Test Cases -----------

def test_csv_extension_lowercase():
    # Should return pd.read_csv for .csv extension
    codeflash_output = get_pandas_reader("data.csv"); reader = codeflash_output # 9.46μs -> 8.40μs (12.5% faster)


def test_xls_extension():
    # Should return pd.read_excel for .xls extension
    codeflash_output = get_pandas_reader("data.xls"); reader = codeflash_output # 13.3μs -> 11.7μs (13.0% faster)

def test_xlsx_extension():
    # Should return pd.read_excel for .xlsx extension
    codeflash_output = get_pandas_reader("data.xlsx"); reader = codeflash_output # 10.5μs -> 9.43μs (11.4% faster)

def test_extension_with_dot():
    # Should handle extension with dot correctly
    codeflash_output = get_pandas_reader("data..csv"); reader = codeflash_output # 10.1μs -> 8.98μs (12.6% faster)

def test_file_with_multiple_dots():
    # Should handle files with multiple dots correctly
    codeflash_output = get_pandas_reader("archive.backup.data.csv"); reader = codeflash_output # 10.1μs -> 9.10μs (11.1% faster)

def test_file_with_no_extension_defaults_to_csv():
    # Should default to csv if no extension
    codeflash_output = get_pandas_reader("data"); reader = codeflash_output # 8.74μs -> 7.42μs (17.8% faster)

def test_file_with_trailing_dot():
    # Should default to csv if trailing dot (no extension)
    codeflash_output = get_pandas_reader("data."); reader = codeflash_output # 9.07μs -> 8.23μs (10.2% faster)

def test_file_with_spaces_in_extension():
    # Should default to csv if extension is only spaces
    codeflash_output = get_pandas_reader("data.   "); reader = codeflash_output # 9.53μs -> 8.83μs (7.95% faster)

# ----------- Edge Test Cases -----------

def test_file_with_unrecognized_extension_raises():
    # Should raise ValueError for unknown extension
    with pytest.raises(ValueError):
        get_pandas_reader("data.unknown") # 10.8μs -> 9.44μs (14.2% faster)



def test_file_with_leading_dot_in_name():
    # Should handle hidden files with extension
    codeflash_output = get_pandas_reader(".hidden.csv"); reader = codeflash_output # 13.3μs -> 11.4μs (16.6% faster)

def test_file_with_leading_dot_and_no_extension():
    # Should default to csv for hidden files with no extension
    codeflash_output = get_pandas_reader(".hidden"); reader = codeflash_output # 9.58μs -> 8.67μs (10.6% faster)

def test_file_with_multiple_consecutive_dots():
    # Should handle multiple consecutive dots
    codeflash_output = get_pandas_reader("data...csv"); reader = codeflash_output # 9.97μs -> 9.24μs (7.91% faster)

def test_file_with_empty_string():
    # Should default to csv if input path is empty string
    codeflash_output = get_pandas_reader(""); reader = codeflash_output # 7.80μs -> 6.65μs (17.2% faster)


def test_file_with_dot_only():
    # Should default to csv if input path is "."
    codeflash_output = get_pandas_reader("."); reader = codeflash_output # 13.3μs -> 11.6μs (14.9% faster)

def test_file_with_pathlib_path_object():
    # Should handle a path built with pathlib.Path (converted to str before the call)
    path = pathlib.Path("data.csv")
    codeflash_output = get_pandas_reader(str(path)); reader = codeflash_output # 8.45μs -> 7.64μs (10.6% faster)


def test_large_number_of_csv_files():
    # Test scalability by running get_pandas_reader on many file names
    for i in range(1000):
        fname = f"file_{i}.csv"
        codeflash_output = get_pandas_reader(fname); reader = codeflash_output # 3.22ms -> 2.79ms (15.7% faster)

def test_large_number_of_xlsx_files():
    # Test scalability by running get_pandas_reader on many xlsx file names
    for i in range(1000):
        fname = f"file_{i}.xlsx"
        codeflash_output = get_pandas_reader(fname); reader = codeflash_output # 3.21ms -> 2.76ms (16.2% faster)

def test_large_number_of_files_with_mixed_extensions():
    # Test scalability with mixed extensions
    for i in range(500):
        fname_csv = f"file_{i}.csv"
        fname_xls = f"file_{i}.xls"
        fname_xlsx = f"file_{i}.xlsx"
        codeflash_output = get_pandas_reader(fname_csv) # 1.61ms -> 1.39ms (15.7% faster)
        codeflash_output = get_pandas_reader(fname_xls)
        codeflash_output = get_pandas_reader(fname_xlsx) # 1.60ms -> 1.38ms (15.5% faster)

def test_large_number_of_files_with_unrecognized_extensions():
    # Test that all files with unknown extension raise ValueError
    for i in range(100):
        fname = f"file_{i}.dat"
        with pytest.raises(ValueError):
            get_pandas_reader(fname)

# ----------- Mutation Testing Guards -----------

def test_mutation_guard_csv_vs_excel():
    # Ensure .csv does not return pd.read_excel and vice versa
    codeflash_output = get_pandas_reader("test.csv") # 9.74μs -> 9.01μs (8.16% faster)
    codeflash_output = get_pandas_reader("test.xlsx") # 4.98μs -> 4.48μs (11.3% faster)

def test_mutation_guard_value_error_message():
    # Ensure ValueError message contains the file format
    with pytest.raises(ValueError) as excinfo:
        get_pandas_reader("test.unknown") # 9.56μs -> 8.69μs (10.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pathlib

import pytest  # used for our unit tests
from nvflare.app_opt.sklearn.data_loader import get_pandas_reader


# Simulate pandas readers for test purposes
def fake_read_csv(*args, **kwargs):
    return "csv_reader"


def fake_read_excel(*args, **kwargs):
    return "excel_reader"


pd_readers = {
    "csv": fake_read_csv,
    "xls": fake_read_excel,
    "xlsx": fake_read_excel,
}

# --------------------
# UNIT TESTS BEGIN HERE
# --------------------

# 1. Basic Test Cases

def test_csv_extension_lowercase():
    # Should return csv reader for .csv file
    codeflash_output = get_pandas_reader("data.csv"); reader = codeflash_output # 10.3μs -> 9.08μs (13.2% faster)


def test_xls_extension():
    # Should return excel reader for .xls file
    codeflash_output = get_pandas_reader("data.xls"); reader = codeflash_output # 13.9μs -> 10.7μs (30.1% faster)

def test_xlsx_extension():
    # Should return excel reader for .xlsx file
    codeflash_output = get_pandas_reader("data.xlsx"); reader = codeflash_output # 10.6μs -> 9.74μs (8.37% faster)

def test_csv_extension_with_multiple_dots():
    # Should return csv reader for file with multiple dots
    codeflash_output = get_pandas_reader("archive.backup.data.csv"); reader = codeflash_output # 10.2μs -> 8.80μs (15.5% faster)

def test_csv_extension_with_leading_dot():
    # pathlib treats ".csv" as a hidden file with no suffix, so this should fall back to the csv default
    codeflash_output = get_pandas_reader(".csv"); reader = codeflash_output # 9.46μs -> 8.34μs (13.5% faster)

def test_file_with_no_extension():
    # Should default to csv reader if no extension
    codeflash_output = get_pandas_reader("data"); reader = codeflash_output # 8.53μs -> 7.95μs (7.28% faster)

def test_file_with_empty_string():
    # Should default to csv reader if empty string is passed
    codeflash_output = get_pandas_reader(""); reader = codeflash_output # 7.98μs -> 6.74μs (18.4% faster)

# 2. Edge Test Cases

def test_file_with_whitespace_extension():
    # Should default to csv reader if extension is whitespace
    codeflash_output = get_pandas_reader("data. "); reader = codeflash_output # 9.93μs -> 8.93μs (11.2% faster)

def test_file_with_dot_only():
    # Should default to csv reader if only a dot is present
    codeflash_output = get_pandas_reader("."); reader = codeflash_output # 8.69μs -> 7.47μs (16.3% faster)

def test_file_with_unknown_extension():
    # Should raise ValueError for unknown extension
    with pytest.raises(ValueError) as excinfo:
        get_pandas_reader("data.unknown") # 11.4μs -> 9.80μs (16.2% faster)



def test_file_with_mixed_case_extension():
    # Should not match, so should raise ValueError
    with pytest.raises(ValueError):
        get_pandas_reader("data.CsV") # 12.9μs -> 12.0μs (7.97% faster)

def test_file_with_multiple_extensions():
    # Should use last extension
    codeflash_output = get_pandas_reader("data.backup.csv"); reader = codeflash_output # 11.1μs -> 9.35μs (18.4% faster)

def test_file_with_hidden_file():
    # Should default to csv reader if hidden file with no extension
    codeflash_output = get_pandas_reader(".hiddenfile"); reader = codeflash_output # 9.69μs -> 8.12μs (19.3% faster)

def test_file_with_path_object():
    # Should work with a path built via pathlib.Path (converted to str before the call)
    codeflash_output = get_pandas_reader(str(pathlib.Path("data.csv"))); reader = codeflash_output # 7.87μs -> 6.96μs (13.1% faster)

def test_file_with_spaces_in_filename():
    # Should work with spaces in filename
    codeflash_output = get_pandas_reader("my data file.csv"); reader = codeflash_output # 9.69μs -> 9.29μs (4.20% faster)


def test_file_with_dot_in_folder_name():
    # Should handle dot in folder name, not just extension
    codeflash_output = get_pandas_reader("folder.name/data.csv"); reader = codeflash_output # 12.8μs -> 12.8μs (0.031% faster)

def test_file_with_multiple_dots_and_weird_extension():
    # Should raise ValueError for unknown extension
    with pytest.raises(ValueError):
        get_pandas_reader("data.backup.weird") # 11.5μs -> 10.4μs (10.4% faster)


def test_large_number_of_csv_files():
    # Should work for up to 1000 different csv file names
    for i in range(1000):
        filename = f"file_{i}.csv"
        codeflash_output = get_pandas_reader(filename); reader = codeflash_output # 3.21ms -> 2.77ms (15.7% faster)

def test_large_number_of_xlsx_files():
    # Should work for up to 1000 different xlsx file names
    for i in range(1000):
        filename = f"file_{i}.xlsx"
        codeflash_output = get_pandas_reader(filename); reader = codeflash_output # 3.19ms -> 2.76ms (15.8% faster)

def test_large_number_of_unknown_files():
    # Should raise ValueError for each unknown extension
    for i in range(1000):
        filename = f"file_{i}.abc"
        with pytest.raises(ValueError):
            get_pandas_reader(filename)

def test_large_mixed_case_extensions():
    # Should raise ValueError for each mixed-case extension
    for i in range(1000):
        filename = f"file_{i}.CsV"
        with pytest.raises(ValueError):
            get_pandas_reader(filename)

def test_large_scale_with_varied_extensions():
    # Should work for a mix of supported and unsupported extensions
    for i in range(250):
        codeflash_output = get_pandas_reader(f"file_{i}.csv") # 844μs -> 733μs (15.2% faster)
        codeflash_output = get_pandas_reader(f"file_{i}.xls")
        codeflash_output = get_pandas_reader(f"file_{i}.xlsx") # 811μs -> 698μs (16.2% faster)
        with pytest.raises(ValueError):
            get_pandas_reader(f"file_{i}.doc")
        with pytest.raises(ValueError):
            get_pandas_reader(f"file_{i}.pdf") # 827μs -> 711μs (16.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
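The comparison that the note above refers to can be pictured with a small harness like the one below. This is only a sketch of the general idea, not Codeflash's actual mechanism: `capture_outcome`, `same_behavior`, and the two toy functions are hypothetical names introduced for illustration. Each call is reduced to either its return value or the type of exception it raised, and the two implementations must agree on every input.

```python
def capture_outcome(func, *args):
    """Return ("ok", value) or ("raised", exception type) for one call."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("raised", type(exc))


def same_behavior(original, optimized, inputs):
    """True when both callables give the same outcome on every input."""
    return all(
        capture_outcome(original, x) == capture_outcome(optimized, x)
        for x in inputs
    )


SUPPORTED = ("csv", "xls", "xlsx")


# Hypothetical toy pair: two ways to derive a format string from a file name.
def fmt_original(name: str) -> str:
    ext = name.rsplit(".", 1)[-1] if "." in name else "csv"
    if ext not in SUPPORTED:
        raise ValueError(ext)
    return ext


def fmt_optimized(name: str) -> str:
    _, dot, ext = name.rpartition(".")
    ext = ext if dot else "csv"
    if ext not in SUPPORTED:
        raise ValueError(ext)
    return ext


if __name__ == "__main__":
    names = ["data.csv", "data", "data.doc", "a.b.xlsx"]
    print(same_behavior(fmt_original, fmt_optimized, names))
```

Comparing exception types as well as return values matters here, because several of the generated tests exercise the `ValueError` path rather than a successful lookup.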

To edit these changes, git checkout codeflash/optimize-get_pandas_reader-mhcszf46 and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 30, 2025 02:26
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Oct 30, 2025