Sequential imputation runs out of memory with many variables #96

@MaxGhenis

Description

Problem

When using models.QRF with sequential imputation for a large number of variables (e.g., 84 variables), the process runs out of memory on CI runners. By the time it reaches variable 70+, there are 80+ features being used as predictors, causing memory issues.
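To illustrate the growth described above, here is a minimal sketch of how the predictor set accumulates during sequential imputation (variable names and the loop body are illustrative, not the library's actual internals):

```python
# Sketch: each imputed variable joins the predictor set for all later fits.
# Names here are hypothetical, chosen only to illustrate the scaling.
base_predictors = ["age", "income", "region"]          # hypothetical base features
variables_to_impute = [f"var_{i}" for i in range(84)]  # e.g., 84 target variables

predictors = list(base_predictors)
for var in variables_to_impute:
    # By variable 70+, `predictors` holds 70+ previously imputed variables
    # on top of the base features, so each fit sees an ever-wider matrix.
    # fit_one_variable(predictors, var)  # placeholder for the per-variable fit
    predictors.append(var)

print(len(predictors))  # 87: 3 base features + 84 imputed variables
```

Each fit therefore costs more memory than the last, and nothing is released in between.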

Current Behavior

  • Sequential imputation adds previously imputed variables as predictors
  • No memory management between variable imputations
  • Entire model is kept in memory for all variables
  • Default parameters (n_estimators=100) may be too memory-intensive for large sequential imputations

Suggested Solutions

  1. Add automatic batching option: When imputing many variables, automatically batch them into smaller groups
  2. Add memory management: Force garbage collection between variable imputations
  3. Add sampling option: Allow automatic sampling of training data when it exceeds a threshold
  4. Add reduced complexity mode: Provide parameters for memory-constrained environments
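The four suggestions could be combined into a single helper along these lines. This is a sketch only: `model_cls` is assumed to expose the `fit()` signature shown in the workaround below, and `batch_size`/`max_rows` are hypothetical parameters, not existing library options.

```python
import gc

def impute_in_batches(data, base_predictors, variables, model_cls,
                      batch_size=20, max_rows=100_000, seed=0):
    """Sketch of batched, memory-conscious sequential imputation.

    model_cls is assumed to have the fit() signature from the workaround
    below; batch_size and max_rows are hypothetical knobs.
    """
    # Suggestion 3: sample training data above a row threshold.
    if len(data) > max_rows:
        data = data.sample(n=max_rows, random_state=seed)

    results = []
    for start in range(0, len(variables), batch_size):  # suggestion 1: batching
        batch = variables[start:start + batch_size]
        qrf = model_cls()
        results.append(qrf.fit(
            X_train=data,
            predictors=base_predictors,
            imputed_variables=batch,
            n_estimators=30,  # suggestion 4: reduced complexity
            max_depth=8,
        ))
        del qrf
        gc.collect()  # suggestion 2: free memory between batches
    return results
```

Building this into the library would remove the need for every caller to re-implement the batching loop shown in the workaround below.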

Example workaround currently needed in client code:

# Batch variables to avoid memory issues
import gc

batch_size = 20
for batch_start in range(0, len(variables), batch_size):
    batch_vars = variables[batch_start:batch_start + batch_size]
    qrf = QRF()
    fitted_model = qrf.fit(
        X_train=data,
        predictors=base_predictors,
        imputed_variables=batch_vars,
        n_estimators=30,  # Reduced from default (100)
        max_depth=8,      # Reduced from default
    )
    gc.collect()  # Manual memory cleanup between batches

Impact

This affects any use case with many variables to impute, particularly in CI/CD environments with limited memory.
