Problem
When using models.QRF with sequential imputation for a large number of variables (e.g., 84 variables), the process runs out of memory on CI runners. By the time it reaches variable 70+, each fit is using 80+ features as predictors, which exhausts available memory.
Current Behavior
- Sequential imputation adds previously imputed variables as predictors
- No memory management between variable imputations
- Entire model is kept in memory for all variables
- Default parameters (n_estimators=100) may be too memory-intensive for large sequential imputations
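As a quick sanity check on the growth described above, the predictor count increases by one with every imputed variable (the figure of 4 base predictors below is assumed for illustration; only the 84-variable count comes from the report):

```python
# Illustrate predictor growth under sequential imputation.
# Assumption: 4 base predictors (illustrative); 84 imputed variables (from the report).
n_base = 4
n_vars = 84

# By the time variable k (0-indexed) is imputed, all k previously
# imputed variables have been appended to the predictor set.
predictors_at = [n_base + k for k in range(n_vars)]

print(predictors_at[0])   # → 4 (first variable: base predictors only)
print(predictors_at[70])  # → 74 (variable 71 already sees 70 extra predictors)
```

So the last fits each carry roughly n_base + 83 predictor columns, which matches the "80+ features" observed in CI.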
Suggested Solutions
- Add automatic batching option: When imputing many variables, automatically batch them into smaller groups
- Add memory management: Force garbage collection between variable imputations
- Add sampling option: Allow automatic sampling of training data when it exceeds a threshold
- Add reduced complexity mode: Provide parameters for memory-constrained environments
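One possible shape for the first two suggestions (automatic batching plus per-batch cleanup) is sketched below. The helper name impute_in_batches, the model_cls parameter, and the fit() signature are illustrative assumptions, not the library's actual API:

```python
import gc


def impute_in_batches(data, base_predictors, variables, model_cls,
                      batch_size=20, **model_kwargs):
    """Hypothetical sketch: fit variables in batches, freeing each
    batch's model before the next one starts. `model_cls` stands in
    for QRF; its fit() signature here is assumed, not the real API."""
    results = {}
    for start in range(0, len(variables), batch_size):
        batch = variables[start:start + batch_size]
        model = model_cls()
        fitted = model.fit(
            X_train=data,
            predictors=base_predictors,
            imputed_variables=batch,
            **model_kwargs,
        )
        results.update({var: fitted for var in batch})
        # Drop batch-local references and force collection so each
        # batch's trees are reclaimed before the next fit allocates.
        del fitted, model
        gc.collect()
    return results
```

Keeping the batch size configurable would let CI environments trade speed for a smaller peak footprint without changing the default behavior.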
Example workaround currently needed in client code:

import gc

# Batch variables to avoid memory issues
batch_size = 20
for batch_start in range(0, len(variables), batch_size):
    batch_vars = variables[batch_start:batch_start + batch_size]
    qrf = QRF()
    fitted_model = qrf.fit(
        X_train=data,
        predictors=base_predictors,
        imputed_variables=batch_vars,
        n_estimators=30,  # Reduced from default (100)
        max_depth=8,      # Reduced from default
    )
    # ... use fitted_model for this batch ...
    del fitted_model, qrf
    gc.collect()  # Manual memory cleanup

Impact
This affects any use case with many variables to impute, particularly in CI/CD environments with limited memory.