# Predicting Serious Delinquency (90+ days past due) using the “Give Me Some Credit” dataset
A clean, production-ready machine learning project that demonstrates best practices: modular code, configuration-driven training, experiment tracking with MLflow, SHAP interpretability, statistical significance testing, and full reproducibility.
| Component | Technology / Approach |
|---|---|
| Data | Kaggle “Give Me Some Credit” (2011) |
| Target | SeriousDlqin2yrs – binary (~7 % positive class) |
| Preprocessing | Custom sklearn-compatible transformer + ColumnTransformer |
| Imputation | IterativeImputer (MICE) guided by tree-based feature importance |
| Models | Logistic Regression (interpretable baseline), Random Forest, XGBoost (best performance) |
| Hyper-parameter tuning | GridSearchCV + RandomizedSearchCV (in notebooks) |
| Evaluation | ROC-AUC (primary), F1 with optimal threshold, statistical significance (bootstrap) |
| Interpretability | Coefficients, feature importance, SHAP (TreeExplainer) |
| Experiment tracking | MLflow (parameters, metrics, models, artifacts) |
| Packaging | src/ package, requirements.txt, virtual environment |
| CLI | python src/train.py, python src/evaluate.py |
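
As a rough sketch of how the imputation and preprocessing described above could be wired together (the column subset and estimator settings here are illustrative assumptions, not the project's actual `features.py`):

```python
# Minimal sketch, assuming a numeric-only subset of the dataset's columns:
# IterativeImputer (MICE) with a tree-based estimator inside a ColumnTransformer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative subset of the dataset's numeric columns
NUMERIC_COLS = ["RevolvingUtilizationOfUnsecuredLines", "age", "MonthlyIncome"]

numeric_pipeline = Pipeline([
    # MICE-style imputation; a tree-based estimator captures non-linear structure
    ("impute", IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=50, random_state=42),
        max_iter=10, random_state=42)),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [("num", numeric_pipeline, NUMERIC_COLS)],
    remainder="passthrough",
)
```

Keeping the imputer inside the pipeline means it is fit only on training folds during cross-validation, which avoids leakage from the validation data.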
## Project structure

```text
credit-scoring/
├── config.yaml                # All hyper-parameters & paths
├── requirements.txt
├── data/
│   └── raw/
│       ├── cs-training.csv
│       └── cs-test.csv
├── models/
│   ├── trained_models/        # .pkl files (gitignored)
│   └── preprocess_artifacts/  # cached imputers & scalers (gitignored)
├── reports/
│   └── figures/               # SHAP plots, coefficient charts
├── src/
│   ├── __init__.py
│   ├── data.py                # load_data()
│   ├── features.py            # preprocessing pipeline + CustomPreprocessor
│   ├── models.py              # model factory (logistic, rf, xgb, ensemble)
│   ├── train.py               # training script with MLflow logging
│   └── evaluate.py            # inference, metrics, SHAP generation
├── notebooks/
│   └── 03_modeling.ipynb      # exploratory modeling, tuning, significance tests
└── README.md
```
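
A hypothetical sketch of the model factory in `src/models.py` (the function name, defaults, and the soft-voting ensemble are assumptions; the real module may differ):

```python
# Illustrative model factory; not the project's exact src/models.py.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

def make_model(name: str, random_state: int = 42):
    """Return an unfitted estimator for the given model name."""
    models = {
        "logistic": LogisticRegression(max_iter=1000, class_weight="balanced",
                                       random_state=random_state),
        "rf": RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                     random_state=random_state),
        "xgb": XGBClassifier(n_estimators=400, learning_rate=0.05,
                             eval_metric="auc", random_state=random_state),
    }
    if name == "ensemble":
        # Soft-voting ensemble over the three base models
        return VotingClassifier(
            estimators=[(k, v) for k, v in models.items()], voting="soft")
    return models[name]
```

Usage would then look like `model = make_model("xgb")` before fitting inside the training pipeline.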
## Quickstart

```bash
# 1. Clone and enter the project
git clone https://github.com/yourname/credit-scoring.git
cd credit-scoring

# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Train models (example: XGBoost)
python src/train.py --model xgb

# Other models
python src/train.py --model logistic
python src/train.py --model rf
python src/train.py --model ensemble

# 5. Evaluate any trained model (generates SHAP plots & reports)
python src/evaluate.py --model xgb --data_path data/raw/cs-training.csv

# 6. View experiments
mlflow ui  # open http://localhost:5000
```

## Results

| Model | CV ROC-AUC | Val ROC-AUC | Best F1 | Notes |
|---|---|---|---|---|
| Logistic (tuned) | 0.848 | 0.855 | 0.429 | Fully interpretable coefficients |
| Random Forest | 0.858 | 0.864 | 0.434 | Good balance of performance and simplicity |
| XGBoost | 0.866 | 0.871 | 0.451 | Best predictive power |
A statistical bootstrap test (1 000 resamples) confirms that XGBoost is significantly better than Random Forest (p < 0.001); a sketch of the procedure follows.
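
A minimal sketch of such a paired bootstrap comparison of ROC-AUC (the function and variable names are illustrative; the project's notebook may implement it differently):

```python
# Illustrative paired bootstrap on ROC-AUC; not the notebook's exact code.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_pvalue(y_true, proba_a, proba_b, n_resamples=1000, seed=42):
    """One-sided p-value for 'model A has higher ROC-AUC than model B'."""
    y_true, proba_a, proba_b = map(np.asarray, (y_true, proba_a, proba_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:                   # AUC undefined on one class
            continue
        diffs.append(roc_auc_score(y_true[idx], proba_a[idx])
                     - roc_auc_score(y_true[idx], proba_b[idx]))
    # Share of resamples where A fails to beat B
    return float(np.mean(np.asarray(diffs) <= 0))
```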
## Interpretability

- Logistic Regression coefficients → direct business rules
- Random Forest / XGBoost feature importances → top risk drivers
- SHAP summary and force plots are automatically saved to `reports/figures/` (see the sketch below)
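
A rough sketch of how such plots can be produced with SHAP's `TreeExplainer` (the toy data and output filename are assumptions, included only to keep the snippet self-contained):

```python
# Illustrative SHAP workflow; not the project's evaluate.py.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Toy stand-in for the real features/model so the snippet runs on its own
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["utilization", "age", "income"])
y = (rng.random(200) < 0.07).astype(int)
model = XGBClassifier(n_estimators=50, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary of feature impact, saved as a static figure
os.makedirs("reports/figures", exist_ok=True)
shap.summary_plot(shap_values, X, show=False)
plt.savefig("reports/figures/shap_summary.png", bbox_inches="tight")
plt.close()
```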
## Reproducibility

- Fixed `random_state = 42` everywhere
- All hyper-parameters are stored in `config.yaml`
- Models and preprocessing artifacts are versioned via MLflow (see the logging sketch below)
- `requirements.txt` pins exact library versions
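
As a rough illustration of the kind of MLflow logging `train.py` performs (the parameter names, metric keys, and synthetic data are assumptions):

```python
# Illustrative MLflow run; the real train.py may log different keys and artifacts.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~7 % positives), standing in for the real dataset
X, y = make_classification(n_samples=1000, weights=[0.93], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

with mlflow.start_run(run_name="logistic-demo"):
    model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
    mlflow.log_params({"model": "logistic", "max_iter": 1000, "random_state": 42})
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_roc_auc", val_auc)
    mlflow.sklearn.log_model(model, "model")  # model artifact versioned with the run
```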