# Predicting Serious Delinquency (90+ days past due) using the “Give Me Some Credit” dataset
A clean, production-ready machine learning project that demonstrates best practices: modular code, configuration-driven training, experiment tracking with MLflow, SHAP interpretability, statistical significance testing, and full reproducibility.
| Component | Technology / Approach |
|---|---|
| Data | Kaggle “Give Me Some Credit” (2011) |
| Target | SeriousDlqin2yrs – binary (~7 % positive class) |
| Preprocessing | Custom sklearn-compatible transformer + ColumnTransformer |
| Imputation | IterativeImputer (MICE) guided by tree-based feature importance |
| Models | Logistic Regression (interpretable baseline), Random Forest, XGBoost (best performance) |
| Hyper-parameter tuning | GridSearchCV + RandomizedSearchCV (in notebooks) |
| Evaluation | ROC-AUC (primary), F1 with optimal threshold, statistical significance (bootstrap) |
| Interpretability | Coefficients, feature importance, SHAP (TreeExplainer) |
| Experiment tracking | MLflow (parameters, metrics, models, artifacts) |
| Packaging | src/ package, requirements.txt, virtual environment |
| CLI | python src/train.py, python src/evaluate.py |
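
As a rough sketch of how the imputation and preprocessing described above could be wired together (the column subset and estimator settings here are illustrative assumptions, not the project's actual `features.py`):

```python
# Minimal sketch, assuming a numeric-only subset of the dataset's columns:
# IterativeImputer (MICE) with a tree-based estimator inside a ColumnTransformer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative subset of the dataset's numeric columns
NUMERIC_COLS = ["RevolvingUtilizationOfUnsecuredLines", "age", "MonthlyIncome"]

numeric_pipeline = Pipeline([
    # MICE-style imputation; a tree-based estimator captures non-linear structure
    ("impute", IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=50, random_state=42),
        max_iter=10, random_state=42)),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [("num", numeric_pipeline, NUMERIC_COLS)],
    remainder="passthrough",
)
```

Keeping the imputer inside the pipeline means it is fit only on training folds during cross-validation, which avoids leakage from the validation data.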
## Project structure

```text
credit-scoring/
├── config.yaml                # All hyper-parameters & paths
├── requirements.txt
├── data/
│   └── raw/
│       ├── cs-training.csv
│       └── cs-test.csv
├── models/
│   ├── trained_models/        # .pkl files (gitignored)
│   └── preprocess_artifacts/  # cached imputers & scalers (gitignored)
├── reports/
│   └── figures/               # SHAP plots, coefficient charts
├── src/
│   ├── __init__.py
│   ├── data.py                # load_data()
│   ├── features.py            # preprocessing pipeline + CustomPreprocessor
│   ├── models.py              # model factory (logistic, rf, xgb, ensemble)
│   ├── train.py               # training script with MLflow logging
│   └── evaluate.py            # inference, metrics, SHAP generation
├── notebooks/
│   └── 03_modeling.ipynb      # exploratory modeling, tuning, significance tests
└── README.md
```
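
A hypothetical sketch of the model factory in `src/models.py` (the function name, defaults, and the soft-voting ensemble are assumptions; the real module may differ):

```python
# Illustrative model factory; not the project's exact src/models.py.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

def make_model(name: str, random_state: int = 42):
    """Return an unfitted estimator for the given model name."""
    models = {
        "logistic": LogisticRegression(max_iter=1000, class_weight="balanced",
                                       random_state=random_state),
        "rf": RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                     random_state=random_state),
        "xgb": XGBClassifier(n_estimators=400, learning_rate=0.05,
                             eval_metric="auc", random_state=random_state),
    }
    if name == "ensemble":
        # Soft-voting ensemble over the three base models
        return VotingClassifier(
            estimators=[(k, v) for k, v in models.items()], voting="soft")
    return models[name]
```

Usage would then look like `model = make_model("xgb")` before fitting inside the training pipeline.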
## Quickstart

```bash
# 1. Clone and enter the project
git clone https://github.com/yourname/credit-scoring.git
cd credit-scoring

# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Train models (example: XGBoost)
python src/train.py --model xgb

# Other models
python src/train.py --model logistic
python src/train.py --model rf
python src/train.py --model ensemble

# 5. Evaluate any trained model (generates SHAP plots & reports)
python src/evaluate.py --model xgb --data_path data/raw/cs-training.csv

# 6. View experiments
mlflow ui  # open http://localhost:5000
```

## Results

| Model | CV ROC-AUC | Val ROC-AUC | Best F1 | Notes |
|---|---|---|---|---|
| Logistic (tuned) | 0.848 | 0.855 | 0.429 | Fully interpretable coefficients |
| Random Forest | 0.858 | 0.864 | 0.434 | Good balance of performance and simplicity |
| XGBoost | 0.866 | 0.871 | 0.451 | Best predictive power |
A statistical bootstrap test (1 000 resamples) confirms that XGBoost is significantly better than Random Forest (p < 0.001); a sketch of the procedure follows.
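
A minimal sketch of such a paired bootstrap comparison of ROC-AUC (the function and variable names are illustrative; the project's notebook may implement it differently):

```python
# Illustrative paired bootstrap on ROC-AUC; not the notebook's exact code.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_pvalue(y_true, proba_a, proba_b, n_resamples=1000, seed=42):
    """One-sided p-value for 'model A has higher ROC-AUC than model B'."""
    y_true, proba_a, proba_b = map(np.asarray, (y_true, proba_a, proba_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:                   # AUC undefined on one class
            continue
        diffs.append(roc_auc_score(y_true[idx], proba_a[idx])
                     - roc_auc_score(y_true[idx], proba_b[idx]))
    # Share of resamples where A fails to beat B
    return float(np.mean(np.asarray(diffs) <= 0))
```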
## Interpretability

- Logistic Regression coefficients → direct business rules
- Random Forest / XGBoost feature importances → top risk drivers
- SHAP summary and force plots are automatically saved to `reports/figures/` (see the sketch below)
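
A rough sketch of how such plots can be produced with SHAP's `TreeExplainer` (the toy data and output filename are assumptions, included only to keep the snippet self-contained):

```python
# Illustrative SHAP workflow; not the project's evaluate.py.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Toy stand-in for the real features/model so the snippet runs on its own
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["utilization", "age", "income"])
y = (rng.random(200) < 0.07).astype(int)
model = XGBClassifier(n_estimators=50, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary of feature impact, saved as a static figure
os.makedirs("reports/figures", exist_ok=True)
shap.summary_plot(shap_values, X, show=False)
plt.savefig("reports/figures/shap_summary.png", bbox_inches="tight")
plt.close()
```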
## Reproducibility

- Fixed `random_state = 42` everywhere
- All hyper-parameters are stored in `config.yaml`
- Models and preprocessing artifacts are versioned via MLflow (see the logging sketch below)
- `requirements.txt` pins exact library versions
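
As a rough illustration of the kind of MLflow logging `train.py` performs (the parameter names, metric keys, and synthetic data are assumptions):

```python
# Illustrative MLflow run; the real train.py may log different keys and artifacts.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~7 % positives), standing in for the real dataset
X, y = make_classification(n_samples=1000, weights=[0.93], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

with mlflow.start_run(run_name="logistic-demo"):
    model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
    mlflow.log_params({"model": "logistic", "max_iter": 1000, "random_state": 42})
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_roc_auc", val_auc)
    mlflow.sklearn.log_model(model, "model")  # model artifact versioned with the run
```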