utkarsh-284/Credit-Default-Risk

Credit Default Risk Analysis

Overview

This project analyzes credit card default risk using machine learning techniques on a dataset of 30,000 Taiwanese credit card clients. The analysis includes comprehensive exploratory data analysis, statistical modeling, and machine learning approaches to predict credit card defaults and identify key risk factors.

Key Findings

🎯 Best Model Performance

  • Support Vector Machine (SVM): 80.5% accuracy, 0.70 AUC-ROC score
  • Ensemble Methods: Voting Classifier achieved 80.5% accuracy
  • Neural Network: Deep learning model achieved 80.0% accuracy

📊 Critical Risk Factors

  1. Recent Payment Status (PAY_0): Most predictive feature (coefficient = 0.515)
  2. Payment History (PAY_2): Secondary predictor (coefficient = 0.111)
  3. Bill Amount (BILL_AMT1): Financial capacity indicator
  4. Credit Limit (LIMIT_BAL): Risk tolerance measure
  5. Age Demographics: U-shaped risk curve (highest for young adults and seniors)

🔍 Demographic Insights

  • Gender: Females take more loans but have lower default rates
  • Education: University graduates borrow most, but all education levels show similar default proportions
  • Age: Peak borrowing at 25-30 years, with highest default risk for young adults and seniors
  • Marital Status: Singles borrow more but have higher default rates than married individuals

Dataset Details

Source: UCI Machine Learning Repository

Variables:

  • Demographic and Account Features: Credit limit (LIMIT_BAL), gender (SEX), education (EDUCATION), marital status (MARRIAGE), age (AGE)
  • Payment History: Repayment status for the past 6 months (PAY_0 and PAY_2 to PAY_6; the dataset has no PAY_1 column)
  • Billing Statements: Bill amounts for past 6 months (BILL_AMT1 to BILL_AMT6)
  • Payment Amounts: Amounts paid in past 6 months (PAY_AMT1 to PAY_AMT6)
  • Target Variable: DEFAULT (binary indicator: 1 = default, 0 = non-default)

Methodology

Data Preprocessing

  • Missing Values: No missing values detected
  • Categorical Variables:
    • EDUCATION: Categories 0, 5, 6 merged into "Others"
    • MARRIAGE: Category 0 merged into "Others"
  • Feature Engineering:
    • Created TOTAL_BILL_AMT (sum of all bill amounts)
    • Created TOTAL_PAY_AMT (sum of all payment amounts)
  • Outlier Removal: Applied IQR method, reducing dataset from 30,000 to 25,174 observations
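
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the numeric codes used for "Others" (4 for EDUCATION, 3 for MARRIAGE), the `iqr_columns` parameter, and the 1.5 multiplier are assumptions, since the README does not say which columns were filtered or which codes were used.

```python
import pandas as pd


def preprocess(df, iqr_columns, k=1.5):
    """Merge rare category codes, add total features, and drop IQR outliers."""
    out = df.copy()

    # Merge undocumented codes into "Others".
    # ASSUMPTION: 4 is the "Others" code for EDUCATION and 3 for MARRIAGE.
    out["EDUCATION"] = out["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})

    # Engineered totals across the six monthly columns.
    bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
    pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]
    out["TOTAL_BILL_AMT"] = out[bill_cols].sum(axis=1)
    out["TOTAL_PAY_AMT"] = out[pay_cols].sum(axis=1)

    # Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column.
    # ASSUMPTION: which columns to filter is the caller's choice (iqr_columns).
    q1 = out[iqr_columns].quantile(0.25)
    q3 = out[iqr_columns].quantile(0.75)
    iqr = q3 - q1
    lower_ok = out[iqr_columns].ge(q1 - k * iqr, axis=1)
    upper_ok = out[iqr_columns].le(q3 + k * iqr, axis=1)
    keep = (lower_ok & upper_ok).all(axis=1)
    return out[keep].reset_index(drop=True)
```

Calling `preprocess(df, ["LIMIT_BAL"])` would filter on the credit limit only; filtering on more columns removes more rows, so the 25,174 figure depends on a column choice the README does not spell out.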

Exploratory Data Analysis (EDA)

  • Comprehensive demographic analysis revealing risk patterns
  • Financial behavior correlation analysis
  • Age-based risk segmentation
  • Payment history impact assessment
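
As an illustration of the age-based segmentation, the sketch below computes the default rate per age band. The band edges and the function name `default_rate_by_age_band` are hypothetical; the README does not state which bands the notebook uses.

```python
import pandas as pd


def default_rate_by_age_band(df, bins=(20, 25, 30, 40, 50, 60, 80)):
    """Return client count and default rate for each age band."""
    # ASSUMPTION: band edges are illustrative; intervals are closed on the left.
    bands = pd.cut(df["AGE"], bins=list(bins), right=False)
    grouped = df.groupby(bands, observed=True)["DEFAULT"]
    return grouped.agg(clients="count", default_rate="mean")
```

A U-shaped pattern would show up as higher `default_rate` values in the first and last bands than in the middle ones.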

Statistical Analysis

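  • Logistic Regression: Pseudo R-squared = 0.121 (a likelihood-based goodness-of-fit measure; unlike the R-squared of linear regression, it is not a literal share of variance explained)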
  • Key Statistical Findings:
    • PAY_0 highly significant (p < 0.001)
    • PAY_2 significant (p < 0.001)
    • BILL_AMT1 significant (p < 0.001)
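
For readers unfamiliar with the pseudo R-squared figure, here is a small sketch of how such a value can be computed. It assumes McFadden's definition (one minus the ratio of the model log-likelihood to the intercept-only log-likelihood); the README does not say which variant was reported, and the function names are illustrative.

```python
import math


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def mcfadden_pseudo_r2(y_true, p_pred):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null
```

A model that only predicts the overall default rate scores 0 on this measure, and values move toward 1 as the predicted probabilities fit the observed outcomes better.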

Machine Learning Models Tested

Model                    Accuracy   AUC-ROC   Performance Rank
Support Vector Machine   80.5%      0.70      🥇 Best
Voting Classifier        80.5%      -         🥈 Second
Random Forest            80.0%      -         🥉 Third
Deep Neural Network      80.0%      -         🥉 Third
Logistic Regression      79.3%      -         4th
Elastic Net              76.3%      -         5th
Lasso Regression         76.2%      -         6th
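
A comparison like the one in this table could be produced along the following lines. This is a sketch on synthetic stand-in data, not the notebook's code: the hyperparameters, the hard-voting choice, and the three base models are assumptions, and the accuracies it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the credit data (roughly 22% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# ASSUMPTION: base models and settings are illustrative defaults.
base = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]
models = dict(base)
# The voting ensemble clones the base estimators and takes a majority vote.
models["voting"] = VotingClassifier(estimators=base, voting="hard")

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

# Print models from most to least accurate on the held-out split.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<8}{acc:.3f}")
```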

Business Impact

Risk Management Applications

  1. Early Warning System: Monitor recent payment status for immediate intervention
  2. Credit Limit Optimization: Adjust limits based on risk scores
  3. Customer Segmentation: Target high-risk demographics proactively
  4. Collection Strategy: Prioritize collection efforts based on risk assessment

Operational Benefits

  • Proactive Risk Mitigation: Identify high-risk clients before default
  • Reduced Financial Losses: Targeted interventions based on risk scores
  • Improved Customer Retention: Personalized risk-based strategies
  • Optimized Credit Policies: Data-driven lending decisions

Model Performance Analysis

Support Vector Machine (Best Model)

  • Accuracy: 80.5%
  • AUC-ROC: 0.70
  • Advantages:
    • Excellent for binary classification
    • Handles non-linear relationships
    • Robust to outliers
  • Business Value: Highest predictive power for default risk identification
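
The two SVM metrics reported here are computed from different outputs: accuracy from hard class predictions, AUC-ROC from a continuous score. The sketch below shows one way to obtain both with scikit-learn on synthetic data. The RBF kernel, `C=1.0`, and the scaling step are assumptions, not the notebook's confirmed configuration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; real features would come from the credit dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardize before fitting.
# ASSUMPTION: kernel and C are illustrative choices.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, random_state=0))
svm.fit(X_train, y_train)

# Accuracy uses the predicted labels.
accuracy = accuracy_score(y_test, svm.predict(X_test))
# AUC-ROC uses the signed distance to the decision boundary as a ranking score.
auc = roc_auc_score(y_test, svm.decision_function(X_test))

print(f"accuracy={accuracy:.3f}  auc_roc={auc:.3f}")
```

Using `decision_function` avoids the extra cost of `SVC(probability=True)` when only a ranking score is needed.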

Feature Importance Ranking

  1. PAY_0 (Most recent payment status) - Primary predictor
  2. PAY_2 (2nd most recent payment status) - Secondary predictor
  3. BILL_AMT1 (Most recent bill amount) - Financial capacity indicator
  4. LIMIT_BAL (Credit limit) - Risk tolerance indicator
  5. Age-related features - Life stage risk factors
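
The README does not state how this ranking was derived. One common approach, shown here purely as an assumption, is to standardize the features, fit a logistic regression, and order features by the absolute size of their coefficients. The helper name `rank_features` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def rank_features(X, y, feature_names):
    """Rank features by absolute standardized logistic-regression coefficient."""
    # Standardizing puts coefficients on a comparable scale across features.
    X_scaled = StandardScaler().fit_transform(X)
    model = LogisticRegression(max_iter=1000).fit(X_scaled, y)
    coefs = model.coef_[0]
    order = np.argsort(-np.abs(coefs))  # largest magnitude first
    return [(feature_names[i], float(coefs[i])) for i in order]
```

The function returns `(name, coefficient)` pairs, so the sign of each effect stays visible alongside its rank.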

Technical Implementation

Dependencies

Python 3.7+
pandas, numpy, matplotlib, seaborn
scikit-learn, statsmodels
tensorflow (for neural networks)

Key Libraries Used

  • Data Processing: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn
  • Statistical Analysis: Statsmodels
  • Machine Learning: Scikit-learn
  • Deep Learning: TensorFlow

Future Work

Model Enhancements

  • Advanced Algorithms: XGBoost, LightGBM implementation
  • Class Imbalance: SMOTE or cost-sensitive learning
  • Feature Engineering: Payment-to-bill ratios, temporal trends
  • Cross-validation: Robust model validation
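
Two of the proposed enhancements, cost-sensitive learning and cross-validation, can be combined in a few lines. The sketch below is one possible setup on synthetic data, assuming scikit-learn's `class_weight="balanced"` option and 5-fold stratified splits; the model choice and fold count are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 22% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.78, 0.22], random_state=0,
)

# class_weight="balanced" reweights errors inversely to class frequency,
# which is one form of cost-sensitive learning.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC-ROC per fold: {scores.round(3)}  mean={scores.mean():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into the model.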

Data Improvements

  • Additional Features: Macroeconomic indicators, employment data
  • Real-time Data: Live transactional data integration
  • Geographic Expansion: Multi-market validation
  • Temporal Analysis: Time series modeling

Deployment Considerations

  • API Development: Real-time prediction endpoints
  • Dashboard Creation: Risk visualization interface
  • Model Monitoring: Performance tracking and drift detection
  • Scalability: Cloud deployment for production use
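
For the drift detection item, one widely used measure is the Population Stability Index (PSI), which compares the distribution of model scores at training time with the distribution seen in production. The sketch below is an assumption about how monitoring might be done, not part of the current project; the function name and bin count are illustrative.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference score sample and a newer score sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges come from quantiles of the reference sample.
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])

    # Assign each score to a bin index in 0..n_bins-1 and count per bin.
    exp_counts = np.bincount(
        np.searchsorted(interior, expected, side="right"), minlength=n_bins
    )
    act_counts = np.bincount(
        np.searchsorted(interior, actual, side="right"), minlength=n_bins
    )

    # Convert to fractions; clip so empty bins do not produce log(0).
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```

A PSI of 0 means the two samples fall into the bins in identical proportions; larger values indicate a bigger shift, and the alert threshold is a policy choice for whoever operates the model.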

File Structure

Credit Risk Analysis/
├── README.md                           # Project overview and documentation
├── MODEL_REPORT.md                     # Comprehensive model analysis report
├── Credit_Default_Prediction.ipynb     # Complete Jupyter notebook analysis
└── content/
    ├── default of credit card clients.xls    # Original dataset
    └── default+of+credit+card+clients.zip    # Compressed dataset

Quick Start

  1. Clone the repository
  2. Install dependencies: pip install pandas numpy matplotlib seaborn scikit-learn statsmodels tensorflow
  3. Open the Jupyter notebook: Credit_Default_Prediction.ipynb
  4. Run the analysis: Execute cells sequentially for complete analysis
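
To work with the data outside the notebook, a loading helper might look like the sketch below. It rests on several assumptions about the UCI spreadsheet that the README does not confirm: that the real column names sit on the second row (hence `header=1`), that the target column is named "default payment next month", and that there is an ID column to drop. Reading `.xls` files with pandas also requires the `xlrd` package, which is not in the dependency list above.

```python
import pandas as pd

# ASSUMPTION: name of the target column in the original spreadsheet.
TARGET_SOURCE_NAME = "default payment next month"


def tidy_columns(df):
    """Strip header whitespace, rename the target to DEFAULT, drop the ID column."""
    tidy = df.rename(columns=lambda c: str(c).strip())
    tidy = tidy.rename(columns={TARGET_SOURCE_NAME: "DEFAULT"})
    return tidy.drop(columns=["ID"], errors="ignore")


def load_credit_data(path="content/default of credit card clients.xls"):
    """Read the spreadsheet and return a DataFrame with tidy column names."""
    # ASSUMPTION: the first spreadsheet row holds generic labels, so the
    # real header is on the second row (index 1).
    return tidy_columns(pd.read_excel(path, header=1))
```

After loading, `df["DEFAULT"].mean()` gives the overall default rate, which is a useful baseline when reading the accuracy figures.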

Results Summary

Model Performance

  • Best Model: Support Vector Machine (80.5% accuracy)
  • Key Predictor: Recent payment status (PAY_0)
  • Business Impact: Proactive risk identification and mitigation

Key Insights

  • Recent payment behavior is the strongest default predictor
  • Age shows U-shaped risk pattern (highest for young adults and seniors)
  • Higher credit limits correlate with lower default rates
  • Gender and education show nuanced risk relationships

Contributor

Utkarsh Bhardwaj
Publish Date: February 28, 2025
Contact: ubhardwaj284@gmail.com

LinkedIn | GitHub


Note: This analysis provides a solid foundation for credit risk assessment. The SVM model with 80.5% accuracy can be effectively deployed for real-time default prediction and risk management applications.