This project analyzes credit card default risk using machine learning techniques on a dataset of 30,000 Taiwanese credit card clients. The analysis includes comprehensive exploratory data analysis, statistical modeling, and machine learning approaches to predict credit card defaults and identify key risk factors.
- Support Vector Machine (SVM): 80.5% accuracy, 0.70 AUC-ROC score
- Ensemble Methods: Voting Classifier achieved 80.5% accuracy
- Neural Network: Deep learning model achieved 80.0% accuracy
- Recent Payment Status (PAY_0): Most predictive feature (coefficient = 0.515)
- Payment History (PAY_2): Secondary predictor (coefficient = 0.111)
- Bill Amount (BILL_AMT1): Financial capacity indicator
- Credit Limit (LIMIT_BAL): Risk tolerance measure
- Age Demographics: U-shaped risk curve (highest for young adults and seniors)
- Gender: Females take more loans but have lower default rates
- Education: University graduates borrow most, but all education levels show similar default proportions
- Age: Peak borrowing at 25-30 years, with highest default risk for young adults and seniors
- Marital Status: Singles borrow more but have higher default rates than married individuals
Source: UCI Machine Learning Repository
- Demographic Features: Credit limit (`LIMIT_BAL`), gender (`SEX`), education (`EDUCATION`), marital status (`MARRIAGE`), age (`AGE`)
- Payment History: Repayment status for past 6 months (`PAY_0` to `PAY_6`)
- Billing Statements: Bill amounts for past 6 months (`BILL_AMT1` to `BILL_AMT6`)
- Payment Amounts: Amounts paid in past 6 months (`PAY_AMT1` to `PAY_AMT6`)
- Target Variable: `DEFAULT` (binary indicator: 1 = default, 0 = non-default)
- Missing Values: No missing values detected
- Categorical Variables:
- EDUCATION: Categories 0, 5, 6 merged into "Others"
- MARRIAGE: Category 0 merged into "Others"
- Feature Engineering:
- Created `TOTAL_BILL_AMT` (sum of all bill amounts)
- Created `TOTAL_PAY_AMT` (sum of all payment amounts)
- Outlier Removal: Applied IQR method, reducing dataset from 30,000 to 25,174 observations
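The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `preprocess` and `remove_iqr_outliers`, the target codes used for "Others" (4 for `EDUCATION`, 3 for `MARRIAGE`), the 1.5 multiplier, and the choice of `LIMIT_BAL` as the only filtered column are all assumptions, and the tiny demo frame is made up.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Merge rare category codes and add the two total-amount features."""
    out = df.copy()
    # Assumed coding: 4 = "Others" for EDUCATION, 3 = "Others" for MARRIAGE.
    out["EDUCATION"] = out["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})
    bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
    pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]
    out["TOTAL_BILL_AMT"] = out[bill_cols].sum(axis=1)
    out["TOTAL_PAY_AMT"] = out[pay_cols].sum(axis=1)
    return out


def remove_iqr_outliers(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values in `cols` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    within = ((df[cols] >= q1 - k * iqr) & (df[cols] <= q3 + k * iqr)).all(axis=1)
    return df[within]


# Made-up demo rows; the last one has an extreme credit limit.
demo = pd.DataFrame({
    "LIMIT_BAL": [20000, 50000, 80000, 120000, 1000000],
    "EDUCATION": [0, 1, 2, 5, 6],
    "MARRIAGE": [0, 1, 2, 3, 1],
    **{f"BILL_AMT{i}": [100, 200, 300, 400, 500] for i in range(1, 7)},
    **{f"PAY_AMT{i}": [10, 20, 30, 40, 50] for i in range(1, 7)},
})
clean = remove_iqr_outliers(preprocess(demo), ["LIMIT_BAL"])
print(clean[["LIMIT_BAL", "EDUCATION", "MARRIAGE", "TOTAL_BILL_AMT"]])
```

Which columns the IQR filter is applied to strongly affects how many rows survive, so the 25,174 figure depends on the notebook's own column choice.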
- Comprehensive demographic analysis revealing risk patterns
- Financial behavior correlation analysis
- Age-based risk segmentation
- Payment history impact assessment
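As one illustration of the age-based segmentation listed above, default rates per age band could be computed along these lines. The function name `default_rate_by_age` and the bin edges are hypothetical, and the six demo rows are invented purely to show the mechanics.

```python
import pandas as pd


def default_rate_by_age(df: pd.DataFrame, bins=(20, 30, 40, 50, 60, 80)) -> pd.Series:
    """Mean of DEFAULT within left-closed age bands, e.g. [20, 30)."""
    age_band = pd.cut(df["AGE"], bins=list(bins), right=False)
    # observed=True drops age bands that contain no clients.
    return df.groupby(age_band, observed=True)["DEFAULT"].mean()


# Invented rows, not taken from the dataset.
demo = pd.DataFrame({
    "AGE": [22, 25, 35, 38, 65, 70],
    "DEFAULT": [1, 0, 0, 0, 1, 1],
})
rates = default_rate_by_age(demo)
print(rates)
```

On the real data, a U-shaped pattern would show up as higher rates in the first and last bands than in the middle ones.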
- Logistic Regression: Pseudo R-squared = 0.121 (a modest fit; unlike an OLS R-squared, this is not a literal share of variance explained)
- Key Statistical Findings:
- PAY_0 highly significant (p < 0.001)
- PAY_2 significant (p < 0.001)
- BILL_AMT1 significant (p < 0.001)
| Model | Accuracy | AUC-ROC | Performance Rank |
|---|---|---|---|
| Support Vector Machine | 80.5% | 0.70 | 🥇 Best |
| Voting Classifier | 80.5% | - | 🥈 Second |
| Random Forest | 80.0% | - | 🥉 Third |
| Deep Neural Network | 80.0% | - | 🥉 Third |
| Logistic Regression | 79.3% | - | 4th |
| Elastic Net | 76.3% | - | 5th |
| Lasso Regression | 76.2% | - | 6th |
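The table reports AUC-ROC only for the SVM. A sketch of how accuracy and AUC-ROC could be computed the same way for several models is shown below. It uses a synthetic dataset from `make_classification` with an assumed class balance, default hyperparameters, and a soft-voting ensemble whose members are chosen for illustration; none of this is confirmed to match the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the credit data (assumed ~22% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are scale-sensitive, so standardize inside a pipeline.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
voting = VotingClassifier(
    estimators=[
        ("svm", svm),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average predicted probabilities so AUC can be computed
)

scores = {}
for name, model in {"SVM": svm, "Voting": voting}.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, proba),
    }

for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, AUC-ROC={s['auc']:.3f}")
```

Reporting AUC-ROC alongside accuracy for every model makes the ranking less sensitive to the class balance of the test set.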
- Early Warning System: Monitor recent payment status for immediate intervention
- Credit Limit Optimization: Adjust limits based on risk scores
- Customer Segmentation: Target high-risk demographics proactively
- Collection Strategy: Prioritize collection efforts based on risk assessment
- Proactive Risk Mitigation: Identify high-risk clients before default
- Reduced Financial Losses: Targeted interventions based on risk scores
- Improved Customer Retention: Personalized risk-based strategies
- Optimized Credit Policies: Data-driven lending decisions
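One way the risk-score-based prioritization described above might look in code is sketched below. The function name `assign_risk_tier`, the tier labels, and the 0.2 / 0.5 cut-offs are hypothetical; real thresholds would need to be set from business costs and validated model probabilities.

```python
import pandas as pd


def assign_risk_tier(prob: pd.Series, low: float = 0.2, high: float = 0.5) -> pd.Series:
    """Map predicted default probabilities to low / medium / high tiers."""
    return pd.cut(
        prob,
        bins=[0.0, low, high, 1.0],
        labels=["low", "medium", "high"],
        include_lowest=True,  # so a probability of exactly 0.0 lands in "low"
    )


# Invented probabilities standing in for model output.
clients = pd.DataFrame({
    "client_id": [101, 102, 103, 104, 105],
    "default_prob": [0.05, 0.2, 0.35, 0.5, 0.9],
})
clients["risk_tier"] = assign_risk_tier(clients["default_prob"])

# Collection queue: highest predicted risk first.
queue = clients.sort_values("default_prob", ascending=False)
print(queue)
```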
- Accuracy: 80.5%
- AUC-ROC: 0.70
- Advantages:
- Excellent for binary classification
- Handles non-linear relationships
- Robust to outliers
- Business Value: Highest predictive power for default risk identification
- PAY_0 (Most recent payment status) - Primary predictor
- PAY_2 (2nd most recent payment status) - Secondary predictor
- BILL_AMT1 (Most recent bill amount) - Financial capacity indicator
- LIMIT_BAL (Credit limit) - Risk tolerance indicator
- Age-related features - Life stage risk factors
Python 3.7+
pandas, numpy, matplotlib, seaborn
scikit-learn, statsmodels
tensorflow (for neural networks)
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Statistical Analysis: Statsmodels
- Machine Learning: Scikit-learn
- Deep Learning: TensorFlow
- Advanced Algorithms: XGBoost, LightGBM implementation
- Class Imbalance: SMOTE or cost-sensitive learning
- Feature Engineering: Payment-to-bill ratios, temporal trends
- Cross-validation: Robust model validation
- Additional Features: Macroeconomic indicators, employment data
- Real-time Data: Live transactional data integration
- Geographic Expansion: Multi-market validation
- Temporal Analysis: Time series modeling
- API Development: Real-time prediction endpoints
- Dashboard Creation: Risk visualization interface
- Model Monitoring: Performance tracking and drift detection
- Scalability: Cloud deployment for production use
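Two of the model-improvement items listed above, cost-sensitive learning and cross-validation, can be combined in a few lines of scikit-learn. The sketch below is only an illustration on synthetic data: the classifier, the `class_weight="balanced"` setting, the five folds, and the assumed class balance are choices made for the example rather than a plan taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (assumed ~22% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency,
# a simple form of cost-sensitive learning.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC-ROC: mean={auc_scores.mean():.3f}, std={auc_scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into training.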
Credit Risk Analysis/
├── README.md # Project overview and documentation
├── MODEL_REPORT.md # Comprehensive model analysis report
├── Credit_Default_Prediction.ipynb # Complete Jupyter notebook analysis
└── content/
    ├── default of credit card clients.xls # Original dataset
    └── default+of+credit+card+clients.zip # Compressed dataset
- Clone the repository
- Install dependencies: `pip install pandas numpy matplotlib seaborn scikit-learn statsmodels tensorflow`
- Open the Jupyter notebook: `Credit_Default_Prediction.ipynb`
- Run the analysis: Execute cells sequentially for complete analysis
- Best Model: Support Vector Machine (80.5% accuracy)
- Key Predictor: Recent payment status (PAY_0)
- Business Impact: Proactive risk identification and mitigation
- Recent payment behavior is the strongest default predictor
- Age shows U-shaped risk pattern (highest for young adults and seniors)
- Higher credit limits correlate with lower default rates
- Gender and education show nuanced risk relationships
Utkarsh Bhardwaj
Publish Date: February 28, 2025
Contact: ubhardwaj284@gmail.com
Note: This analysis provides a solid foundation for credit risk assessment. The SVM model with 80.5% accuracy can be effectively deployed for real-time default prediction and risk management applications.