Comprehensive Data Analytics Project for Transportation Network Optimization
This project presents a comprehensive analysis of transportation network performance using the DataCo Smart Supply Chain dataset (180,519 shipment records). The analysis directly addresses Transportation Execution & Systems Team objectives for continuous improvement in North American transportation networks.
- Identify operational inefficiencies and bottlenecks in transportation networks
- Reduce late delivery rates through data-driven interventions
- Optimize shipping mode allocation for cost and performance balance
- Enable predictive intervention through machine learning
- Support strategic decision-making with actionable insights
Current state: 54.8% late delivery rate across the network represents significant operational inefficiency and customer satisfaction risk.
Target state: Reduce late deliveries by 15-25% through:
- Dynamic shipping mode optimization
- Geographic bottleneck elimination
- Product category-specific protocols
- Predictive risk scoring and intervention
- Python 3.8+ - Core programming language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib & Seaborn - Data visualization
- MySQL - Relational database management
- SQL - Complex queries and data aggregation
- Database Normalization - 1NF, 2NF, 3NF implementation
- Scikit-learn - Predictive modeling
- Logistic Regression - Late delivery risk classification
- Model Evaluation - AUC-ROC, confusion matrix, feature importance
- Jupyter Notebook - Interactive analysis
- Node.js - Report generation
- docx - Word document creation
- Standard Class: 68.2% late delivery rate (4.1 days avg)
- First Class: 32.1% late delivery rate (1.8 days avg)
- Same Day: 12.5% late delivery rate (0.9 days avg)
Insight: Reallocating 15-20% of Standard Class to First Class for high-priority customers could reduce late deliveries by 8-12%.
- Puerto Rico: 72.3% late delivery rate (highest)
- West Virginia: 64.1% late delivery rate
- California: 31,247 late shipments (highest volume)
- Texas: 22,198 late shipments
Insight: 15 states account for 68% of all late deliveries - targeted improvements yield disproportionate gains.
- Sporting Goods: 61.8% late delivery rate
- Electronics: 58.2% late delivery rate
- Books & Media: 43.7% late delivery rate (best)
Insight: Category-specific handling protocols needed for high-risk categories.
- 87% AUC-ROC score - High prediction accuracy
- 82% True Positive Rate - Correctly identifies late deliveries
- 15% False Positive Rate - Low false alarm rate
Insight: Real-time risk scoring enables proactive intervention on high-risk shipments.
Implement automated system to reallocate high-risk shipments from Standard to First Class.
Expected Impact: 8-12% reduction in late deliveries
Deploy task force to top 15 states accounting for 68% of late deliveries.
Expected Impact: 15-20% improvement in priority markets
Develop specialized handling for Sporting Goods and Electronics.
Expected Impact: 10-15% category late delivery reduction
Build KPI dashboard with predictive risk scoring for operations visibility.
Expected Impact: Enable proactive intervention on 15-20% of at-risk orders
Use historical patterns to forecast demand surges and allocate resources.
Expected Impact: 12-18% network throughput capacity gain
# Python packages
pip install pandas numpy matplotlib seaborn scikit-learn jupyter --break-system-packages
# Node.js (for report generation)
sudo apt-get install nodejs npm
# docx package
npm install -g docx
# MySQL (for database analysis)
sudo apt-get install mysql-server-
Download dataset from Kaggle:
- Link: https://www.kaggle.com/datasets/shashwatwork/dataco-smart-supply-chain-for-big-data-analysis
- File:
DataCoSupplyChainDataset.csv
-
Place in project directory:
mkdir -p data mv DataCoSupplyChainDataset.csv data/
jupyter notebook transportation_network_analysis.ipynbRun all cells sequentially for complete analysis.
# Create database
mysql -u root -p < transportation_network_schema.sql
# Run specific queries (examples in SQL file)
mysql -u root -p transportation_network < your_query.sqlnode create_executive_report.js- Load 180,519 records from CSV
- Inspect data types, missing values, distributions
- Identify 24 key variables for analysis
- Remove duplicates
- Handle missing values in critical columns
- Create calculated fields (delivery variance, late flags)
- Standardize formats
- Overall delivery performance KPIs
- Shipping mode performance comparison
- Geographic analysis by state/region
- Product category performance
- Distribution and correlation analysis
- Root cause analysis for late deliveries
- Correlation matrix for risk factors
- Bottleneck identification (mode + route combinations)
- Pareto analysis (80/20 rule)
- Train logistic regression model
- Achieve 87% AUC-ROC score
- Feature importance analysis
- Model evaluation (confusion matrix, ROC curve)
- Synthesize findings into actionable recommendations
- Develop implementation roadmap
- Project business impact
- orders - Main fact table with order details
- shipping - Delivery performance data
- customers - Customer information and geography
- products - Product catalog
- warehouse - Distribution center locations
- Shipping mode performance analysis
- Geographic performance by state
- Product category analysis
- Warehouse performance comparison
- Bottleneck identification (mode + route)
- Cost efficiency analysis
- Time-based trends
See transportation_network_schema.sql for complete schema and queries.
- Delivery Status Distribution - Pie and bar charts
- Shipping Mode Analysis - Late rates, volumes, delivery times, variance
- Geographic Performance - Top states by volume and late delivery rate
- Product Category Performance - Volume, late rates, sales, delivery times
- Correlation Matrix - Heatmap of risk factors
- Model Performance - Confusion matrix, ROC curve, feature importance
All visualizations saved as high-resolution PNG files (300 DPI).
| Impact Area | Projected Improvement |
|---|---|
| Overall Late Delivery Rate | 15-25% reduction |
| Customer Satisfaction Score | 10-15 point increase |
| Operational Cost Efficiency | 5-8% improvement |
| Average Delivery Time | 0.3-0.5 day reduction |
| Network Throughput | 12-18% capacity gain |
This project uses the DataCo Smart Supply Chain dataset available under CC0: Public Domain license from Kaggle.
- Dataset Source: Fabian Constante, Instituto Politecnico de Leiria
- Platform: Kaggle (https://www.kaggle.com/datasets/shashwatwork/dataco-smart-supply-chain-for-big-data-analysis)
- Tools: Python, SQL, Jupyter, Node.js, docx
Last Updated: December 2025