# Delivery_ETA: Delivery Time Prediction
## Table of Contents

- [Overview](#overview)
- [Problem Statement](#problem-statement)
- [Dataset](#dataset)
- [Feature Engineering](#feature-engineering)
- [Modeling Strategy](#modeling-strategy)
- [Results Summary](#results-summary)
- [Project Structure](#project-structure)
- [Setup Instructions](#setup-instructions)
- [Usage](#usage)
- [Key Insights](#key-insights)
- [Future Improvements](#future-improvements)
## Overview

This project implements an end-to-end delivery time prediction pipeline using real-world logistics data. It covers data preprocessing, feature engineering, model training, evaluation, model selection, and production-ready inference.

**Objective:** Predict delivery duration (in minutes) from distance, traffic conditions, temporal factors, and weather.
## Problem Statement

Accurate delivery time estimation is critical for:

- Optimizing customer experience
- Supporting operational planning
- Improving logistics and fleet efficiency

Delivery time is shaped by non-linear, interacting factors (traffic, time of day, distance, weather), making this a regression problem well suited to ensemble learning.
## Dataset

Primary dataset: **Amazon Delivery Dataset** (~43,000 rows)

It contains:

- Store and drop-off coordinates
- Order timestamps
- Traffic conditions
- Weather categories
- Actual delivery time (target variable)

**Data provenance:** Publicly available; the train/test split is made chronologically during preprocessing to prevent data leakage.
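A chronological split keeps future orders out of the training set. Below is a minimal sketch of that idea, assuming a pandas DataFrame with an `Order_Datetime` column; the column name is illustrative, not necessarily the one used in `src/preprocessing.py`.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str = "Order_Datetime", test_frac: float = 0.2):
    """Split a DataFrame so the test set contains only the most recent orders."""
    df = df.sort_values(time_col).reset_index(drop=True)  # order rows by timestamp
    cutoff = int(len(df) * (1 - test_frac))               # index of the first test row
    return df.iloc[:cutoff], df.iloc[cutoff:]             # train on the past, test on the future

# train_df, test_df = chronological_split(orders_df)
```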
## Feature Engineering

<details>
<summary>Click to expand engineered features</summary>

| Feature | Description |
|---|---|
| `distance_km` | Haversine distance between store and drop-off location |
| `hour` | Hour of day extracted from the order timestamp |
| `day_of_week` | Integer-encoded weekday (0–6) |
| `is_weekend` | Binary weekend indicator |
| `traffic_level` | Ordinal traffic encoding (Low → Jam) |
| `weather` | One-hot encoded categorical feature |

</details>

**Consistency:** All transformations applied during training are reused at inference time.
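The sketch below shows one way these features could be derived with pandas. The column names (`Order_Datetime`, `Store_Latitude`, ...), the intermediate traffic categories, and the use of `pd.get_dummies` for weather are assumptions for illustration; the repository's actual transformations live in `src/preprocessing.py` and serialize a fitted weather encoder (`models/Weather_encoder.pkl`).

```python
import numpy as np
import pandas as pd

# Assumed ordinal mapping; only the endpoints (Low -> Jam) are stated in the feature table.
TRAFFIC_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Jam": 3}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinate pairs."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive distance, temporal, and encoded categorical features (illustrative column names)."""
    ts = pd.to_datetime(df["Order_Datetime"])
    df = df.assign(
        distance_km=haversine_km(df["Store_Latitude"], df["Store_Longitude"],
                                 df["Drop_Latitude"], df["Drop_Longitude"]),
        hour=ts.dt.hour,
        day_of_week=ts.dt.dayofweek,                     # 0 = Monday ... 6 = Sunday
        is_weekend=(ts.dt.dayofweek >= 5).astype(int),
        traffic_level=df["Traffic"].map(TRAFFIC_ORDER),
    )
    return pd.get_dummies(df, columns=["Weather"])       # one-hot encode weather categories
```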
## Modeling Strategy

**Models trained:**

- Linear Regression
- K-Nearest Neighbors (KNN)
- Random Forest
- XGBoost (selected best model)

**Evaluation metrics:**

- Mean Absolute Error (MAE), primary metric
- Root Mean Squared Error (RMSE)
- R² score

**Selection criterion:** Lowest test MAE, chosen for operational relevance (it is expressed directly in minutes of error).
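A compressed sketch of the comparison loop, assuming scikit-learn and xgboost are installed and that `X_train`, `X_test`, `y_train`, `y_test` are the numeric feature matrices and targets produced by the steps above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))   # RMSE derived from MSE
    r2 = r2_score(y_test, pred)
    print(f"{name}: MAE={mae:.2f} min, RMSE={rmse:.2f} min, R2={r2:.2f}")
```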
## Results Summary

Relative performance on the held-out test set (full metric tables are saved under `results/tables/`):

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression | High | High | Low |
| KNN | Moderate | Moderate | Low |
| Random Forest | Overfitting observed | Unstable | Low |
| XGBoost | Best | Best | 0.33 |

**Key insight:** XGBoost captures non-linear interactions between traffic, time, and distance, and generalizes best.
## Project Structure

```text
Delivery_ETA/
├── data/
│   ├── raw/
│   └── processed/
├── notebooks/
│   ├── eda_amazon.ipynb
│   └── modeling_amazon.ipynb
├── src/
│   ├── preprocessing.py
│   ├── dl_model.py
│   ├── eda_summary.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   └── predict.py
├── models/
│   ├── amazon_best_model.pkl
│   └── Weather_encoder.pkl
├── results/
│   ├── figures/
│   └── tables/
├── requirements.txt
└── README.md
```
## Setup Instructions

```bash
# Clone the repository
git clone <repo-link>
cd Delivery_ETA

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
```
## Usage

### Command-Line Interface

```bash
python src/predict.py
```

### Programmatic API

```python
from src.predict import predict_delivery_time

predict_delivery_time(
    distance_km=12.5,
    hour=16,
    day_of_week=4,
    weather="Sunny",
    traffic_level="High",
    is_weekend=1,
)
```
The prediction utility also:

- Computes `distance_km` when only coordinates are provided
- Normalizes categorical inputs
- Applies the same encoders used during training (see the inference sketch below)
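For orientation, here is a hypothetical sketch of that inference path. The artifact paths come from the project structure; the encoder type (a dense `OneHotEncoder`), the traffic mapping, and the column layout are assumptions, and the real logic lives in `src/predict.py`.

```python
import joblib
import pandas as pd

# Artifacts produced at training time (paths taken from the repository layout).
model = joblib.load("models/amazon_best_model.pkl")
weather_encoder = joblib.load("models/Weather_encoder.pkl")   # assumed: fitted, dense OneHotEncoder

TRAFFIC_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Jam": 3}  # assumed ordinal mapping (Low -> Jam)

def predict_minutes(distance_km, hour, day_of_week, weather, traffic_level, is_weekend):
    """Illustrative inference path: normalize inputs, reuse training-time encoders, predict."""
    numeric = pd.DataFrame([{
        "distance_km": distance_km,
        "hour": hour,
        "day_of_week": day_of_week,
        "is_weekend": is_weekend,
        "traffic_level": TRAFFIC_ORDER[traffic_level.title()],   # normalize casing, then ordinal-encode
    }])
    weather_ohe = pd.DataFrame(
        weather_encoder.transform([[weather.title()]]),          # same fitted encoder as training
        columns=weather_encoder.get_feature_names_out(),
    )
    X = pd.concat([numeric, weather_ohe], axis=1)
    return float(model.predict(X)[0])
```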
## Key Insights

- Traffic level is the strongest predictor
- Temporal features (hour, weekday) influence delivery time
- Distance alone is insufficient; contextual features are essential
- Ensemble models outperform linear baselines
- The model and encoders are serialized, keeping inference logic decoupled from training (see the sketch below)
- All plots and tables are stored under `results/`
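A minimal sketch of the training-side persistence that pairs with the inference loading shown above. `X_train`, `y_train`, and `raw_df` are assumed to come from the earlier steps, and the encoder choice (`OneHotEncoder`, scikit-learn >= 1.2 for `sparse_output`) is illustrative; the repository's actual logic lives in `src/train_model.py`.

```python
import joblib
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# Illustrative training-side persistence; assumes X_train, y_train, raw_df from the steps above.
best_model = XGBRegressor(random_state=42).fit(X_train, y_train)
weather_encoder = OneHotEncoder(sparse_output=False).fit(raw_df[["Weather"]])

joblib.dump(best_model, "models/amazon_best_model.pkl")
joblib.dump(weather_encoder, "models/Weather_encoder.pkl")
# src/predict.py only loads these two artifacts; inference never re-runs training code.
```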
## Future Improvements

- Residual and error distribution analysis
- Integration of real-time traffic data
- API deployment (FastAPI)
- Continuous retraining with additional datasets