In this project, I conducted basic analysis, feature engineering, normalization, and outlier handling, along with statistical and non-parametric testing to extract insights.

sayed-ashfaq/Delhivery-DataAnalysis

Delhivery Data Analysis

About Delhivery

Delhivery is the largest and fastest-growing fully integrated logistics provider in India as of Fiscal 2021. The company aims to build the operating system for commerce through a blend of world-class infrastructure, high-quality logistics operations, and cutting-edge engineering and technology capabilities.

The data team at Delhivery leverages vast datasets to enhance business intelligence, drive operational efficiency, and maintain profitability, creating a significant competitive edge.


Objective

The goal of this project is to process and analyze data generated by Delhivery's logistics operations to:

  1. Clean, sanitize, and manipulate raw data to derive actionable insights.
  2. Create useful features for the data science team to develop forecasting models.

Dataset

The dataset consists of records from Delhivery's logistics and operational data pipeline.

Key Features:

  • data: Indicates if the record is training or testing data.
  • trip_creation_time: Timestamp of trip creation.
  • route_schedule_uuid: Unique identifier for a route schedule.
  • route_type: Type of transportation (FTL, Carting).
    • FTL: Full Truck Load shipments, faster delivery as there are no intermediate pickups/drop-offs.
    • Carting: Delivery system using smaller vehicles (carts).
  • trip_uuid: Unique identifier for a trip (a trip can involve multiple source and destination centers).
  • source_center: ID of the trip's origin center.
  • source_name: Name of the trip's origin center.
  • destination_center: ID of the destination center.
  • destination_name: Name of the destination center.
  • od_start_time: Trip start time.
  • od_end_time: Trip end time.
  • start_scan_to_end_scan: Total time taken for delivery from source to destination.
  • actual_distance_to_destination: Actual distance in kilometers between source and destination.
  • actual_time: Cumulative time taken to complete the delivery.
  • osrm_time: Time calculated by the Open Source Routing Machine (OSRM), a routing engine that computes shortest paths and accounts for typical traffic conditions (cumulative).
  • osrm_distance: Distance calculated by OSRM (cumulative).
  • segment_actual_time: Time taken for a segment of the delivery.
  • segment_osrm_time: OSRM-calculated time for a delivery segment.
  • segment_osrm_distance: OSRM-calculated distance for a delivery segment.
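
Because a trip can span several rows (cumulative fields alongside per-segment fields), a common first step is to roll segment rows up to one row per trip. The sketch below shows one way this might be done with pandas. The sample values are made up, and it assumes the cumulative columns only grow within a trip. If they reset at each source/destination pair, source_center and destination_center would need to be added to the grouping keys.

```python
import pandas as pd

# Made-up segment-level rows for two trips (illustrative values only).
segments = pd.DataFrame({
    "trip_uuid": ["trip-1", "trip-1", "trip-1", "trip-2", "trip-2"],
    "actual_time": [10.0, 25.0, 40.0, 12.0, 30.0],           # cumulative
    "osrm_time": [8.0, 20.0, 33.0, 10.0, 24.0],              # cumulative
    "segment_actual_time": [10.0, 15.0, 15.0, 12.0, 18.0],   # per segment
    "segment_osrm_time": [8.0, 12.0, 13.0, 10.0, 14.0],      # per segment
})

# Assumption: cumulative columns peak on the last row of a trip, so "max"
# recovers the trip total; per-segment columns are summed instead.
trip_level = segments.groupby("trip_uuid", as_index=False).agg(
    actual_time=("actual_time", "max"),
    osrm_time=("osrm_time", "max"),
    segment_actual_time_sum=("segment_actual_time", "sum"),
    segment_osrm_time_sum=("segment_osrm_time", "sum"),
)

print(trip_level)
```

Comparing actual_time with segment_actual_time_sum per trip is a quick consistency check on the two ways of measuring the same quantity.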

Additional Fields:

The dataset also includes fields whose meaning is currently unclear: is_cutoff, cutoff_factor, cutoff_timestamp, and factor. They are retained for completeness and may be explored further.


Process Overview

1. Feature Engineering:

  • Derived meaningful metrics such as:
    • time_diff_hours: Time difference, in hours, between od_start_time and od_end_time.
    • Extracted components from timestamps (e.g., month, year, day of the week).
    • Split and standardized source and destination names into city, place code, and state.
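
The feature engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code. The sample rows are made up, the names of the derived date columns are hypothetical, and the regular expression assumes center names follow a "City_Place_Code (State)" pattern, which may not hold for every record.

```python
import pandas as pd

# Made-up sample rows; the name format "City_Place_Code (State)" is an assumption.
df = pd.DataFrame({
    "trip_creation_time": ["2018-09-20 02:35:36", "2018-10-01 23:10:00"],
    "od_start_time": ["2018-09-20 03:00:00", "2018-10-01 23:30:00"],
    "od_end_time": ["2018-09-20 09:30:00", "2018-10-02 02:00:00"],
    "source_name": ["Anand_VUNagar_DC (Gujarat)", "Gurgaon_Bilaspur_HB (Haryana)"],
})

# Parse timestamp strings into datetime columns.
for col in ["trip_creation_time", "od_start_time", "od_end_time"]:
    df[col] = pd.to_datetime(df[col])

# Duration between origin start and destination end, in hours.
df["time_diff_hours"] = (
    (df["od_end_time"] - df["od_start_time"]).dt.total_seconds() / 3600
)

# Calendar components of the trip creation time (hypothetical column names).
df["creation_year"] = df["trip_creation_time"].dt.year
df["creation_month"] = df["trip_creation_time"].dt.month
df["creation_dayofweek"] = df["trip_creation_time"].dt.dayofweek  # Monday = 0

# Split "City_Place_Code (State)" into four columns; rows that do not match
# the assumed pattern come back as NaN rather than raising an error.
pattern = r"^(?P<city>[^_]+)_(?P<place>.+)_(?P<code>[^_ ]+)\s*\((?P<state>[^)]+)\)$"
source_parts = df["source_name"].str.extract(pattern).add_prefix("source_")
df = df.join(source_parts)

print(df[["time_diff_hours", "source_city", "source_place", "source_code", "source_state"]])
```

The same extraction could be applied to destination_name with a "destination_" prefix.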

2. Data Cleaning:

  • Handled missing values using appropriate imputation techniques.
  • Identified outliers visually with boxplots and handled them using the interquartile range (IQR) method.
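
As a rough illustration of the IQR method, the sketch below flags values that fall more than 1.5 × IQR outside the quartiles. The helper name iqr_bounds, the multiplier of 1.5, and the sample values are assumptions for illustration. Both dropping and clipping are shown because either could be used to handle the flagged rows.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up delivery times with one extreme value.
actual_time = pd.Series(
    [30, 32, 35, 36, 38, 40, 41, 45, 400], name="actual_time", dtype=float
)

lower, upper = iqr_bounds(actual_time)
is_outlier = (actual_time < lower) | (actual_time > upper)

# Option 1: drop flagged rows. Option 2: cap them at the fences.
filtered = actual_time[~is_outlier]
clipped = actual_time.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}], outliers flagged: {int(is_outlier.sum())}")
```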

3. Categorical Feature Handling:

  • Applied one-hot encoding to categorical variables such as route_type so that downstream models can consume them as numeric inputs.
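
One-hot encoding of route_type might look like the following with pandas. This sketch uses pd.get_dummies on made-up rows; scikit-learn's OneHotEncoder would be an equally valid choice.

```python
import pandas as pd

# Made-up rows covering the two route types described above.
df = pd.DataFrame({"route_type": ["FTL", "Carting", "FTL"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["route_type"], dtype=int)

print(encoded)
```

With only two categories, passing drop_first=True would keep a single indicator column and avoid redundant information in linear models.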

4. Normalization and Standardization:

  • Used MinMaxScaler and StandardScaler for numerical columns to align features to a uniform scale.
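
The two scalers behave differently: MinMaxScaler rescales each column to the range [0, 1], while StandardScaler centers each column at zero with unit variance. A minimal sketch on made-up values (the choice of columns is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric columns on very different scales.
df = pd.DataFrame({
    "actual_time": [30.0, 60.0, 90.0],
    "osrm_distance": [10.0, 20.0, 40.0],
})
num_cols = ["actual_time", "osrm_distance"]

# Min-max scaling: (x - min) / (max - min), column by column.
minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

# Standardization: (x - mean) / std, column by column (population std).
standard = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

print(minmax)
print(standard)
```

When a model is trained later, the scaler should be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.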

Key Insights

  1. Route Type Insights:

    • FTL routes are faster and more efficient for long distances compared to Carting.
  2. Source and Destination Patterns:

    • High-frequency routes indicate key operational hubs that could benefit from resource optimization.
  3. Time Efficiency:

    • Delivery times vary significantly by route type, season, and traffic conditions.
  4. OSRM vs. Actual Metrics:

    • Discrepancies between OSRM-calculated and actual times/distances highlight areas for improving routing algorithms.
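
One simple way to quantify the gap between OSRM estimates and actual values is the ratio of actual_time to osrm_time, summarized per route type. The sketch below uses made-up numbers that do not reflect the project's findings; the column name time_ratio is hypothetical.

```python
import pandas as pd

# Made-up trip-level rows (not real results).
trips = pd.DataFrame({
    "route_type": ["FTL", "FTL", "Carting", "Carting"],
    "actual_time": [120.0, 200.0, 60.0, 90.0],
    "osrm_time": [100.0, 160.0, 30.0, 60.0],
})

# A ratio above 1 means the trip took longer than OSRM estimated.
trips["time_ratio"] = trips["actual_time"] / trips["osrm_time"]

# The median is less sensitive to extreme trips than the mean.
summary = trips.groupby("route_type")["time_ratio"].median()

print(summary)
```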

Tools and Libraries

This project utilized the following tools:

  • Python:
    • Pandas for data manipulation.
    • Matplotlib and Seaborn for visualization.
    • scikit-learn for preprocessing and scaling.
  • Jupyter Notebook: For interactive analysis and documentation.

Repository Structure

  • data/: Contains the dataset used for analysis.
  • notebooks/: Jupyter Notebooks documenting the analysis process.
  • visualizations/: Saved plots and charts.
  • README.md: Overview of the project (this file).

Next Steps

Future directions for this project include:

  1. Developing predictive models for delivery time and distance.
  2. Investigating patterns in the unknown fields (is_cutoff, cutoff_factor, etc.).
  3. Implementing clustering techniques to identify high-demand routes.

Acknowledgments

  • Dataset Source: Provided by Scaler for this analysis.
  • Python Libraries: Thanks to the open-source Python community for providing versatile data analysis tools.

License

This project is licensed for educational and non-commercial use only. If utilizing any part of this repository, please credit the author.
