Object Detection Models: Comprehensive Comparison and Analysis

A comprehensive implementation and evaluation of three state-of-the-art object detection architectures: Faster R-CNN, YOLOv11n, and DETR on COCO 2017 and Pascal VOC 2012 datasets.



📋 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 📂 Project Structure
  • 🤖 Models Implemented
  • 📊 Datasets
  • 🛠️ Installation
  • 🚀 Usage
  • 📏 Evaluation Metrics
  • 📈 Results
  • 🎨 Visualizations

🎯 Overview

This project provides a comprehensive comparison of three different object detection paradigms:

  1. Two-stage detectors (Faster R-CNN with ResNet-50 backbone)
  2. One-stage detectors (YOLOv11n)
  3. Transformer-based detectors (DETR with ResNet-101 backbone)

Each model is evaluated on standard benchmarks using consistent metrics, with detailed analysis of their strengths, weaknesses, and use cases.


✨ Features

  • 🔍 Three Detection Architectures: Implementation of Faster R-CNN, YOLOv11n, and DETR
  • 📊 Comprehensive Evaluation: mAP, IoU, inference time, and FPS metrics
  • 🎨 Rich Visualizations:
    • Feature map extraction from multiple layers
    • GradCAM interpretability analysis
    • Success and failure case studies
    • Comparative performance charts
  • 📈 Dual Dataset Evaluation: COCO 2017 and Pascal VOC 2012
  • 📝 Detailed Documentation: Complete LaTeX report with architecture analysis
  • 🚀 Production-Ready Code: Well-structured, modular implementation

📂 Project Structure

Object-Detection-Models/
│
├── Faster-RCNN-Resnet50/          # Faster R-CNN implementation
│   └── faster-rcnn.ipynb          # Main notebook
│
├── Yolov11n/                      # YOLOv11n implementation
│   ├── output_yolov11n_coco/      # COCO dataset results
│   │   ├── coco-yolo-v-11-n.ipynb
│   │   ├── success_case_*.png
│   │   ├── failure_case_*.png
│   │   └── feature_maps_*.png
│   └── output_yolov11n_voc/       # Pascal VOC results
│       ├── pascal-voc-yolo-v-11-n.ipynb
│       └── [similar output files]
│
├── Detr/                          # DETR implementation
│   ├── output_detr_coco/          # COCO dataset results
│   └── output_detr_voc/           # Pascal VOC results
│
├── images/                        # Architecture diagrams and figures
│   ├── faster-rcnn-arch.png
│   ├── yolo11-arch.png
│   ├── detr-arch.png
│   ├── iou.png
│   ├── gradcam.png
│   └── ...
│
├── outputs_fasterrcnn/            # Faster R-CNN output files
├── coco-dataset/                  # COCO 2017 dataset
├── pascal-voc-dataset/            # Pascal VOC 2012 dataset
│
├── assignment_report.tex          # LaTeX source
├── assignment_report.pdf          # Final report
└── README.md                      # This file

🤖 Models Implemented

1. Faster R-CNN (ResNet-50-FPN)

Type: Two-stage detector

Architecture:

  • Backbone: ResNet-50 with Feature Pyramid Network (FPN)
  • Region Proposal Network (RPN): Generates candidate object regions
  • RoI Head: Classifies and refines bounding boxes

Key Characteristics:

  • High accuracy for complex scenes
  • Excellent localization precision
  • Slower inference compared to one-stage detectors
  • Best for applications prioritizing accuracy over speed

Use Cases: Medical imaging, autonomous vehicles (non-real-time), quality inspection


2. YOLOv11n

Type: One-stage detector

Architecture:

  • Backbone: CSPDarknet with Cross Stage Partial connections
  • Neck: PANet (Path Aggregation Network)
  • Head: Direct prediction of bounding boxes and classes

Key Characteristics:

  • Real-time inference speed
  • Good balance of accuracy and efficiency
  • Excellent for detecting large objects
  • Single forward pass architecture

Use Cases: Real-time surveillance, robotics, mobile applications, live video processing


3. DETR (DEtection TRansformer)

Type: Transformer-based detector

Architecture:

  • Backbone: ResNet-101
  • Transformer Encoder: Global context understanding
  • Transformer Decoder: Direct set prediction with learned object queries

Key Characteristics:

  • No anchor boxes or NMS required
  • Global reasoning through self-attention
  • Best performance on objects with unusual aspect ratios
  • End-to-end trainable

Use Cases: Complex scene understanding, dense object detection, research applications


📊 Datasets

COCO 2017 (Common Objects in Context)

  • Validation Set: 5,000 images
  • Categories: 80 annotated object classes (the annotation schema defines 91 category IDs; 11 are unused)
  • Annotations: Bounding boxes, segmentation masks, keypoints
  • Characteristics: Complex scenes, multiple objects, various scales

Pascal VOC 2012

  • Validation Set: 5,823 images
  • Categories: 20 object classes
  • Annotations: Bounding boxes, segmentation masks
  • Characteristics: Single/few objects per image, clear backgrounds
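
As a quick sanity check that both datasets are in place, here is a minimal loading sketch with torchvision (paths assume the directory layout shown above; adjust to your setup):

# Minimal sketch: load both validation sets with torchvision.
from torchvision.datasets import CocoDetection, VOCDetection

coco_val = CocoDetection(
    root="coco-dataset/val2017",
    annFile="coco-dataset/annotations/instances_val2017.json",
)
voc_val = VOCDetection(
    root="pascal-voc-dataset",   # directory containing VOCdevkit/
    year="2012",
    image_set="val",
)

img, target = coco_val[0]        # PIL image and a list of annotation dicts
print(len(coco_val), len(voc_val))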

🛠️ Installation

Prerequisites

Python >= 3.10
CUDA >= 11.8 (for GPU acceleration)

Step 1: Clone the Repository

git clone https://github.com/ranimeshehata/Object-Detection-Models.git
cd Object-Detection-Models

Step 2: Create Virtual Environment

# Using conda (recommended)
conda create -n cv python=3.10
conda activate cv

Step 3: Install Dependencies

# Core dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install ultralytics  # For YOLOv11
pip install pycocotools
pip install transformers  # For DETR
pip install numpy pandas matplotlib pillow
pip install tqdm opencv-python

# for visualization
pip install seaborn plotly

Step 4: Download Datasets

COCO 2017:

mkdir -p coco-dataset
cd coco-dataset
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
cd ..

Pascal VOC 2012:

mkdir -p pascal-voc-dataset
cd pascal-voc-dataset
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_11-May-2012.tar
cd ..

🚀 Usage

Running Faster R-CNN

jupyter notebook Faster-RCNN-Resnet50/faster-rcnn.ipynb

Key steps in the notebook (a minimal sketch follows the list):

  1. Load pretrained model from torchvision
  2. Configure COCO/VOC dataset loaders
  3. Run inference on validation set
  4. Calculate mAP and IoU metrics
  5. Generate feature maps and GradCAM visualizations
  6. Analyze success/failure cases
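
A hedged sketch of steps 1 and 3, assuming the standard torchvision pretrained weights (the image path is illustrative, not the notebook's exact code):

import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import convert_image_dtype

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

# The model takes a list of float CHW tensors and returns one dict per image
img = convert_image_dtype(read_image("coco-dataset/val2017/000000039769.jpg"), torch.float)
with torch.no_grad():
    pred = model([img])[0]           # keys: 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5          # simple confidence filter
print(pred["boxes"][keep], pred["labels"][keep])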

Running YOLOv11n

For COCO:

jupyter notebook Yolov11n/output_yolov11n_coco/coco-yolo-v-11-n.ipynb

For Pascal VOC:

jupyter notebook Yolov11n/output_yolov11n_voc/pascal-voc-yolo-v-11-n.ipynb

Key steps (sketched below):

  1. Load YOLOv11n pretrained weights
  2. Run batch inference with confidence threshold 0.25
  3. Convert predictions to COCO format
  4. Evaluate using COCO evaluation API
  5. Extract feature maps from backbone layers
  6. Visualize success and failure cases
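
A minimal sketch of steps 1 and 2, assuming the ultralytics package and the pretrained yolo11n checkpoint:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")   # downloads the pretrained weights on first use

# stream=True yields results one image at a time instead of holding all 5,000 in memory
results = model.predict("coco-dataset/val2017", conf=0.25, stream=True)
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)   # boxes, class ids, scores
    break                                            # inspect the first image only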

Running DETR

jupyter notebook Detr/detr_evaluation.ipynb

Key steps (sketched below):

  1. Load DETR from Hugging Face transformers
  2. Process images through transformer encoder-decoder
  3. Apply Hungarian matching for evaluation
  4. Calculate metrics and visualize attention maps
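
A hedged sketch of steps 1 and 2 with Hugging Face transformers (the checkpoint name matches the ResNet-101 variant described above; the image path is illustrative):

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-101")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-101").eval()

image = Image.open("coco-dataset/val2017/000000039769.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the object-query predictions back to the original image scale
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
print(detections["labels"], detections["scores"], detections["boxes"])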

📏 Evaluation Metrics

1. Intersection over Union (IoU)

IoU = Area of Intersection / Area of Union

Measures the overlap between predicted and ground truth bounding boxes.
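
In code, the formula for axis-aligned boxes in (x1, y1, x2, y2) format is a few lines (a standalone sketch, not tied to any one model):

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle; width/height clamp to 0 when boxes don't overlap
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143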

2. Mean Average Precision (mAP)

mAP = (1/|C|) × Σ AP_c

Where:

  • C is the set of categories
  • AP_c is the Average Precision for category c
  • Reported at IoU threshold 0.5 (mAP@0.5)
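
With pycocotools this is a handful of calls (a minimal sketch, assuming predictions have already been exported to a hypothetical results.json in COCO detection format):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("coco-dataset/annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("results.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()              # prints AP at IoU .50:.95, .50, .75, ...
map_50 = evaluator.stats[1]        # stats[1] holds AP at IoU = 0.50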

3. Inference Time & FPS

  • Inference Time: Average time per image (milliseconds)
  • FPS: Frames per second = 1000 / inference_time_ms
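
A minimal timing sketch (assumes a CUDA device plus a `model` and a list of preprocessed `images` already defined; the synchronize calls keep asynchronous GPU work from skewing the clock):

import time
import torch

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    for img in images:
        model([img])
torch.cuda.synchronize()

ms_per_image = (time.perf_counter() - start) * 1000 / len(images)
print(f"{ms_per_image:.1f} ms/image = {1000 / ms_per_image:.1f} FPS")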

📈 Results

Performance on COCO Validation Set

| Model        | mAP@0.5 | Avg IoU | Inference Time (ms) | FPS  |
|--------------|---------|---------|---------------------|------|
| Faster R-CNN | 46.145  | 0.5377  | 113.9               | 9.1  |
| YOLOv11n     | 44.6    | 0.5051  | 16.4                | 56   |
| DETR         | 60.5    | 0.6588  | 75.3                | 13.3 |

Performance on Pascal VOC Validation Set

| Model        | mAP@0.5 | Avg IoU | Inference Time (ms) | FPS  |
|--------------|---------|---------|---------------------|------|
| Faster R-CNN | 48.693  | 0.3533  | 110                 | 8.8  |
| YOLOv11n     | 56.8    | 0.7176  | 14.48               | 70   |
| DETR         | 74.2    | 0.8175  | 86.3                | 11.3 |

Key Findings

✅ DETR achieves the highest mAP and best IoU scores on both datasets, excelling at complex scenes.
✅ YOLOv11n delivers by far the fastest inference with competitive accuracy, the best fit for real-time needs.
✅ Faster R-CNN offers solid, balanced performance, with strong localization on COCO.


🎨 Visualizations

Feature Map Extraction

Feature maps are extracted from three layers of each model:

  • Early layers: Low-level features (edges, textures, colors)
  • Middle layers: Mid-level features (object parts, local structures)
  • Late layers: High-level semantic features (complete objects)

Example outputs saved in:

  • Yolov11n/output_yolov11n_coco/feature_maps_8_channels.png
  • Yolov11n/output_yolov11n_coco/feature_maps_average.png
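
A minimal sketch of hook-based extraction, reusing the torchvision Faster R-CNN and image from the Usage section (the layer choices are illustrative; each model's notebook picks its own):

import torch

features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

body = model.backbone.body            # the ResNet-50 inside the FPN backbone
body.layer1.register_forward_hook(save_output("early"))
body.layer2.register_forward_hook(save_output("middle"))
body.layer4.register_forward_hook(save_output("late"))

with torch.no_grad():
    model([img])                      # one forward pass fills `features`

for name, fmap in features.items():
    print(name, fmap.shape)           # average over channels before plotting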

GradCAM Visualization

Gradient-weighted Class Activation Mapping shows which regions contribute most to detections:

  • Highlights important image regions for specific object classes
  • Helps understand model decision-making
  • Validates that models focus on correct features
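
GradCAM can be sketched in plain PyTorch with a pair of hooks: capture a convolutional layer's activations on the forward pass, pool that layer's gradients over space on the backward pass, and use the pooled values to weight the activation channels. A rough sketch, again reusing the Faster R-CNN model and image from above (detection pipelines complicate GradCAM, so the notebooks' exact recipe may differ):

import torch

activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["value"] = output
    output.register_hook(lambda grad: gradients.update(value=grad))

target_layer = model.backbone.body.layer4      # illustrative layer choice
handle = target_layer.register_forward_hook(fwd_hook)

pred = model([img])[0]                         # no torch.no_grad() here
pred["scores"][0].backward()                   # top detection's confidence

handle.remove()
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)    # pool over H, W
cam = torch.relu((weights * activations["value"]).sum(dim=1))  # weighted sum
cam = cam / cam.max().clamp(min=1e-8)          # normalize to [0, 1] for overlay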

Success and Failure Cases

Each model's performance is analyzed through:

  • Success cases: High-confidence, accurate detections
  • Failure cases: Missed objects, false positives, localization errors

Examples saved in respective output directories.

