A comprehensive implementation and evaluation of three state-of-the-art object detection architectures (Faster R-CNN, YOLOv11n, and DETR) on the COCO 2017 and Pascal VOC 2012 datasets.
## Table of Contents

- Overview
- Features
- Project Structure
- Models Implemented
- Datasets
- Installation
- Usage
- Evaluation Metrics
- Results
- Visualizations
## Overview

This project provides a side-by-side comparison of three object detection paradigms:
- Two-stage detectors (Faster R-CNN with ResNet-50 backbone)
- One-stage detectors (YOLOv11n)
- Transformer-based detectors (DETR with ResNet-101 backbone)
Each model is evaluated on standard benchmarks using consistent metrics, with detailed analysis of their strengths, weaknesses, and use cases.
## Features

- Three Detection Architectures: Implementations of Faster R-CNN, YOLOv11n, and DETR
- Comprehensive Evaluation: mAP, IoU, inference time, and FPS metrics
- Rich Visualizations:
  - Feature map extraction from multiple layers
  - GradCAM interpretability analysis
  - Success and failure case studies
  - Comparative performance charts
- Dual Dataset Evaluation: COCO 2017 and Pascal VOC 2012
- Detailed Documentation: Complete LaTeX report with architecture analysis
- Production-Ready Code: Well-structured, modular implementation
## Project Structure

```
Object-Detection-Models/
│
├── Faster-RCNN-Resnet50/          # Faster R-CNN implementation
│   └── faster-rcnn.ipynb          # Main notebook
│
├── Yolov11n/                      # YOLOv11n implementation
│   ├── output_yolov11n_coco/      # COCO dataset results
│   │   ├── coco-yolo-v-11-n.ipynb
│   │   ├── success_case_*.png
│   │   ├── failure_case_*.png
│   │   └── feature_maps_*.png
│   └── output_yolov11n_voc/       # Pascal VOC results
│       ├── pascal-voc-yolo-v-11-n.ipynb
│       └── [similar output files]
│
├── Detr/                          # DETR implementation
│   ├── output_detr_coco/          # COCO dataset results
│   └── output_detr_voc/           # Pascal VOC results
│
├── images/                        # Architecture diagrams and figures
│   ├── faster-rcnn-arch.png
│   ├── yolo11-arch.png
│   ├── detr-arch.png
│   ├── iou.png
│   ├── gradcam.png
│   └── ...
│
├── outputs_fasterrcnn/            # Faster R-CNN output files
├── coco-dataset/                  # COCO 2017 dataset
├── pascal-voc-dataset/            # Pascal VOC 2012 dataset
│
├── assignment_report.tex          # LaTeX source
├── assignment_report.pdf          # Final report
└── README.md                      # This file
```
## Models Implemented

### Faster R-CNN (ResNet-50)

Type: Two-stage detector
Architecture:
- Backbone: ResNet-50 with Feature Pyramid Network (FPN)
- Region Proposal Network (RPN): Generates candidate object regions
- RoI Head: Classifies and refines bounding boxes
Key Characteristics:
- High accuracy for complex scenes
- Excellent localization precision
- Slower inference compared to one-stage detectors
- Best for applications prioritizing accuracy over speed
Use Cases: Medical imaging, autonomous vehicles (non-real-time), quality inspection
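A minimal sketch of how this pipeline can be loaded and run (assuming `torchvision` is installed; `image.jpg` is a placeholder path):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Load the pretrained Faster R-CNN (ResNet-50 FPN) from torchvision
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Run inference on a single image; "image.jpg" is a placeholder path
image = to_tensor(Image.open("image.jpg").convert("RGB"))
with torch.no_grad():
    predictions = model([image])[0]

# Each prediction contains boxes (xyxy), class labels, and confidence scores
print(predictions["boxes"].shape, predictions["labels"], predictions["scores"])
```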
### YOLOv11n

Type: One-stage detector
Architecture:
- Backbone: CSPDarknet with Cross Stage Partial connections
- Neck: PANet (Path Aggregation Network)
- Head: Direct prediction of bounding boxes and classes
Key Characteristics:
- Real-time inference speed
- Good balance of accuracy and efficiency
- Excellent for detecting large objects
- Single forward pass architecture
Use Cases: Real-time surveillance, robotics, mobile applications, live video processing
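A minimal inference sketch with the Ultralytics API (the `yolo11n.pt` checkpoint name follows Ultralytics' naming for YOLO11; `image.jpg` is a placeholder):

```python
from ultralytics import YOLO

# Load pretrained YOLOv11n weights (downloaded automatically by Ultralytics)
model = YOLO("yolo11n.pt")

# Single forward pass on an image, keeping detections above 0.25 confidence
results = model("image.jpg", conf=0.25)

# Each result exposes boxes in xyxy format plus class IDs and confidences
for box in results[0].boxes:
    print(box.xyxy, box.cls, box.conf)
```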
### DETR (ResNet-101)

Type: Transformer-based detector
Architecture:
- Backbone: ResNet-101
- Transformer Encoder: Global context understanding
- Transformer Decoder: Direct set prediction with learned object queries
Key Characteristics:
- No anchor boxes or NMS required
- Global reasoning through self-attention
- Best performance on objects with unusual aspect ratios
- End-to-end trainable
Use Cases: Complex scene understanding, dense object detection, research applications
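A minimal loading sketch via Hugging Face `transformers` (post-processing the raw outputs into final boxes is shown in the Usage section below; `image.jpg` is a placeholder path):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Load DETR with a ResNet-101 backbone from the Hugging Face Hub
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-101")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-101")
model.eval()

# Preprocess and run the transformer encoder-decoder
image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # raw logits and box predictions per object query
```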
## Datasets

### COCO 2017

- Validation Set: 5,000 images
- Categories: 80 object classes
- Annotations: Bounding boxes, segmentation masks, keypoints
- Characteristics: Complex scenes, multiple objects, various scales
### Pascal VOC 2012

- Validation Set: 5,823 images
- Categories: 20 object classes
- Annotations: Bounding boxes, segmentation masks
- Characteristics: Single/few objects per image, clear backgrounds
## Installation

### Prerequisites

```
Python >= 3.10
CUDA >= 11.8 (for GPU acceleration)
```

### Clone the Repository

```bash
git clone https://github.com/ranimeshehata/Object-Detection-Models.git
cd Object-Detection-Models
```

### Set Up the Environment

```bash
# Using conda (recommended)
conda create -n cv python=3.10
conda activate cv
```

### Install Dependencies

```bash
# Core dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install ultralytics    # For YOLOv11
pip install pycocotools
pip install transformers   # For DETR
pip install numpy pandas matplotlib pillow
pip install tqdm opencv-python

# For visualization
pip install seaborn plotly
```

### Download Datasets

COCO 2017:

```bash
mkdir -p coco-dataset
cd coco-dataset
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
cd ..
```

Pascal VOC 2012:

```bash
mkdir -p pascal-voc-dataset
cd pascal-voc-dataset
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_11-May-2012.tar
cd ..
```

## Usage

### Faster R-CNN

```bash
jupyter notebook Faster-RCNN-Resnet50/faster-rcnn.ipynb
```

Key steps in the notebook:
- Load pretrained model from torchvision
- Configure COCO/VOC dataset loaders (see the sketch after this list)
- Run inference on validation set
- Calculate mAP and IoU metrics
- Generate feature maps and GradCAM visualizations
- Analyze success/failure cases
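For the dataset-loader step, the validation splits can come straight from `torchvision.datasets`; a minimal sketch, assuming the download layout from the Installation section:

```python
from torchvision.datasets import CocoDetection, VOCDetection

# COCO 2017 validation split (images + instance annotations)
coco_val = CocoDetection(
    root="coco-dataset/val2017",
    annFile="coco-dataset/annotations/instances_val2017.json",
)

# Pascal VOC 2012 validation split (root contains the extracted VOCdevkit/)
voc_val = VOCDetection(
    root="pascal-voc-dataset",
    year="2012",
    image_set="val",
    download=False,
)

image, target = coco_val[0]  # PIL image and a list of annotation dicts
```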
### YOLOv11n

For COCO:

```bash
jupyter notebook Yolov11n/output_yolov11n_coco/coco-yolo-v-11-n.ipynb
```

For Pascal VOC:

```bash
jupyter notebook Yolov11n/output_yolov11n_voc/pascal-voc-yolo-v-11-n.ipynb
```

Key steps:
- Load YOLOv11n pretrained weights
- Run batch inference with confidence threshold 0.25
- Convert predictions to COCO format
- Evaluate using the COCO evaluation API (sketched after this list)
- Extract feature maps from backbone layers
- Visualize success and failure cases
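The evaluation step follows the standard `pycocotools` pattern; a minimal sketch, assuming predictions have already been exported to a COCO-format results file (`predictions.json` is a hypothetical filename):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detections, both in COCO format
coco_gt = COCO("coco-dataset/annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")  # hypothetical results file

# Run the standard bounding-box evaluation
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints the AP/AR table; stats[1] is mAP@0.5
```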
### DETR

```bash
jupyter notebook Detr/detr_evaluation.ipynb
```

Key steps:
- Load DETR from Hugging Face transformers
- Process images through transformer encoder-decoder
- Apply Hungarian matching for evaluation
- Calculate metrics and visualize attention maps
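Converting DETR's raw query outputs into thresholded boxes uses the processor's post-processing helper; a minimal sketch continuing from the loading example in the Models section (reusing `processor`, `model`, `outputs`, and `image`):

```python
import torch

# target_sizes maps predictions back to the original image resolution
target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (w, h); we need (h, w)
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], score.item(), box.tolist())
```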
## Evaluation Metrics

### IoU (Intersection over Union)

```
IoU = Area of Intersection / Area of Union
```

Measures the overlap between predicted and ground-truth bounding boxes.
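For axis-aligned boxes in `(x1, y1, x2, y2)` format, this can be computed directly; a minimal sketch:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```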
### mAP (mean Average Precision)

```
mAP = (1/|C|) × Σ AP_c
```

Where:
- `C` is the set of categories
- `AP_c` is the Average Precision for category `c`
- Reported at IoU threshold 0.5 (mAP@0.5)
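A toy numerical example with hypothetical per-class AP values:

```python
# Hypothetical per-class AP values at IoU 0.5
ap_per_class = {"person": 0.80, "car": 0.65, "dog": 0.50}

# mAP is the unweighted mean of the per-class APs
map_50 = sum(ap_per_class.values()) / len(ap_per_class)
print(map_50)  # 0.65
```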
### Speed Metrics

- Inference Time: Average time per image (milliseconds)
- FPS: Frames per second = 1000 / inference_time_ms
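A minimal sketch of how latency and FPS could be measured, assuming a torchvision-style detection `model` and a list of preprocessed `images` (on GPU, synchronization is needed for honest timings):

```python
import time
import torch

def measure_latency(model, images):
    """Average per-image latency (ms) and FPS for a detection model."""
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        for img in images:
            _ = model([img])  # torchvision-style detection call
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed_ms = (time.perf_counter() - start) * 1000

    per_image_ms = elapsed_ms / len(images)
    return per_image_ms, 1000 / per_image_ms
```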
## Results

### COCO 2017

| Model | mAP@0.5 (%) | Avg IoU | Inference Time (ms) | FPS |
|---|---|---|---|---|
| Faster R-CNN | 46.1 | 0.5377 | 113.9 | 9.1 |
| YOLOv11n | 44.6 | 0.5051 | 16.4 | 56.0 |
| DETR | 60.5 | 0.6588 | 75.3 | 13.3 |
### Pascal VOC 2012

| Model | mAP@0.5 (%) | Avg IoU | Inference Time (ms) | FPS |
|---|---|---|---|---|
| Faster R-CNN | 48.7 | 0.3533 | 110.0 | 8.8 |
| YOLOv11n | 56.8 | 0.7176 | 14.5 | 70.0 |
| DETR | 74.2 | 0.8175 | 86.3 | 11.3 |
Key findings:

- DETR achieves the highest mAP and average IoU on both datasets, excelling at complex scenes
- YOLOv11n offers good accuracy at by far the fastest inference, making it the best fit for real-time needs
- Faster R-CNN delivers competitive accuracy with strong localization on COCO, but has the slowest inference of the three
## Visualizations

### Feature Maps

Feature maps are extracted from three layers of each model:
- Early layers: Low-level features (edges, textures, colors)
- Middle layers: Mid-level features (object parts, local structures)
- Late layers: High-level semantic features (complete objects)
Example outputs saved in:

- `Yolov11n/output_yolov11n_coco/feature_maps_8_channels.png`
- `Yolov11n/output_yolov11n_coco/feature_maps_average.png`
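Such maps are typically captured with forward hooks; a minimal sketch on a ResNet-50 backbone (the hooked layers are illustrative, not the exact ones used in the notebooks):

```python
import torch
import torchvision

model = torchvision.models.resnet50(weights="DEFAULT").eval()
feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Hook one early, one middle, and one late stage of the backbone
model.layer1.register_forward_hook(save_output("early"))
model.layer2.register_forward_hook(save_output("middle"))
model.layer4.register_forward_hook(save_output("late"))

with torch.no_grad():
    model(torch.randn(1, 3, 640, 640))  # dummy input in place of a real image

for name, fmap in feature_maps.items():
    print(name, fmap.shape)  # e.g. early: (1, 256, 160, 160)
```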
### GradCAM

Gradient-weighted Class Activation Mapping (Grad-CAM) shows which regions contribute most to detections:
- Highlights important image regions for specific object classes
- Helps understand model decision-making
- Validates that models focus on correct features
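A bare-bones Grad-CAM sketch, shown on a classification backbone for simplicity (the notebooks adapt the same idea to detection scores; the hooked layer and input are illustrative):

```python
import torch
import torchvision

model = torchvision.models.resnet50(weights="DEFAULT").eval()
activations, gradients = {}, {}

# Capture activations and their gradients at the last conv stage
def fwd_hook(module, inputs, output):
    activations["value"] = output

def bwd_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0]

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)  # stand-in for a real preprocessed image
score = model(x)[0].max()        # score of the top predicted class
score.backward()                 # gradients flow back to layer4

# Weight each channel by its average gradient, then combine and rectify
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * activations["value"]).sum(dim=1)).squeeze()
cam = cam / cam.max()            # normalized heatmap over the feature grid
print(cam.shape)                 # (7, 7) for a 224x224 input
```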
### Success and Failure Cases

Each model's performance is analyzed through:
- Success cases: High-confidence, accurate detections
- Failure cases: Missed objects, false positives, localization errors
Examples saved in respective output directories.