An AI-powered system that analyzes UI screenshots and identifies UI components with precise bounding boxes, using Microsoft OmniParser + GPT-4o.
```
┌─────────────────────────────────────────────────────────────────┐
│               Frontend (React + Vite + Tailwind)                │
│  - Chat interface with drag & drop image upload                 │
│  - Annotated image display with bounding boxes                  │
│  - Results: Table ↔ JSON toggle                                 │
│  - Copy JSON to clipboard                                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Backend (FastAPI + Python)                    │
│  POST /detect  - Analyze screenshot                             │
│  GET  /health  - Health check                                   │
│  POST /preload - Preload models                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 ▼                               ▼
    ┌──────────────────────────┐    ┌──────────────────────────┐
    │     OmniParser v2.0      │    │          GPT-4o          │
    │  (microsoft/OmniParser)  │    │  (Semantic Enrichment)   │
    │                          │    │                          │
    │ • YOLO: Icon detection   │    │ • UI type classification │
    │ • EasyOCR: Text regions  │    │ • Element descriptions   │
    │ • Precise bounding boxes │    │ • Confidence scores      │
    └──────────────────────────┘    └──────────────────────────┘
```
1. Image Upload - The user uploads a UI screenshot via drag & drop or the file picker
2. OmniParser Detection - The YOLO model detects icons/buttons; EasyOCR finds text regions
3. GPT-4o Enrichment - GPT-4o classifies the detected elements with semantic UI types and descriptions
4. Visual Output - The API returns an annotated image with colored bounding boxes plus structured JSON
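In code, that flow maps onto a single FastAPI endpoint. The sketch below is illustrative, not the actual `backend/main.py`; `run_omniparser`, `enrich_with_gpt4o`, and `draw_boxes` are hypothetical helpers standing in for the real stages:

```python
# Sketch of the /detect pipeline (illustrative, not the actual backend/main.py).
import base64

from dotenv import load_dotenv
from fastapi import FastAPI
from pydantic import BaseModel

load_dotenv()  # reads OPENAI_API_KEY from backend/.env
app = FastAPI()

class DetectRequest(BaseModel):
    image: str  # data URL, e.g. "data:image/png;base64,..."

@app.post("/detect")
def detect(req: DetectRequest) -> dict:
    # 1. Decode the screenshot out of its base64 data URL.
    png_bytes = base64.b64decode(req.image.split(",", 1)[1])
    # 2. OmniParser: YOLO finds icons/buttons, EasyOCR finds text regions.
    boxes = run_omniparser(png_bytes)               # hypothetical helper
    # 3. GPT-4o: attach semantic types, descriptions, confidence scores.
    elements = enrich_with_gpt4o(png_bytes, boxes)  # hypothetical helper
    # 4. Draw colored boxes and return the image plus structured JSON.
    annotated = draw_boxes(png_bytes, elements)     # hypothetical helper
    return {"elements": elements, "annotated_image": annotated}
```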
```bash
# Clone OmniParser (already included in this repo)
cd OmniParser

# Create virtual environment
python3 -m venv omni_venv
source omni_venv/bin/activate

# Install dependencies
pip install torch torchvision easyocr ultralytics==8.3.70 transformers supervision==0.18.0 opencv-python-headless accelerate timm einops==0.8.0 fastapi "uvicorn[standard]" openai python-dotenv

# Download OmniParser v2.0 model weights from Hugging Face
mkdir -p weights/icon_detect weights/icon_caption
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.pt --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/train_args.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/generation_config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/model.safetensors --local-dir weights

# Rename icon_caption to icon_caption_florence (required by OmniParser)
mv weights/icon_caption weights/icon_caption_florence

cd ..  # Back to project root
```
```bash
# Set OpenAI API key
export OPENAI_API_KEY=your-openai-api-key

# Or create a .env file in backend/
echo "OPENAI_API_KEY=your-openai-api-key" > backend/.env

# Activate the OmniParser venv and run the backend
source OmniParser/omni_venv/bin/activate
cd backend
uvicorn main:app --reload --port 8000
```
```bash
cd frontend

# Install dependencies
npm install

# Run dev server
npm run dev
```

Visit http://localhost:5173.
Note: The first detection request takes 30-60 seconds while models load. Subsequent requests are faster (~5-15s).
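To take that hit before the first user upload, you can call `POST /preload` once at startup. A minimal sketch using only the Python standard library:

```python
# Warm up the backend so the first /detect request is fast.
import urllib.request

req = urllib.request.Request("http://localhost:8000/preload", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 once models are loaded
```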
`POST /detect`

Analyze a UI screenshot and detect all UI elements.

Request:

```json
{
  "image": "data:image/png;base64,..."
}
```

Response:

```json
{
  "elements": [
    {
      "type": "button",
      "description": "Primary blue CTA button labeled 'Submit'",
      "confidence": 0.95,
      "region": "bottom-center",
      "bounds": {
        "x": 0.35,
        "y": 0.85,
        "width": 0.3,
        "height": 0.08
      }
    },
    {
      "type": "text",
      "description": "Email input label",
      "confidence": 0.9,
      "region": "top-left",
      "bounds": {
        "x": 0.1,
        "y": 0.2,
        "width": 0.15,
        "height": 0.03
      }
    }
  ],
  "summary": "A login form with email/password inputs and submit button",
  "annotated_image": "data:image/png;base64,..."
}
```

`GET /health`

Check server status and model loading state.
`POST /preload`

Preload models to speed up the first detection.
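A minimal `POST /detect` client, using only the Python standard library and the request/response shapes shown above:

```python
import base64
import json
import urllib.request

# Encode a local screenshot as the data URL the API expects.
with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

body = json.dumps({"image": data_url}).encode()
req = urllib.request.Request(
    "http://localhost:8000/detect",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["summary"])
for el in result["elements"]:
    print(el["type"], el["bounds"], el["confidence"])
```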
- Backend: FastAPI, Python 3.13
- AI Models:
  - OmniParser v2.0 (Microsoft) - YOLO for icon detection, EasyOCR for text
  - GPT-4o (OpenAI) - Semantic UI classification
- Frontend: React 18, Vite, Tailwind CSS, TypeScript
- ML Frameworks: PyTorch, Transformers, Ultralytics
Downloaded from microsoft/OmniParser-v2.0:
| Model | Size | Purpose |
|---|---|---|
| `icon_detect/model.pt` | ~40 MB | YOLO model for UI element detection |
| `icon_caption_florence/model.safetensors` | ~1 GB | Florence-2 for icon captioning |
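For orientation, loading these two checkpoints directly looks roughly like the sketch below. This bypasses OmniParser's own loading utilities, and the Florence-2 processor ID is an assumption:

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from ultralytics import YOLO

# YOLO detector for UI elements (icons, buttons, controls).
detector = YOLO("weights/icon_detect/model.pt")

# Florence-2 captioner; trust_remote_code is needed because Florence-2
# ships custom modeling code alongside its weights.
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption_florence", trust_remote_code=True
)
# Assumed processor source: the base Florence-2 processor from the Hub,
# paired with the local fine-tuned weights.
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
```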
- OmniParser for precise bounding boxes - Purpose-built for UI element detection with ~95% accuracy
- GPT-4o for semantic understanding - Excellent at classifying UI components and understanding context
- Hybrid approach - Best of both worlds: precise detection + semantic intelligence
- Normalized coordinates (0-1) - Works across any image size
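For example, mapping normalized `bounds` back to pixels for any render size is a small helper:

```python
def to_pixels(bounds: dict, img_w: int, img_h: int) -> tuple[int, int, int, int]:
    """Convert normalized (0-1) bounds to pixel (x, y, w, h) for a given image size."""
    return (
        round(bounds["x"] * img_w),
        round(bounds["y"] * img_h),
        round(bounds["width"] * img_w),
        round(bounds["height"] * img_h),
    )

# The 'Submit' button from the example response, on a 1280x800 screenshot:
print(to_pixels({"x": 0.35, "y": 0.85, "width": 0.3, "height": 0.08}, 1280, 800))
# -> (448, 680, 384, 64)
```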
- CUDA - Full GPU acceleration (fastest)
- MPS - Apple Silicon acceleration (M1/M2/M3)
- CPU - Fallback (slower but works everywhere)
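Selecting a device along this fallback chain is a few lines of PyTorch (a sketch; the backend may do this differently):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
```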
- First request is slow due to model loading (~30-60s)
- Florence-2 captioning may have compatibility issues with newer transformers versions
- Large images may take longer to process
```
iui/
├── backend/
│   ├── main.py          # FastAPI server with OmniParser integration
│   ├── requirements.txt # Python dependencies
│   └── .env             # API keys (not committed)
├── frontend/
│   ├── src/
│   │   └── App.tsx      # React chat interface
│   ├── package.json
│   └── vite.config.ts
├── OmniParser/
│   ├── weights/         # Model weights (downloaded)
│   │   ├── icon_detect/
│   │   └── icon_caption_florence/
│   └── omni_venv/       # Python virtual environment
└── README.md
```
MIT