An AI-powered system that analyzes UI screenshots and identifies UI components with precise bounding boxes, using Microsoft OmniParser + GPT-4o.
```
┌─────────────────────────────────────────────────────────────────┐
│               Frontend (React + Vite + Tailwind)                │
│  - Chat interface with drag & drop image upload                 │
│  - Annotated image display with bounding boxes                  │
│  - Results: Table ↔ JSON toggle                                 │
│  - Copy JSON to clipboard                                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Backend (FastAPI + Python)                    │
│  POST /detect  - Analyze screenshot                             │
│  GET  /health  - Health check                                   │
│  POST /preload - Preload models                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 ▼                               ▼
    ┌──────────────────────────┐    ┌──────────────────────────┐
    │     OmniParser v2.0      │    │          GPT-4o          │
    │  (microsoft/OmniParser)  │    │  (Semantic Enrichment)   │
    │                          │    │                          │
    │ • YOLO: Icon detection   │    │ • UI type classification │
    │ • EasyOCR: Text regions  │    │ • Element descriptions   │
    │ • Precise bounding boxes │    │ • Confidence scores      │
    └──────────────────────────┘    └──────────────────────────┘
```
1. Image Upload - The user uploads a UI screenshot via drag & drop or the file picker
2. OmniParser Detection - The YOLO model detects icons/buttons; EasyOCR finds text regions
3. GPT-4o Enrichment - GPT-4o classifies the detected elements with semantic UI types and descriptions
4. Visual Output - The API returns an annotated image with colored bounding boxes plus structured JSON
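In code, that flow maps onto a single FastAPI endpoint. The sketch below is illustrative, not the actual `backend/main.py`; `run_omniparser`, `enrich_with_gpt4o`, and `draw_boxes` are hypothetical helpers standing in for the real stages:

```python
# Sketch of the /detect pipeline (illustrative, not the actual backend/main.py).
import base64

from dotenv import load_dotenv
from fastapi import FastAPI
from pydantic import BaseModel

load_dotenv()  # reads OPENAI_API_KEY from backend/.env
app = FastAPI()

class DetectRequest(BaseModel):
    image: str  # data URL, e.g. "data:image/png;base64,..."

@app.post("/detect")
def detect(req: DetectRequest) -> dict:
    # 1. Decode the screenshot out of its base64 data URL.
    png_bytes = base64.b64decode(req.image.split(",", 1)[1])
    # 2. OmniParser: YOLO finds icons/buttons, EasyOCR finds text regions.
    boxes = run_omniparser(png_bytes)               # hypothetical helper
    # 3. GPT-4o: attach semantic types, descriptions, confidence scores.
    elements = enrich_with_gpt4o(png_bytes, boxes)  # hypothetical helper
    # 4. Draw colored boxes and return the image plus structured JSON.
    annotated = draw_boxes(png_bytes, elements)     # hypothetical helper
    return {"elements": elements, "annotated_image": annotated}
```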
```bash
# Clone OmniParser (already included in this repo)
cd OmniParser

# Create virtual environment
python3 -m venv omni_venv
source omni_venv/bin/activate

# Install dependencies
pip install torch torchvision easyocr ultralytics==8.3.70 transformers supervision==0.18.0 opencv-python-headless accelerate timm einops==0.8.0 fastapi "uvicorn[standard]" openai python-dotenv

# Download OmniParser v2.0 model weights from Hugging Face
mkdir -p weights/icon_detect weights/icon_caption
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.pt --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/model.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_detect/train_args.yaml --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/generation_config.json --local-dir weights
huggingface-cli download microsoft/OmniParser-v2.0 icon_caption/model.safetensors --local-dir weights

# Rename icon_caption to icon_caption_florence (required by OmniParser)
mv weights/icon_caption weights/icon_caption_florence

cd ..  # Back to project root
```
```bash
# Set OpenAI API key
export OPENAI_API_KEY=your-openai-api-key

# Or create a .env file in backend/
echo "OPENAI_API_KEY=your-openai-api-key" > backend/.env

# Activate the OmniParser venv and run the backend
source OmniParser/omni_venv/bin/activate
cd backend
uvicorn main:app --reload --port 8000
```
```bash
cd frontend

# Install dependencies
npm install

# Run dev server
npm run dev
```

Visit http://localhost:5173.
Note: The first detection request takes 30-60 seconds while models load. Subsequent requests are faster (~5-15s).
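To take that hit before the first user upload, you can call `POST /preload` once at startup. A minimal sketch using only the Python standard library:

```python
# Warm up the backend so the first /detect request is fast.
import urllib.request

req = urllib.request.Request("http://localhost:8000/preload", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 once models are loaded
```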
`POST /detect`

Analyze a UI screenshot and detect all UI elements.

Request:

```json
{
  "image": "data:image/png;base64,..."
}
```

Response:

```json
{
  "elements": [
    {
      "type": "button",
      "description": "Primary blue CTA button labeled 'Submit'",
      "confidence": 0.95,
      "region": "bottom-center",
      "bounds": {
        "x": 0.35,
        "y": 0.85,
        "width": 0.3,
        "height": 0.08
      }
    },
    {
      "type": "text",
      "description": "Email input label",
      "confidence": 0.9,
      "region": "top-left",
      "bounds": {
        "x": 0.1,
        "y": 0.2,
        "width": 0.15,
        "height": 0.03
      }
    }
  ],
  "summary": "A login form with email/password inputs and submit button",
  "annotated_image": "data:image/png;base64,..."
}
```

`GET /health`

Check server status and model loading state.
`POST /preload`

Preload models to speed up the first detection.
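A minimal `POST /detect` client, using only the Python standard library and the request/response shapes shown above:

```python
import base64
import json
import urllib.request

# Encode a local screenshot as the data URL the API expects.
with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

body = json.dumps({"image": data_url}).encode()
req = urllib.request.Request(
    "http://localhost:8000/detect",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["summary"])
for el in result["elements"]:
    print(el["type"], el["bounds"], el["confidence"])
```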
- Backend: FastAPI, Python 3.13
- AI Models:
  - OmniParser v2.0 (Microsoft) - YOLO for icon detection, EasyOCR for text
  - GPT-4o (OpenAI) - Semantic UI classification
- Frontend: React 18, Vite, Tailwind CSS, TypeScript
- ML Frameworks: PyTorch, Transformers, Ultralytics
Downloaded from microsoft/OmniParser-v2.0:
| Model | Size | Purpose |
|---|---|---|
| `icon_detect/model.pt` | ~40 MB | YOLO model for UI element detection |
| `icon_caption_florence/model.safetensors` | ~1 GB | Florence-2 for icon captioning |
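For orientation, loading these two checkpoints directly looks roughly like the sketch below. This bypasses OmniParser's own loading utilities, and the Florence-2 processor ID is an assumption:

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from ultralytics import YOLO

# YOLO detector for UI elements (icons, buttons, controls).
detector = YOLO("weights/icon_detect/model.pt")

# Florence-2 captioner; trust_remote_code is needed because Florence-2
# ships custom modeling code alongside its weights.
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption_florence", trust_remote_code=True
)
# Assumed processor source: the base Florence-2 processor from the Hub,
# paired with the local fine-tuned weights.
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
```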
- OmniParser for precise bounding boxes - Purpose-built for UI element detection with ~95% accuracy
- GPT-4o for semantic understanding - Excellent at classifying UI components and understanding context
- Hybrid approach - Best of both worlds: precise detection + semantic intelligence
- Normalized coordinates (0-1) - Works across any image size
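For example, mapping normalized `bounds` back to pixels for any render size is a small helper:

```python
def to_pixels(bounds: dict, img_w: int, img_h: int) -> tuple[int, int, int, int]:
    """Convert normalized (0-1) bounds to pixel (x, y, w, h) for a given image size."""
    return (
        round(bounds["x"] * img_w),
        round(bounds["y"] * img_h),
        round(bounds["width"] * img_w),
        round(bounds["height"] * img_h),
    )

# The 'Submit' button from the example response, on a 1280x800 screenshot:
print(to_pixels({"x": 0.35, "y": 0.85, "width": 0.3, "height": 0.08}, 1280, 800))
# -> (448, 680, 384, 64)
```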
- CUDA - Full GPU acceleration (fastest)
- MPS - Apple Silicon acceleration (M1/M2/M3)
- CPU - Fallback (slower but works everywhere)
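Selecting a device along this fallback chain is a few lines of PyTorch (a sketch; the backend may do this differently):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
```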
- First request is slow due to model loading (~30-60s)
- Florence-2 captioning may have compatibility issues with newer transformers versions
- Large images may take longer to process
```
iui/
├── backend/
│   ├── main.py          # FastAPI server with OmniParser integration
│   ├── requirements.txt # Python dependencies
│   └── .env             # API keys (not committed)
├── frontend/
│   ├── src/
│   │   └── App.tsx      # React chat interface
│   ├── package.json
│   └── vite.config.ts
├── OmniParser/
│   ├── weights/         # Model weights (downloaded)
│   │   ├── icon_detect/
│   │   └── icon_caption_florence/
│   └── omni_venv/       # Python virtual environment
└── README.md
```
MIT