Introducing MAST, our vision for a suite of realistic clinical benchmarks to evaluate real-world performance of medical AI systems.
First, Do NOHARM is the foundational benchmark of the MAST suite, and establishes a new framework to assess clinical safety and accuracy in AI-generated medical recommendations.
This interactive dashboard visualizes performance metrics from the NOHARM benchmark, comparing solo AI models and multi-agent teams across various medical conditions and harm scenarios.
- data/ - Source CSV files with benchmark results
- frontend/ - Next.js dashboard application
- .github/workflows/ - CI/CD automation
- render.yaml - Deployment configuration
cd frontend
npm install
npm run dev

Visit http://localhost:3000 to explore the dashboard. The dev script automatically rebuilds data from data/metrics.csv.
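To illustrate the kind of CSV-to-objects step such a rebuild might involve, here is a minimal sketch. It is not the project's actual build script: the `parseSimpleCsv` helper and the `Row` type are hypothetical, and it assumes a simple CSV with a header row and no quoted fields or embedded commas.

```typescript
// Hypothetical row shape: keys come from whatever header the CSV has.
// Nothing here is taken from the real data/metrics.csv schema.
type Row = Record<string, string>;

// Parse simple CSV text (header row, no quoted fields or embedded commas)
// into an array of row objects keyed by column name.
function parseSimpleCsv(text: string): Row[] {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];

  const headerLine = lines[0] ?? "";
  const header = headerLine.split(",").map((name) => name.trim());

  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const row: Row = {};
    header.forEach((name, i) => {
      // Missing trailing cells become empty strings.
      row[name] = (cells[i] ?? "").trim();
    });
    return row;
  });
}
```

A real pipeline would more likely use a dedicated CSV parser, since hand-splitting on commas breaks as soon as a field contains a quoted comma.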
- Scatter Charts - Compare models across any two metrics with confidence intervals
- Top/Bottom Performers - Bar charts ranking models by selected metrics
- Model Search - Find and compare specific models or agent combinations
- Flexible Filtering - Filter by team size, conditions, harm types, and case scope
- Radar Profiles - Multi-dimensional performance visualization
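The ranking and filtering features above can be sketched as follows. This is an illustration under stated assumptions, not the dashboard's implementation: the `ModelResult` shape, the `teamSize` field, and the `rankByMetric` and `topAndBottom` helpers are all hypothetical, and the sketch assumes that a higher metric value is better.

```typescript
// Hypothetical record shape; the dashboard's real field names may differ.
interface ModelResult {
  name: string;
  teamSize: number; // assumed: 1 = solo model, >1 = multi-agent team
  metrics: Record<string, number>;
}

// Sort results by one metric, best first (assumes higher is better).
// Entries without a finite value for that metric are dropped.
function rankByMetric(results: ModelResult[], metric: string): ModelResult[] {
  const scored: { result: ModelResult; value: number }[] = [];
  for (const result of results) {
    const value = result.metrics[metric];
    if (typeof value === "number" && Number.isFinite(value)) {
      scored.push({ result, value });
    }
  }
  return scored.sort((a, b) => b.value - a.value).map((s) => s.result);
}

// Return the n best entries and the n worst entries (worst first).
function topAndBottom(results: ModelResult[], metric: string, n: number) {
  const ranked = rankByMetric(results, metric);
  const start = Math.max(ranked.length - n, 0);
  return {
    top: ranked.slice(0, Math.max(n, 0)),
    bottom: ranked.slice(start).reverse(),
  };
}

// Example filter: keep only solo models before ranking.
function soloOnly(results: ModelResult[]): ModelResult[] {
  return results.filter((r) => r.teamSize === 1);
}
```

For example, `topAndBottom(soloOnly(results), "accuracy", 5)` would give the five highest and five lowest solo models on a metric named `accuracy`, if such a metric exists in the data.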
- AGENTS.md - Comprehensive architecture and development guide
- frontend/README.md - Frontend-specific documentation
- PLANNING.md - Architecture planning notes