The new leader on the world's most challenging AI benchmark
Note: Individual model scores below are higher than their published results because our harness uses custom instructions, web search, and retries when confidence is too low. Relative rankings, however, remain consistent with published results, and Sup AI maintains a significant lead.
Sup AI has achieved 52.15% accuracy on Humanity's Last Exam (HLE), surpassing all individual frontier models and setting a new SOTA on this exceptionally challenging benchmark. This result demonstrates that ensemble orchestration can unlock capabilities beyond what any single model achieves alone.
| Metric | Value |
|---|---|
| Accuracy | 52.15% |
| Questions Evaluated | 1,369 / 2,500 public* |
| Lead Over Next Best | +7.41 percentage points |
| Calibration Error (ECE) | 36.54%** |
* Questions were selected randomly. We plan to evaluate all 2,500 public questions.
** HLE measures Expected Calibration Error (ECE) by asking models to self-report their confidence. We believe this is fundamentally flawed: confidence can only truly be measured by analyzing output token probability distributions (which Sup AI uses extensively). An ECE of 36.54% means the model's stated confidence is, on average, 36.54 percentage points away from its true accuracy.
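For reference, ECE is computed by binning answers by stated confidence and comparing each bin's average confidence to its empirical accuracy. A minimal sketch (the function name and 10-bin scheme are illustrative, not taken from this repository's scripts):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    average confidence to its empirical accuracy (standard ECE)."""
    confidences = np.asarray(confidences, dtype=float)  # values in [0, 1]
    correct = np.asarray(correct, dtype=float)           # 1.0 if judged correct
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece  # e.g. 0.3654 corresponds to the 36.54% reported above
```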
Humanity's Last Exam (HLE) is a benchmark designed to be the ultimate test of AI reasoning capabilities. Created by the Center for AI Safety and Scale AI in collaboration with subject-matter experts worldwide, HLE contains 2,500 public questions spanning advanced mathematics, science, logic, and interdisciplinary reasoning. These problems are curated to challenge even the most sophisticated AI systems.
Unlike conventional benchmarks that have become saturated, HLE was specifically designed to remain difficult as AI capabilities advance. A score of 52% represents a significant milestone in AI development.
Sup AI's ensemble approach significantly outperforms all individual frontier models:
| Model | Accuracy | n | Δ vs Sup AI |
|---|---|---|---|
| Sup AI | 52.15% | 1,369 | — |
| Google Gemini 3 Pro Preview | 44.74% | 1,332 | -7.41 |
| OpenAI GPT-5 Pro | 39.53% | 1,194 | -12.62 |
| OpenAI GPT-5.1 | 38.23% | 1,287 | -13.92 |
| Anthropic Claude Opus 4.5 | 29.66% | 1,335 | -22.49 |
| xAI Grok-4 | 29.05% | 1,153 | -23.10 |
| DeepSeek v3.2 Thinking | 24.13% | 1,173 | -28.02 |
| ZhipuAI GLM-4.6 | 23.08% | 52 | -29.07 |
| Anthropic Claude Sonnet 4.5 | 18.11% | 1,259 | -34.04 |
| Moonshot Kimi K2 Thinking | 17.55% | 1,242 | -34.60 |
| Alibaba Qwen3 Max | 17.31% | 52 | -34.84 |
| Google Gemini 2.5 Pro | 16.51% | 1,254 | -35.64 |
Sup AI orchestrates the frontier models best suited to each prompt and aggregates their responses to produce higher-quality answers than any individual model. We examine the token probability distributions of each response, and of individual chunks of a response where appropriate, to estimate confidence.
When synthesizing the final response, we weight individual response chunks by our confidence in them. If models disagree significantly or confidence is too low, we retry. The degree to which a model specializes in the prompt's subject matter also factors into the weight we assign it.
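As a rough illustration only, a confidence-weighted vote over candidate answers might look like the sketch below; the function, field names, and threshold are hypothetical, and the production system additionally weights chunks by token probability distributions as described above.

```python
from collections import defaultdict

def aggregate(candidates, min_confidence=0.55):
    """candidates: list of dicts like
    {"model": "m1", "answer": "42", "confidence": 0.9, "specialization": 1.2}
    Returns the winning answer, or None to signal that a retry is needed."""
    scores = defaultdict(float)
    for c in candidates:
        # Weight each candidate by its confidence and a per-model
        # specialization factor for this prompt's topic.
        scores[c["answer"]] += c["confidence"] * c.get("specialization", 1.0)

    best_answer, best_score = max(scores.items(), key=lambda kv: kv[1])
    total = sum(scores.values())

    # If no answer clearly dominates, treat the result as low confidence.
    if total == 0 or best_score / total < min_confidence:
        return None
    return best_answer
```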
Not all of the models Sup AI uses are multimodal, yet some HLE questions include images. Sup AI lets you pass PDFs and images to models that don't natively support them by pre-processing the files.
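A minimal sketch of that kind of pre-processing, assuming OCR via pytesseract and PDF text extraction via pypdf; the actual pipeline Sup AI uses is not part of this repository:

```python
from pathlib import Path

from pypdf import PdfReader   # PDF text extraction
from PIL import Image
import pytesseract            # OCR for images

def to_text(path: str) -> str:
    """Convert a PDF or image into plain text a text-only model can read."""
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        reader = PdfReader(p)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return pytesseract.image_to_string(Image.open(p))
```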
Sup AI achieves emergent capabilities that exceed the sum of its parts.
- Question Selection: Questions were chosen at random from the 2,500 public HLE questions
- Question Processing: Each HLE question (text and optional image) is submitted to the Sup AI API
- Response Format: Models respond with structured output containing explanation, answer, and confidence score
- Automated Judging: Responses are evaluated using GPT-5.1 with structured output parsing (Pydantic strict mode)
- Metrics Calculation: Accuracy, variance (Wald estimator), and Expected Calibration Error (ECE) are computed
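A condensed sketch of this pipeline, with hypothetical `ask_sup_ai` and `judge` callables standing in for the real API calls (the actual scripts are src/run_model.py and src/run_judge.py):

```python
from pydantic import BaseModel

class ModelResponse(BaseModel):
    """Structured output: explanation, answer, and self-reported confidence."""
    explanation: str
    answer: str
    confidence: float  # 0.0 - 1.0

def evaluate(questions, ask_sup_ai, judge):
    """questions: iterable of {"question": ..., "image": ..., "answer": ...}
    ask_sup_ai / judge: callables standing in for the Sup AI and GPT-5.1 calls."""
    results = []
    for q in questions:
        raw = ask_sup_ai(q["question"], image=q.get("image"))
        parsed = ModelResponse.model_validate_json(raw)
        correct = judge(parsed.answer, q["answer"])  # exact / semantic / numeric match
        results.append({"correct": correct, "confidence": parsed.confidence})
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results
```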
The responses are generated exactly as if you had pasted the question into Sup AI's chat interface at https://sup.ai. There is no custom prompting for benchmarks; even the HLE system prompt is inserted as ordinary user instructions, which any user can edit.
Web Search: Models have access to web search during evaluation; however, HLE answers cannot be obtained through web search. In a few instances, models attempted to find answers online, but we verified they were unsuccessful.
System: Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}
User: [Question text + optional image]
The automated judge uses GPT-5.1 to extract the final answer and compare it against the ground truth, accounting for:
- Exact matches
- Semantic equivalence
- Numerical tolerance for quantitative answers
The judging methodology was adapted from the official HLE repository.
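For illustration, a judge call along these lines can be made with the OpenAI Python SDK's structured-output parsing and a strict Pydantic schema; the prompt wording and field names below are assumptions, not the exact code in src/run_judge.py:

```python
from openai import OpenAI
from pydantic import BaseModel

class Judgment(BaseModel):
    extracted_answer: str
    correct: bool      # exact match, semantic equivalence, or within numeric tolerance
    reasoning: str

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, model_answer: str, ground_truth: str) -> Judgment:
    completion = client.beta.chat.completions.parse(
        model="gpt-5.1",  # judge model used in this evaluation
        messages=[
            {"role": "system", "content": (
                "Decide whether the model answer matches the ground truth: "
                "exact match, semantic equivalence, or numerical tolerance "
                "for quantitative answers."
            )},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Model answer: {model_answer}\n"
                f"Ground truth: {ground_truth}"
            )},
        ],
        response_format=Judgment,  # strict, schema-validated structured output
    )
    return completion.choices[0].message.parsed
```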
python3.13 -m venv .venv
source .venv/bin/activate
pip3.13 install -r requirements.txt
Required environment variables:
- `SUPAI_API_KEY`: Your Sup AI API key (contact us)
- `OPENAI_API_KEY`: OpenAI API key (for judging)
# Generate predictions using Sup AI
python src/run_model.py \
--dataset cais/hle \
--model pro \
--num_workers 10
# Judge responses against ground truth
python src/run_judge.py \
--dataset cais/hle \
--predictions hle_pro.json \
--num_workers 10
# Generate metrics and visualization
python src/run_metrics.py --predictions judged_hle_pro.json
With n=1,369 questions, the 95% confidence interval for Sup AI's accuracy is approximately ±2.65 percentage points, meaning the true accuracy lies between 49.50% and 54.80% with 95% confidence.
The gap between Sup AI (52.15%) and the next-best model, Gemini 3 Pro Preview (44.74%), is statistically significant at p < 0.001.
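Both figures follow from standard formulas; the sketch below reproduces the Wald interval and a two-sided two-proportion z-test (SciPy is assumed only for the normal CDF):

```python
import math
from scipy.stats import norm

def wald_ci(p, n, z=1.96):
    """95% Wald confidence interval for a binomial proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - norm.cdf(abs(z)))

print(wald_ci(0.5215, 1369))                               # ~ (0.4950, 0.5480)
print(two_proportion_p_value(0.5215, 1369, 0.4474, 1332))  # < 0.001
```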
This result demonstrates several important findings:
- Ensemble systems can exceed individual model capabilities: the ensemble approach achieves accuracy 7+ percentage points higher than the best single model.
- Complementary model strengths: different models excel at different problem types; orchestration captures these diverse capabilities.
- Benchmark validity: despite exceeding 50% accuracy, significant headroom remains, indicating HLE continues to effectively measure frontier AI capabilities.
- Practical accessibility: these capabilities are available today through Sup AI, enabling applications requiring SOTA reasoning.
See questions.md
| File | Description |
|---|---|
| `src/run_model.py` | Generates predictions using the Sup AI API |
| `src/run_judge.py` | Evaluates predictions against ground truth |
| `src/run_metrics.py` | Computes metrics and generates visualizations |
| `hle_pro.json` | Raw model predictions |
| `judged_hle_pro.json` | Predictions with judge evaluations |
| `metrics.json` | Computed accuracy metrics by model |
| `metrics.png` | Accuracy comparison bar chart |
| `questions.md` | Per-question correctness table for each model |
| `traces/` | Full response traces for each question, exactly as they would be streamed from the Sup AI API |
If you use these results or methodology, please cite:
@misc{supai-hle-2025,
title={Sup AI Achieves 52.15% on Humanity's Last Exam},
author={Sup AI},
year={2025},
url={https://github.com/supaihq/hle}
}
This evaluation code is provided for research and reproducibility purposes.
Try Sup AI: sup.ai | HLE Benchmark: lastexam.ai | Leaderboard: scale.com
