Add benchmark results for anthropic/claude-4.5-sonnet #339

mentatbot · 2025-09-29T18:08:40Z

This PR adds comprehensive benchmark results for anthropic/claude-4.5-sonnet on the locodiff-250425 benchmark set.

Results Summary

Total cases: 200/200 (100% attempted)
✅ Successful: 157/200 (78.5%)
❌ Failed: 43/200 (21.5% - all output mismatches)
⚠️ API Errors: 0/200 (0%)
💰 Total cost: $47.65

Benchmark Configuration

Model: anthropic/claude-4.5-sonnet
Concurrency: 20
Benchmark directory: locodiff-250425
Test cases: All 200 cases from the benchmark set

Performance Analysis

The model achieves a 78.5% success rate on this benchmark, with all failures being output mismatches. No API errors or empty outputs remain in the final results.

Execution Notes

This benchmark run required 3 iterations to complete:

Initial run: 142 successful, 17 API errors (temporary model availability issue)
Second run: 14 successful, 1 API error (transient JSONDecodeError)
Third run: 1 successful, 0 API errors

All API errors were resolved through retries, so none remain in the final results.
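The per-iteration counts can be checked against the final summary with a short tally. This is a minimal sketch, not code from the benchmark repository. It assumes each retry re-ran only the cases that hit API errors in the previous iteration, and that any case in an iteration that was neither successful nor an API error was an output mismatch. The function name `reconcile` and the dictionary keys are hypothetical. The per-iteration mismatch counts it derives are inferred, since the PR reports only the final total.

```python
# Per-iteration outcomes as reported in the execution notes.
TOTAL_CASES = 200
RUNS = [
    {"successful": 142, "api_errors": 17},  # initial run
    {"successful": 14, "api_errors": 1},    # second run
    {"successful": 1, "api_errors": 0},     # third run
]


def reconcile(total_cases, runs):
    """Accumulate outcomes across retry iterations.

    Assumption (not stated in the PR): each iteration re-runs only the
    cases that ended in an API error in the previous iteration.
    """
    pending = total_cases  # cases attempted in the current iteration
    successful = 0
    mismatches = 0
    for run in runs:
        # Whatever is neither successful nor an API error is treated
        # as an output mismatch (inferred, not reported per iteration).
        mismatches += pending - run["successful"] - run["api_errors"]
        successful += run["successful"]
        pending = run["api_errors"]  # only these are retried next
    return {
        "successful": successful,
        "mismatches": mismatches,
        "unresolved_api_errors": pending,
    }


totals = reconcile(TOTAL_CASES, RUNS)
success_rate = round(100 * totals["successful"] / TOTAL_CASES, 1)
print(totals, success_rate)
```

Under these assumptions the tally matches the summary: 157 successful, 43 output mismatches, and no unresolved API errors.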

🤖 This PR was created with Mentat. See my steps and cost here ✨


Completed comprehensive benchmark run for anthropic/claude-4.5-sonnet on the locodiff-250425 benchmark set.

## Results Summary

- **Total cases**: 200/200 (100% attempted)
- **✅ Successful**: 157/200 (78.5%)
- **❌ Failed**: 43/200 (21.5% - all output mismatches)
- **⚠️ API Errors**: 0/200 (0%)
- **💰 Total cost**: $47.65

## Benchmark Details

The benchmark was run with:

- Concurrency: 20
- Benchmark directory: locodiff-250425
- All 200 test cases from the benchmark set

The model achieved a 78.5% success rate, with all failures being output mismatches (no API errors or empty outputs in the final results).

## Notes

This benchmark run required 3 iterations:

1. Initial run: 142 successful, 17 API errors (model availability issue)
2. Second run: 14 successful, 1 API error (transient JSONDecodeError)
3. Third run: 1 successful, 0 API errors

All API errors were successfully resolved through retries.

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/3e467591-0742-472f-ae6d-47b9f356add2

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

- Added display name for anthropic/claude-4.5-sonnet in benchmark_config.yaml
- Generated visualization pages for all models including the new Sonnet 4.5 results
- Updated docs/index.html with latest benchmark data
- Created 200 case pages for Sonnet 4.5 model
- Updated chart data and styling

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/92564c0c-4e97-4d07-81a4-4113ef5127da

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>

mentatbot bot requested a review from biobootloader September 29, 2025 18:08

biobootloader merged commit 176a24b into main Sep 29, 2025
1 check passed
