
Conversation


mentatbot bot commented on Aug 13, 2025

This PR adds benchmark results for the anthropic/claude-opus-4.1 model, running 50 test cases with concurrency 15 as requested.
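
For context, here is a minimal sketch of how a run capped at 15 concurrent cases can be driven with an `asyncio.Semaphore`. The `run_case` coroutine and the case IDs are hypothetical placeholders for illustration, not the actual LoCoDiff-bench harness.

```python
import asyncio

CONCURRENCY = 15  # matches the requested concurrency for this run

async def run_case(case_id: str) -> bool:
    """Hypothetical placeholder: run one benchmark case, return True on success."""
    await asyncio.sleep(0)  # stand-in for the real model call and diff check
    return True

async def run_all(case_ids: list[str]) -> list[bool]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(case_id: str) -> bool:
        # At most CONCURRENCY cases hold the semaphore at any time.
        async with sem:
            return await run_case(case_id)

    return await asyncio.gather(*(bounded(c) for c in case_ids))

if __name__ == "__main__":
    results = asyncio.run(run_all([f"case-{i}" for i in range(50)]))
    print(f"{sum(results)}/{len(results)} successful")
```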

Benchmark Summary

  • Model: anthropic/claude-opus-4.1
  • Cases Run: 50 (out of 200 available)
  • Success Rate: 58% (29 successful, 21 failed)
  • Total Cost: $58.96
  • Concurrency: 15

Results Breakdown

  • Successful Cases: 29
  • Failed Cases: 21 (all due to output mismatch)
  • API Errors: 0 (clean run with no technical issues)

All benchmark result files are saved in locodiff-250425/results/*/anthropic_claude-opus-4.1/ with complete metadata, raw responses, extracted outputs, and diff files for analysis.

The benchmark ran smoothly with no API errors, indicating good model availability and stable API behavior. The 58% success rate establishes a baseline for comparing this model's code generation capabilities against other models in the benchmark suite.
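
As a rough illustration of how those per-case files can be aggregated, the sketch below walks the results directory and tallies successes and cost. The `metadata.json` filename and its `success`/`cost` fields are assumptions made for this example; the real file layout may differ.

```python
import json
from pathlib import Path

# Assumed layout: one directory per case, each containing a metadata.json
# with a boolean "success" field and a numeric "cost" field (an assumption,
# not verified against the repository).
RESULTS_DIR = Path("locodiff-250425/results")

def summarize(model_dir_name: str = "anthropic_claude-opus-4.1") -> None:
    successes = failures = 0
    total_cost = 0.0
    for meta_path in RESULTS_DIR.glob(f"*/{model_dir_name}/metadata.json"):
        meta = json.loads(meta_path.read_text())
        if meta.get("success"):
            successes += 1
        else:
            failures += 1
        total_cost += meta.get("cost", 0.0)
    total = successes + failures
    rate = 100 * successes / total if total else 0.0
    print(f"{successes}/{total} successful ({rate:.1f}%), total cost ${total_cost:.2f}")

if __name__ == "__main__":
    summarize()
```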


🤖 This PR was created with Mentat. See my steps and cost here


- Ran 50 benchmark cases with concurrency 15
- Success rate: 58% (29/50 successful)
- Total cost: $58.96
- All failures were output mismatches, no API errors

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/38762b95-d970-4ed0-9002-1799026a60a0

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
mentatbot bot requested a review from biobootloader on August 13, 2025 at 18:54
mentatbot bot and others added 4 commits August 13, 2025 19:07
- Ran additional 50 benchmark cases (total now 100/200)
- Second batch success rate: 60% (30/50 successful)
- Combined success rate: 59% (59/100 successful)
- Second batch cost: $53.51
- Total cost: $112.47
- All failures were output mismatches, no API errors

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/ffb63411-4236-49bc-b205-8de2ff3108c6

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Completed full benchmark suite: 200/200 cases
- Third batch success rate: 62% (62/100 successful)
- Final overall success rate: 60.5% (121/200 successful)
- Third batch cost: $97.98
- Total benchmark cost: $210.45
- 3 API errors in final batch (technical issues, not model failures)
- Benchmark suite now complete for anthropic/claude-opus-4.1

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/469b00b4-ef40-41e9-9ed3-8a85b6366713

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
…de-opus-4.1

- Successfully retried 3 cases that had API errors
- 2 additional successes, 1 failure (output mismatch)
- Final success rate: 61.5% (123/200 successful)
- Additional cost: $5.02
- Final total cost: $215.48
- Benchmark suite now 100% complete with 0 API errors

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/45656c60-4a19-4499-9037-99db5311a351

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Added display name for anthropic/claude-opus-4.1 in benchmark_config.yaml (see the sketch below)
- Generated complete documentation pages covering all 27 models
- Updated docs/index.html with the latest benchmark results
- Generated 200 case pages for the Claude Opus 4.1 model
- Documentation now reflects the final 61.5% success rate
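
A quick sketch of how the new display-name entry could be verified against the config. The top-level `models` mapping and `display_name` key are assumed here for illustration; the actual schema of benchmark_config.yaml is not confirmed.

```python
import yaml  # PyYAML

# Assumed schema: a top-level "models" mapping keyed by model ID, each entry
# carrying a "display_name" field. This is an illustrative assumption about
# benchmark_config.yaml, not the confirmed structure.
with open("benchmark_config.yaml") as f:
    config = yaml.safe_load(f)

entry = config.get("models", {}).get("anthropic/claude-opus-4.1", {})
assert entry.get("display_name"), "display name missing for anthropic/claude-opus-4.1"
print(f"Display name: {entry['display_name']}")
```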

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/883eaa18-ea7c-4265-9ee0-55c94e92f323

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
biobootloader merged commit e7bcc51 into main on Aug 13, 2025
1 check passed