mentatbot (bot) commented on Sep 29, 2025

This PR adds benchmark results for anthropic/claude-4.5-sonnet on all 200 cases of the locodiff-250425 benchmark set.

Results Summary

  • Total cases: 200/200 (100% attempted)
  • ✅ Successful: 157/200 (78.5%)
  • ❌ Failed: 43/200 (21.5% - all output mismatches)
  • ⚠️ API Errors: 0/200 (0%)
  • 💰 Total cost: $47.65
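
As a quick sanity check on the summary, the percentages and a rough per-case cost can be recomputed from the raw counts. This is only a sketch: the variable and function names are illustrative and do not come from the benchmark code, and the per-case cost is a plain average that ignores how cost varies between cases.

```python
# Figures copied from the Results Summary above.
TOTAL_CASES = 200
SUCCESSFUL = 157
FAILED = 43
API_ERRORS = 0
TOTAL_COST_USD = 47.65


def rate(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Every case should land in exactly one of the three buckets.
accounted_for = SUCCESSFUL + FAILED + API_ERRORS

success_rate = rate(SUCCESSFUL, TOTAL_CASES)   # 78.5
failure_rate = rate(FAILED, TOTAL_CASES)       # 21.5

# Average spend, derived here and not reported in the PR itself.
cost_per_case = TOTAL_COST_USD / TOTAL_CASES
cost_per_success = TOTAL_COST_USD / SUCCESSFUL

print(f"accounted for: {accounted_for}/{TOTAL_CASES}")
print(f"success: {success_rate}%  failure: {failure_rate}%")
print(f"average cost per case: ${cost_per_case:.2f}")
print(f"average cost per successful case: ${cost_per_success:.2f}")
```

The three buckets sum to 200, so the reported rates are consistent with the counts.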

Benchmark Configuration

  • Model: anthropic/claude-4.5-sonnet
  • Concurrency: 20
  • Benchmark directory: locodiff-250425
  • Test cases: All 200 cases from the benchmark set
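
The PR does not show how the harness enforces its concurrency limit. One common way to cap in-flight requests at 20 is an `asyncio.Semaphore`, sketched below. Everything here is an assumption for illustration: `run_case` and `run_benchmark` are hypothetical names, and the `asyncio.sleep(0)` call merely stands in for a real model API request.

```python
import asyncio

CONCURRENCY = 20   # from the Benchmark Configuration above
NUM_CASES = 200    # from the Benchmark Configuration above


async def run_case(case_id: int, semaphore: asyncio.Semaphore, state: dict) -> int:
    """Hypothetical per-case worker; the sleep stands in for the API call."""
    async with semaphore:  # at most CONCURRENCY workers pass this point at once
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0)  # placeholder for the real request
        state["in_flight"] -= 1
        return case_id


async def run_benchmark(num_cases: int = NUM_CASES, concurrency: int = CONCURRENCY):
    semaphore = asyncio.Semaphore(concurrency)
    state = {"in_flight": 0, "peak": 0}
    # gather() returns results in the order the coroutines were passed in.
    results = await asyncio.gather(
        *(run_case(i, semaphore, state) for i in range(num_cases))
    )
    return results, state["peak"]


results, peak = asyncio.run(run_benchmark())
print(f"completed {len(results)} cases, peak in-flight: {peak}")
```

The semaphore guarantees that the peak number of simultaneous workers never exceeds the configured limit, regardless of how many cases are queued.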

Performance Analysis

The model achieves a 78.5% success rate on this benchmark, with all failures being output mismatches. No API errors or empty outputs remain in the final results.

Execution Notes

This benchmark run required 3 iterations to complete:

  1. Initial run: 142 successful, 17 API errors (temporary model availability issue)
  2. Second run: 14 successful, 1 API error (transient JSONDecodeError)
  3. Third run: 1 successful, 0 API errors

All API errors were resolved by retrying, so none remain in the final results.
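
The per-iteration counts can be reconciled with the final totals. The sketch below assumes that each rerun attempted only the cases that hit an API error in the previous run; the PR does not state this explicitly, and the per-run mismatch counts it derives are inferred rather than reported.

```python
# Per-iteration counts copied from the Execution Notes above.
runs = [
    {"successful": 142, "api_errors": 17},
    {"successful": 14, "api_errors": 1},
    {"successful": 1, "api_errors": 0},
]
TOTAL_CASES = 200
FINAL_SUCCESSFUL = 157
FINAL_FAILED = 43

total_successful = sum(run["successful"] for run in runs)

# Assumption: run 1 attempts all cases, and each later run attempts
# exactly the cases that ended in an API error in the run before it.
attempted = TOTAL_CASES
mismatches = []
for run in runs:
    # Whatever was attempted but neither succeeded nor errored
    # must have been an output mismatch.
    mismatches.append(attempted - run["successful"] - run["api_errors"])
    attempted = run["api_errors"]

print(f"successes across runs: {total_successful} (reported: {FINAL_SUCCESSFUL})")
print(f"inferred mismatches per run: {mismatches}")
print(f"inferred total mismatches: {sum(mismatches)} (reported: {FINAL_FAILED})")
```

Under that assumption the numbers line up: the successes across the three runs sum to 157, and the inferred mismatches (41 in the first run, 2 among the retried cases) sum to the reported 43 failures.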


🤖 This PR was created with Mentat. See my steps and cost here

Completed comprehensive benchmark run for anthropic/claude-4.5-sonnet on the locodiff-250425 benchmark set.

## Results Summary

- **Total cases**: 200/200 (100% attempted)
- **✅ Successful**: 157/200 (78.5%)
- **❌ Failed**: 43/200 (21.5% - all output mismatches)
- **⚠️ API Errors**: 0/200 (0%)
- **💰 Total cost**: $47.65

## Benchmark Details

The benchmark was run with:
- Concurrency: 20
- Benchmark directory: locodiff-250425
- All 200 test cases from the benchmark set

The model achieved a 78.5% success rate, with all failures being output mismatches (no API errors or empty outputs in the final results).

## Notes

This benchmark run required 3 iterations:
1. Initial run: 142 successful, 17 API errors (model availability issue)
2. Second run: 14 successful, 1 API error (transient JSONDecodeError)
3. Third run: 1 successful, 0 API errors

All API errors were successfully resolved through retries.

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/3e467591-0742-472f-ae6d-47b9f356add2

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
mentatbot (bot) requested a review from biobootloader on September 29, 2025, 18:08
- Added display name for anthropic/claude-4.5-sonnet in benchmark_config.yaml
- Generated visualization pages for all models including the new Sonnet 4.5 results
- Updated docs/index.html with latest benchmark data
- Created 200 case pages for Sonnet 4.5 model
- Updated chart data and styling

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/92564c0c-4e97-4d07-81a4-4113ef5127da

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
biobootloader merged commit 176a24b into main on Sep 29, 2025
1 check passed