@mentatbot mentatbot bot commented Aug 22, 2025

This PR adds benchmark results for the deepseek/deepseek-chat-v3.1 model on the LoCoDiff-250425 benchmark suite.

Benchmark Summary

  • Model: deepseek/deepseek-chat-v3.1
  • Total benchmarks: 200 test cases
  • Success rate: 26.5% (53/200 successful)
  • Failed cases: 143 (output mismatch)
  • API errors: 4 (handled automatically)
  • Total cost: $3.39
  • Concurrency: 20 parallel requests
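
The summary figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers listed above; the variable and function names are illustrative and not part of the benchmark pipeline.

```python
# Figures copied from the benchmark summary above.
total_cases = 200
successes = 53
mismatches = 143
api_errors = 4
total_cost_usd = 3.39


def summarize(total, ok, cost):
    """Return (success rate in percent, average cost per case in USD)."""
    return 100.0 * ok / total, cost / total


# The three outcome counts should account for every test case.
assert successes + mismatches + api_errors == total_cases

rate, avg_cost = summarize(total_cases, successes, total_cost_usd)
print(f"success rate: {rate:.1f}%")
print(f"average cost: ${avg_cost:.3f} per case")
```

The success rate here is taken over all 200 cases, including the 4 that ended in API errors.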

Results Structure

The benchmark results are organized in the standard directory structure:

locodiff-250425/results/[test_case]/deepseek_deepseek-chat-v3.1/[timestamp]/
├── metadata.json      # Run metadata, costs, and success status
├── raw_response.txt   # Complete model response
├── extracted_output.txt # Code extracted from response
└── output.diff       # Diff between expected and actual output
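
One way to read these results back for analysis is to walk the directory layout shown above. The sketch below is an assumption-laden illustration, not the pipeline's own loader: `collect_runs` is a hypothetical helper, and it assumes only that `metadata.json` parses as JSON, without relying on any particular field names.

```python
import json
from pathlib import Path


def collect_runs(results_root, model_dir):
    """Yield (test_case, timestamp, metadata) for each run directory found.

    Assumes the layout
        results_root/<test_case>/<model_dir>/<timestamp>/metadata.json
    and that metadata.json is valid JSON. No field names are assumed.
    """
    root = Path(results_root)
    pattern = f"*/{model_dir}/*/metadata.json"
    for meta_path in sorted(root.glob(pattern)):
        timestamp_dir = meta_path.parent
        # Two levels up from the timestamp directory is the test case.
        test_case = timestamp_dir.parent.parent.name
        with meta_path.open(encoding="utf-8") as fh:
            metadata = json.load(fh)
        yield test_case, timestamp_dir.name, metadata
```

Called as `collect_runs("locodiff-250425/results", "deepseek_deepseek-chat-v3.1")`, it would yield one tuple per recorded run for this model, including reruns of the same test case under different timestamps.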

Performance Analysis

The model achieved a 26.5% success rate on this challenging code reconstruction benchmark, with an average cost of approximately $0.017 per test case. The benchmark covers various programming languages and repositories including React, Ghostty, Qdrant, Tldraw, and Aider.

These results can be used for model comparison and analysis using the visualization tools in the benchmark pipeline.


🤖 This PR was created with Mentat. See my steps and cost here

- Ran 200 benchmark cases with concurrency 20
- Achieved 53 successful results (26.5% success rate)
- Total cost: $3.39
- Results saved in locodiff-250425/results/ directory structure

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/8e4e5d2f-4e96-4e73-9380-c564a1816210

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
@mentatbot mentatbot bot requested a review from biobootloader August 22, 2025 18:10
mentatbot bot and others added 4 commits August 22, 2025 18:40
- Reran benchmark to handle API errors from initial run
- Successfully recovered 3/4 cases that had API errors
- Added 1 new successful case, 2 new failed cases
- 1 case still has persistent API error (JSON decode issue)
- Final status: 199/200 completed (99.5%), 54 successful (27.1% of completed cases)
- Total cost: $3.53
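
The post-rerun numbers follow from the initial counts (53 successful, 143 output mismatches, 4 API errors) plus the three recovered cases. A minimal sketch of that bookkeeping, with illustrative names that are not taken from the repository:

```python
# Outcome counts from the initial run.
before = {"success": 53, "mismatch": 143, "api_error": 4}
# Outcomes of the cases recovered by the rerun (3 of the 4 API errors).
recovered = {"success": 1, "mismatch": 2}

after = dict(before)
for outcome, n in recovered.items():
    after[outcome] += n        # recovered case lands in its real outcome
    after["api_error"] -= n    # and no longer counts as an API error

total = sum(after.values())
completed = after["success"] + after["mismatch"]
completion_pct = 100.0 * completed / total
# This rate is taken over completed cases only, not all 200.
success_pct = 100.0 * after["success"] / completed

print(after)
print(f"completed: {completed}/{total} ({completion_pct:.1f}%)")
print(f"success rate over completed cases: {success_pct:.1f}%")
```

The 27.1% figure uses the 199 completed cases as its denominator, which is why it differs slightly from a rate computed over all 200 cases.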

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/609ae569-6826-4e80-a193-64fa2909f774

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Successfully completed the last remaining benchmark case on second retry
- qdrant_lib_segment_tests_integration_payload_index_test.rs now shows legitimate failure (output mismatch)
- Final status: 200/200 completed (100%), 54 successful (27% success rate)
- Total cost: $3.58
- All API errors resolved

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/e70dbbff-c65b-48b7-a0ae-e5f8a2f41ae1

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Combined uv installation and setup into single step
- Export PATH in each step to ensure uv is available
- This should resolve the "uv: command not found" error in CI

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/0fcc53d6-6612-44d9-83d5-da76d5548d04

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
- Added "deepseek/deepseek-chat-v3.1": "DeepSeek Chat v3.1" to benchmark_config.yaml
- Generated complete visualization pages for all 28 models including DeepSeek Chat v3.1
- Updated docs/ directory with latest benchmark results and visualizations
- All 200 case pages generated successfully for the new model

Mentat precommit script passed. Log: https://mentat.ai/gh/AbanteAI/LoCoDiff-bench/log/fd1ba057-8078-4575-ba6a-07dbe60a9eaf

Co-authored-by: biobootloader <128252497+biobootloader@users.noreply.github.com>
@biobootloader biobootloader merged commit 1c49108 into main Aug 22, 2025
1 check passed