5 changes: 0 additions & 5 deletions .claude/commands/draft-release-description.md

This file was deleted.

19 changes: 19 additions & 0 deletions .claude/commands/prepare-release.md
@@ -0,0 +1,19 @@
I am about to release a new version of this package. Please take the following steps to make sure it is successful:

Understand changes
1. Look at my currently staged changes to identify what changes were made.

Checks
1. Make sure the pyproject.toml was updated with a new version number.
2. Ensure there are no spelling or grammar mistakes.
3. Run all formatting, linting, and type checking: `make check`
4. Run `uv build` to make sure the package builds correctly.
5. If any of these checks fail, please stop and inform me about the issues so we can fix them before proceeding.

Draft release notes
1. Look at the previous release logs at https://github.com/microsoft/eval-recipes/releases. Your draft release MUST follow the same style and structure.
2. Create a draft release description based on the recent code changes and place it in `media/draft_release_{version}.md`.
3. At the end of the release notes, be sure to include:
```
**Full Changelog**: https://github.com/microsoft/eval-recipes/compare/v0.x1.y1...v0.x2.y2
```
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -9,4 +9,5 @@
},
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"markdown.extension.orderedList.marker": "one"
}
78 changes: 50 additions & 28 deletions README.md
@@ -1,38 +1,69 @@
# Eval Recipes
<h1 align="center">
Eval Recipes
</h1>
<p align="center">
  Evaluate AI agents with benchmarking harnesses and online evaluation recipes.
</p>
<p align="center">
<a href="https://github.com/astral-sh/uv"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json" alt="uv"></a>
<a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</p>

Eval Recipes is a library dedicated to making it easier to keep up with the state of the art in evaluating AI agents.
It currently has two main components: a **benchmarking** harness for evaluating CLI agents (GitHub Copilot CLI, Claude Code, etc.) on real-world tasks via containers, and an **online evaluation** framework for LLM chat assistants.
The common thread between these components is the concept of [recipes](https://sundaylettersfromsam.substack.com/p/what-is-an-ai-recipe)
which are a mix of code and LLM calls to achieve a desired tradeoff between flexibility and quality.


# Benchmarking
## Installation

Eval Recipes provides a benchmarking harness for evaluating AI agents on real-world tasks in isolated Docker containers.
We have a few sample tasks ranging from creating CLI applications to automations. Agents are automatically scored based on deterministic and semantic tests using a specialized auditing agent.
```bash
# Benchmarking requires certain prerequisites; see the full documentation for more details.
# With uv (add to project dependencies, pinned to a release tag)
uv add "eval-recipes @ git+https://github.com/microsoft/eval-recipes@v0.29"

Additional features include agent continuation (automatically providing follow-up prompts when needed), multi-trial evaluation for consistency measurement, and reporting with HTML dashboards.
# With pip
pip install "git+https://github.com/microsoft/eval-recipes@v0.29"
```
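
To sanity-check the install, you can try importing the package (a quick sketch; it assumes the top-level module is importable as `eval_recipes`, as in the usage example below):

```bash
# Quick check that the package is importable (prefix with `uv run` if installed via uv).
python -c "import eval_recipes; print('eval_recipes imported OK')"
```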

> [!WARNING]
> This library is very early and everything is subject to change. Consider pinning the dependency to a specific tag or commit, for example: `uv pip install "git+https://github.com/microsoft/eval-recipes@v0.0.20"`

## Run Benchmarks Quickly

Check [BENCHMARKING.md](./docs/BENCHMARKING.md); currently, running benchmarks requires some additional setup.
# Benchmarking

## Running Benchmarks
Eval Recipes provides a benchmarking harness for evaluating AI agents on real-world tasks in isolated Docker containers. It supports score-based, comparison-based, and third-party benchmarks.
We include tasks ranging from CLI applications to automations. Agents are automatically scored based on deterministic and semantic tests using a specialized auditing agent.
Additional features include agent continuation (automatically providing follow-up prompts when needed), multi-trial evaluation for consistency measurement, and reporting with HTML dashboards.

```bash
# The default agents/tasks require these environment variables. Check the agent definitions for others.
export ANTHROPIC_API_KEY=your_anthropic_key
export OPENAI_API_KEY=your_openai_key
## Usage

uv run scripts/run_benchmarks.py --num-trials 2
1. Create agent definition(s). Examples are provided in [data/agents](./data/agents).
1. Create task definition(s). Examples are provided in [data/tasks](./data/tasks).
1. Create a run configuration. Examples are provided in [data/eval-setups](./data/eval-setups).

# Get more info about available arguments
uv run scripts/run_benchmarks.py --help
```python
import asyncio
from pathlib import Path

import yaml

from eval_recipes.benchmarking.harness import Harness
from eval_recipes.benchmarking.schemas import ScoreRunSpec

# Load the run configuration (examples are provided in data/eval-setups).
with Path("score-default.yaml").open(encoding="utf-8") as f:
    run_definition = ScoreRunSpec(**yaml.safe_load(f))

# Point the harness at the agent and task definitions, then run all trials.
harness = Harness(
    agents_dir=Path("data/agents"),
    tasks_dir=Path("data/tasks"),
    run_definition=run_definition,
)
asyncio.run(harness.run())
```

Results are saved to timestamped directories in `data/benchmarking/runs/` containing agent logs, test outputs, timing data, and structured results.
Any of these files may contain secrets that were used during the evaluation run. **NEVER** commit these files to source control without first checking for secrets.
For detailed documentation on creating custom agents and tasks, see [BENCHMARKING.md](./docs/BENCHMARKING.md).
See [docs/BENCHMARKING.md](./docs/BENCHMARKING.md) for full details, including installation prerequisites.
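
Before committing any run output (see the warning above), a quick scan for common key patterns can help catch accidental leaks. This is a rough illustration only; the patterns below are assumptions and not exhaustive:

```bash
# Illustrative only: grep run artifacts for common secret patterns before committing.
grep -rInE "sk-(ant-)?[A-Za-z0-9_-]{20,}|api[_-]?key|secret" data/benchmarking/runs/ || echo "no obvious secrets found"
```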


---


# Online Evaluations
@@ -55,15 +86,6 @@ uv run marimo edit demos/1_evaluate.py
# Select Y to run in a sandboxed venv
```

### 3. Start using the package

```bash
uv pip install "git+https://github.com/microsoft/eval-recipes"
```

> [!WARNING]
> This library is very early and everything is subject to change. Consider pinning the dependency to a specific tag or commit, for example: `uv pip install "git+https://github.com/microsoft/eval-recipes@v0.0.20"`


## High Level API

132 changes: 132 additions & 0 deletions data/eval-setups/score-default.yaml
@@ -0,0 +1,132 @@
type: score
definitions:
- agent: amplifier_v1
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: amplifier_foundation
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: claude_code
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: gh_cli
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: openai_codex
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender