5 changes: 0 additions & 5 deletions .claude/commands/draft-release-description.md

This file was deleted.

19 changes: 19 additions & 0 deletions .claude/commands/prepare-release.md
@@ -0,0 +1,19 @@
I am about to release a new version of this package. Please take the following steps to make sure it is successful:

Understand changes
1. Look at my currently staged changes to identify what changes were made.

Checks
1. Make sure the pyproject.toml was updated with a new version number.
2. Ensure there are no spelling or grammar mistakes.
3. Run all formatting, linting, and type checking: `make check`
4. Run `uv build` to make sure the package builds correctly.
5. If any of these checks fail, please stop and inform me about the issues so we can fix them before proceeding.

Draft release notes
1. Look at the previous release logs at https://github.com/microsoft/eval-recipes/releases. Your draft release MUST follow the same style and structure.
2. Create a draft release description based on the recent code changes and place it in `media/draft_release_{version}.md`.
3. At the end of the release notes, be sure to include:
```
**Full Changelog**: https://github.com/microsoft/eval-recipes/compare/v0.x1.y1...v0.x2.y2
```
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -9,4 +9,5 @@
},
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"markdown.extension.orderedList.marker": "one"
}
78 changes: 50 additions & 28 deletions README.md
@@ -1,38 +1,69 @@
# Eval Recipes
<h1 align="center">
Eval Recipes
</h1>
<p align="center">
  Evaluate AI agents with benchmarking harnesses and online evaluation recipes.
</p>
<p align="center">
<a href="https://github.com/astral-sh/uv"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json" alt="uv"></a>
<a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</p>

Eval Recipes is a library dedicated to making it easier to keep up with the state of the art in evaluating AI agents.
It currently has two main components: a **benchmarking** harness for evaluating CLI agents (GitHub Copilot CLI, Claude Code, etc.) on real-world tasks via containers, and an **online evaluation** framework for LLM chat assistants.
The common thread between these components is the concept of [recipes](https://sundaylettersfromsam.substack.com/p/what-is-an-ai-recipe)
which are a mix of code and LLM calls to achieve a desired tradeoff between flexibility and quality.


# Benchmarking
## Installation

Eval Recipes provides a benchmarking harness for evaluating AI agents on real-world tasks in isolated Docker containers.
We have a few sample tasks ranging from creating CLI applications to automations. Agents are automatically scored based on deterministic and semantic tests using a specialized auditing agent.
```bash
# Benchmarking requires certain prerequisites; see the full documentation for more details.
# With uv (add to project dependencies, pinned to a release tag)
uv add "eval-recipes @ git+https://github.com/microsoft/eval-recipes@v0.29"

Additional features include agent continuation (automatically providing follow-up prompts when needed), multi-trial evaluation for consistency measurement, and reporting with HTML dashboards.
# With pip
pip install "git+https://github.com/microsoft/eval-recipes@v0.29"
```
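
To sanity-check the install, you can try importing the package (a quick sketch; it assumes the top-level module is importable as `eval_recipes`, as in the usage example below):

```bash
# Quick check that the package is importable (prefix with `uv run` if installed via uv).
python -c "import eval_recipes; print('eval_recipes imported OK')"
```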

> [!WARNING]
> This library is very early and everything is subject to change. Consider pinning the dependency to a specific tag or commit, for example: `uv pip install "git+https://github.com/microsoft/eval-recipes@v0.0.20"`

## Run Benchmarks Quickly

Check [BENCHMARKING.md](./docs/BENCHMARKING.md); currently, running benchmarks requires some additional setup.
# Benchmarking

## Running Benchmarks
Eval Recipes provides a benchmarking harness for evaluating AI agents on real-world tasks in isolated Docker containers. It supports score-based, comparison-based, and third-party benchmarks.
We include tasks ranging from CLI applications to automations. Agents are automatically scored based on deterministic and semantic tests using a specialized auditing agent.
Additional features include agent continuation (automatically providing follow-up prompts when needed), multi-trial evaluation for consistency measurement, and reporting with HTML dashboards.

```bash
# The default agents/tasks require these environment variables. Check the agent definitions for others.
export ANTHROPIC_API_KEY=your_anthropic_key
export OPENAI_API_KEY=your_openai_key
## Usage

uv run scripts/run_benchmarks.py --num-trials 2
1. Create agent definition(s). Examples are provided in [data/agents](./data/agents).
1. Create task definition(s). Examples are provided in [data/tasks](./data/tasks).
1. Create a run configuration. Examples are provided in [data/eval-setups](./data/eval-setups).

# Get more info about available arguments
uv run scripts/run_benchmarks.py --help
```python
import asyncio
from pathlib import Path

import yaml

from eval_recipes.benchmarking.harness import Harness
from eval_recipes.benchmarking.schemas import ScoreRunSpec

# Load the run configuration (examples are provided in data/eval-setups).
with Path("score-default.yaml").open(encoding="utf-8") as f:
    run_definition = ScoreRunSpec(**yaml.safe_load(f))

# Point the harness at the agent and task definitions, then run all trials.
harness = Harness(
    agents_dir=Path("data/agents"),
    tasks_dir=Path("data/tasks"),
    run_definition=run_definition,
)
asyncio.run(harness.run())
```

Results are saved to timestamped directories in `data/benchmarking/runs/` containing agent logs, test outputs, timing data, and structured results.
Any of these files may contain secrets that were used during the evaluation run. **NEVER** commit these files to source control without first checking for secrets.
For detailed documentation on creating custom agents and tasks, see [BENCHMARKING.md](./docs/BENCHMARKING.md).
See [docs/BENCHMARKING.md](./docs/BENCHMARKING.md) for full details, including installation prerequisites.
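
Before committing any run output (see the warning above), a quick scan for common key patterns can help catch accidental leaks. This is a rough illustration only; the patterns below are assumptions and not exhaustive:

```bash
# Illustrative only: grep run artifacts for common secret patterns before committing.
grep -rInE "sk-(ant-)?[A-Za-z0-9_-]{20,}|api[_-]?key|secret" data/benchmarking/runs/ || echo "no obvious secrets found"
```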


---


# Online Evaluations
@@ -55,15 +86,6 @@ uv run marimo edit demos/1_evaluate.py
# Select Y to run in a sandboxed venv
```

### 3. Start using the package

```bash
uv pip install "git+https://github.com/microsoft/eval-recipes"
```

> [!WARNING]
> This library is very early and everything is subject to change. Consider pinning the dependency to a specific tag or commit, for example: `uv pip install "git+https://github.com/microsoft/eval-recipes@v0.0.20"`


## High Level API

132 changes: 132 additions & 0 deletions data/eval-setups/score-default.yaml
@@ -0,0 +1,132 @@
type: score
definitions:
- agent: amplifier_v1
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: amplifier_foundation
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: claude_code
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: gh_cli
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender
- agent: openai_codex
trials: 2
tasks:
- task: arxiv_conclusion_extraction
- task: arxiv_paper_summarizer
- task: code-discrepancy-docs-knack
- task: code-discrepancy-docstrings-grasp
- task: code-discrepancy-tutorials-grasp
- task: cpsc_recall_monitor
- task: cross_repo_improvement_tool
- task: email_drafting
- task: frontier-science-079657b3-e215-4944-8a67-8bb5347e4f15
- task: frontier-science-85b4f862-d881-4a79-8c5d-3e927b486b71
- task: frontier-science-cb11faa6-e12a-4621-9fa7-f6f4b11a9300
- task: frontier-science-eec8840a-2d00-4e70-b043-0da51bd1b288
- task: frontier-science-f3ba1aae-2fc3-4d9b-a5a3-42bb91de4d7d
- task: gdpval_extraction
- task: github_docs_extractor
- task: image_tagging
- task: linkedin_drafting
- task: markdown_deck_converter
- task: news_research_tool
- task: pdf-hr-q4
- task: product_review_finder
- task: repo_embedding_server
- task: style_blender