fix: create isolated runtime per eval execution #1066

Chibionos · 2026-01-05T23:59:11Z

Summary

- Create a new runtime with a unique runtime_id for each eval execution
- This ensures each eval has its own LangGraph thread_id with clean state
- Prevents message accumulation across sequential eval runs

Problem

Previously, all evals shared a single runtime with the same thread_id, so the LangGraph checkpointer persisted and accumulated messages across eval runs. This led to 400 Bad Request errors on sequential eval executions.

Root Cause Analysis

Regression Source

This bug was introduced in PR #1055 (akshaya/single_eval_runtime) by @akshaylive, merged on Dec 30, 2025.

| Field | Value |
| --- | --- |
| PR | #1055 |
| Commit | `e51d942` |
| Title | "refactor(EvalRuntimeInstance): evaluate specific runtime instance" |
| Rationale | "Doing this will avoid creation of temporary runtimes everywhere" |

What Changed

Before PR #1055 - Each evaluation created its own runtime with a unique runtime_id:

# In execute_runtime() - called per eval
runtime = await self.factory.new_runtime(
    entrypoint=self.context.entrypoint or "",
    runtime_id=execution_id,  # Unique per eval
)

After PR #1055 - A single runtime was created and shared across all evaluations:

# In execute() - called once for all evals  
runtime = await self.factory.new_runtime(
    entrypoint=self.context.entrypoint or "",
    runtime_id=self.execution_id,  # Same for ALL evals
)

Why This Breaks Sequential Evals

When using LangGraph agents with SQLite checkpointer:

1. runtime_id maps to thread_id for LangGraph conversation state
2. All evals shared the same thread_id
3. First eval runs fine, stores conversation state in checkpointer
4. Second eval tries to continue from corrupted/unexpected state → 400 error
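
The accumulation described above can be illustrated with a toy model. This is a minimal sketch, not LangGraph code: `ToyCheckpointer` is a hypothetical in-memory stand-in that only mimics the one property that matters here, namely that history is keyed by thread_id.

```python
from collections import defaultdict


class ToyCheckpointer:
    """Hypothetical in-memory stand-in: history is keyed by thread_id."""

    def __init__(self) -> None:
        self._threads: dict[str, list[str]] = defaultdict(list)

    def run(self, thread_id: str, message: str) -> list[str]:
        # Append the new message to whatever the thread already holds.
        self._threads[thread_id].append(message)
        return list(self._threads[thread_id])


checkpointer = ToyCheckpointer()

# Shared thread_id: the second eval starts from the first eval's messages.
shared_first = checkpointer.run("shared-id", "eval 1 input")
shared_second = checkpointer.run("shared-id", "eval 2 input")

# Unique thread_id per eval: each eval starts from an empty history.
isolated_first = checkpointer.run("eval-1", "eval 1 input")
isolated_second = checkpointer.run("eval-2", "eval 2 input")

print(len(shared_second), len(isolated_second))  # prints "2 1"
```

With the shared id, the second call sees both messages; with per-eval ids, it sees only its own.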

Why Testing Missed It

The original PR was tested with the calculator agent, which doesn't use LangGraph with persistent state. The issue only manifests with stateful LangGraph agents, where conversation history accumulates across the shared thread.

Solution

Each eval execution now gets its own runtime with a unique runtime_id (using execution_id). The runtime is properly disposed after the eval completes.
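
A minimal sketch of the per-eval lifecycle described here, under stated assumptions: `FakeRuntime`, `FakeFactory`, and `run_eval` are hypothetical stand-ins, not the classes in `_runtime.py`, and generating the id with `uuid.uuid4()` is an assumption about how a unique execution_id might be produced. Only the shape of the `new_runtime(entrypoint=..., runtime_id=...)` call is taken from the snippets above.

```python
import asyncio
import uuid


class FakeRuntime:
    """Hypothetical runtime; records its id and whether it was disposed."""

    def __init__(self, runtime_id: str) -> None:
        self.runtime_id = runtime_id
        self.disposed = False

    async def execute(self, eval_input: str) -> str:
        return f"{self.runtime_id}:{eval_input}"

    async def dispose(self) -> None:
        self.disposed = True


class FakeFactory:
    """Hypothetical factory mirroring the new_runtime call shape."""

    def __init__(self) -> None:
        self.created: list[FakeRuntime] = []

    async def new_runtime(self, entrypoint: str, runtime_id: str) -> FakeRuntime:
        runtime = FakeRuntime(runtime_id)
        self.created.append(runtime)
        return runtime


async def run_eval(factory: FakeFactory, eval_input: str) -> str:
    execution_id = str(uuid.uuid4())  # assumption: one fresh id per eval
    runtime = await factory.new_runtime(entrypoint="", runtime_id=execution_id)
    try:
        return await runtime.execute(eval_input)
    finally:
        # Dispose the per-eval runtime whether or not the eval succeeded.
        await runtime.dispose()


async def main() -> FakeFactory:
    factory = FakeFactory()
    for eval_input in ["a", "b", "c"]:
        await run_eval(factory, eval_input)
    return factory


factory = asyncio.run(main())
runtime_ids = {runtime.runtime_id for runtime in factory.created}
```

Three sequential evals produce three runtimes with three distinct ids, and each one is disposed before the next eval starts.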

Test plan

- Run evaluations with multiple sequential evals (e.g., calculator_same_as_agent with 5 evals)
- Verify all evals run with isolated state
- Verify linting passes

🤖 Generated with Claude Code

Each eval execution now gets its own runtime with a unique runtime_id (using execution_id). This ensures each eval has its own LangGraph thread_id with clean state, preventing message accumulation across sequential eval runs. Previously, all evals shared a single runtime with the same thread_id, causing the LangGraph checkpointer to persist and accumulate messages across eval runs, leading to 400 bad request errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/uipath/_cli/_evals/_runtime.py

saksharthakkar

left a comment, other than that... lgtm!

If new_runtime fails, eval_runtime would be unassigned and the finally block would raise NameError when trying to dispose. Initialize to None and check before disposing.
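
The guard described in this commit can be sketched as follows. This is an illustration under assumptions, not the code in `_runtime.py`: `FailingFactory` and `run_eval_safely` are hypothetical, and the `except` clause exists only so the demo returns a value instead of propagating the error.

```python
import asyncio


class FailingFactory:
    """Hypothetical factory whose new_runtime always fails."""

    async def new_runtime(self, entrypoint: str, runtime_id: str):
        raise RuntimeError("runtime creation failed")


async def run_eval_safely(factory, execution_id: str) -> str:
    eval_runtime = None  # bound before try, so finally can always read it
    try:
        eval_runtime = await factory.new_runtime(
            entrypoint="", runtime_id=execution_id
        )
        return "ok"
    except RuntimeError as exc:
        # Demo only: surface the failure as a string.
        return f"error: {exc}"
    finally:
        # Dispose only if creation actually succeeded.
        if eval_runtime is not None:
            await eval_runtime.dispose()


result = asyncio.run(run_eval_safely(FailingFactory(), "eval-1"))
```

Without the `eval_runtime = None` line, the `finally` block would reference an unbound local and mask the original creation error.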

akshaylive

Thanks for the fix!

akshaylive · 2026-01-06T02:22:20Z

One minor comment: can you please rename to eval_execution_id and runtime_execution_id for clarity? I think the lack of distinction is what led to the bug to begin with.

github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (triggers tests in the uipath-llamaindex-python repository) labels Jan 5, 2026

Chibionos force-pushed the fix/eval-sequential-runtime-isolation branch from 9c4fa08 to cc1896b January 6, 2026 00:05

saksharthakkar reviewed Jan 6, 2026

src/uipath/_cli/_evals/_runtime.py (outdated, resolved)

saksharthakkar approved these changes Jan 6, 2026

chore: bump version to 2.4.3

5396239

mathurk approved these changes Jan 6, 2026

akshaylive approved these changes Jan 6, 2026

chore: update uv.lock after version bump

bedb6d9

Chibionos merged commit 2a557c8 into main Jan 6, 2026
114 of 115 checks passed

Chibionos deleted the fix/eval-sequential-runtime-isolation branch January 6, 2026 02:57
