@Chibionos Chibionos commented Jan 5, 2026

Summary

  • Create a new runtime with unique runtime_id for each eval execution
  • This ensures each eval has its own LangGraph thread_id with clean state
  • Prevents message accumulation across sequential eval runs

Problem

Previously, all evals shared a single runtime with the same thread_id, causing the LangGraph checkpointer to persist and accumulate messages across eval runs. This led to 400 bad request errors on sequential eval executions.

Root Cause Analysis

Regression Source

This bug was introduced in PR #1055 (akshaya/single_eval_runtime) by @akshaylive, merged on Dec 30, 2025.

| Field | Value |
| --- | --- |
| PR | #1055 |
| Commit | e51d942 |
| Title | "refactor(EvalRuntimeInstance): evaluate specific runtime instance" |
| Rationale | "Doing this will avoid creation of temporary runtimes everywhere" |

What Changed

Before PR #1055 - Each evaluation created its own runtime with a unique runtime_id:

# In execute_runtime() - called per eval
runtime = await self.factory.new_runtime(
    entrypoint=self.context.entrypoint or "",
    runtime_id=execution_id,  # Unique per eval
)

After PR #1055 - A single runtime was created and shared across all evaluations:

# In execute() - called once for all evals  
runtime = await self.factory.new_runtime(
    entrypoint=self.context.entrypoint or "",
    runtime_id=self.execution_id,  # Same for ALL evals
)

Why This Breaks Sequential Evals

When using LangGraph agents with SQLite checkpointer:

  1. runtime_id maps to thread_id for LangGraph conversation state
  2. All evals shared the same thread_id
  3. First eval runs fine, stores conversation state in checkpointer
  4. Second eval tries to continue from corrupted/unexpected state → 400 error
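
The accumulation described in the steps above can be reproduced with a toy model. This is only a sketch: `ToyCheckpointer` and `run_evals` are hypothetical names invented for illustration, and the in-memory dict stands in for the SQLite checkpointer rather than using LangGraph's actual API. It shows how many messages each eval sees when evals share one thread_id versus when each gets its own.

```python
from collections import defaultdict


class ToyCheckpointer:
    """In-memory stand-in for a persistent checkpointer (hypothetical, not LangGraph's API)."""

    def __init__(self) -> None:
        # Message history is keyed by thread_id, so state survives between runs.
        self._threads: dict[str, list[str]] = defaultdict(list)

    def run(self, thread_id: str, message: str) -> list[str]:
        """Append a message to the thread and return the history the agent would see."""
        self._threads[thread_id].append(message)
        return list(self._threads[thread_id])


def run_evals(thread_ids: list[str]) -> list[int]:
    """Run one eval per thread_id and report how many messages each eval saw."""
    checkpointer = ToyCheckpointer()
    return [
        len(checkpointer.run(thread_id, f"input for eval {i}"))
        for i, thread_id in enumerate(thread_ids)
    ]


# Shared runtime_id: every eval lands on the same thread, so history grows.
shared = run_evals(["exec-1", "exec-1", "exec-1"])
# Unique runtime_id per eval: each eval starts from an empty thread.
isolated = run_evals(["eval-a", "eval-b", "eval-c"])
print(shared)    # [1, 2, 3]
print(isolated)  # [1, 1, 1]
```

In the shared case the second and third evals inherit messages they never sent, which is the kind of unexpected state the real agent rejected.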

Why Testing Missed It

The original PR was tested with the calculator agent, which doesn't use LangGraph with persistent state. The issue only manifests with stateful LangGraph agents, where conversation history accumulates across the shared thread.

Solution

Each eval execution now gets its own runtime with a unique runtime_id (using execution_id). The runtime is properly disposed after the eval completes.
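
A minimal sketch of the per-eval pattern, under stated assumptions: `FakeRuntime`, `FakeFactory`, and `run_all_evals` are hypothetical stand-ins, the `execute` and `dispose` method names are assumed rather than taken from the real runtime, and a fresh UUID stands in for the per-eval execution_id. Only the `new_runtime(entrypoint=..., runtime_id=...)` call shape comes from the snippets above.

```python
import asyncio
import uuid


class FakeRuntime:
    """Hypothetical stand-in for a runtime; only records its id and disposal."""

    def __init__(self, runtime_id: str) -> None:
        self.runtime_id = runtime_id
        self.disposed = False

    async def execute(self, eval_input: str) -> str:
        return f"{self.runtime_id}: {eval_input}"

    async def dispose(self) -> None:
        self.disposed = True


class FakeFactory:
    """Hypothetical factory mirroring the new_runtime(entrypoint=..., runtime_id=...) call."""

    def __init__(self) -> None:
        self.created: list[FakeRuntime] = []

    async def new_runtime(self, entrypoint: str, runtime_id: str) -> FakeRuntime:
        runtime = FakeRuntime(runtime_id)
        self.created.append(runtime)
        return runtime


async def run_all_evals(factory: FakeFactory, eval_inputs: list[str]) -> list[str]:
    results = []
    for eval_input in eval_inputs:
        # Assumption: a fresh UUID stands in for the per-eval execution_id.
        execution_id = str(uuid.uuid4())
        runtime = await factory.new_runtime(entrypoint="", runtime_id=execution_id)
        try:
            results.append(await runtime.execute(eval_input))
        finally:
            # Dispose even if the eval raises, so no runtime outlives its eval.
            await runtime.dispose()
    return results


factory = FakeFactory()
results = asyncio.run(run_all_evals(factory, ["2+2", "3*3"]))
runtime_ids = [r.runtime_id for r in factory.created]
```

Creating the runtime inside the loop costs one extra construction per eval, which is the trade-off PR #1055 was trying to avoid, but it guarantees that no two evals share a thread.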

Test plan

  • Run evaluations with multiple sequential evals (e.g., calculator_same_as_agent with 5 evals)
  • Verify all evals run with isolated state
  • Verify linting passes

🤖 Generated with Claude Code

@github-actions bot added the labels test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (triggers tests in the uipath-llamaindex-python repository) on Jan 5, 2026
Each eval execution now gets its own runtime with a unique runtime_id
(using execution_id). This ensures each eval has its own LangGraph
thread_id with clean state, preventing message accumulation across
sequential eval runs.

Previously, all evals shared a single runtime with the same thread_id,
causing the LangGraph checkpointer to persist and accumulate messages
across eval runs, leading to 400 bad request errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Chibionos Chibionos force-pushed the fix/eval-sequential-runtime-isolation branch from 9c4fa08 to cc1896b Compare January 6, 2026 00:05

@saksharthakkar saksharthakkar left a comment

left a comment, other than that... lgtm!

If new_runtime fails, eval_runtime would be unassigned and the finally
block would raise NameError when trying to dispose. Initialize to None
and check before disposing.
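
The guard described in this commit message can be sketched as follows. `FailingFactory` and `run_eval` are hypothetical names used only for illustration, and `dispose` is an assumed method name. Strictly speaking, reading an unassigned local raises `UnboundLocalError`, which is a subclass of `NameError`.

```python
import asyncio


class FailingFactory:
    """Hypothetical factory whose new_runtime always fails."""

    async def new_runtime(self, entrypoint: str, runtime_id: str):
        raise RuntimeError("could not create runtime")


async def run_eval(factory, execution_id: str) -> str:
    eval_runtime = None  # Bound before the try block so finally can always read it.
    try:
        eval_runtime = await factory.new_runtime(entrypoint="", runtime_id=execution_id)
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"
    finally:
        # Without the None initialisation, this check would raise when new_runtime fails.
        if eval_runtime is not None:
            await eval_runtime.dispose()


outcome = asyncio.run(run_eval(FailingFactory(), "eval-1"))
print(outcome)  # failed: could not create runtime
```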


@akshaylive akshaylive left a comment

Thanks for the fix!

@akshaylive

One minor comment: can you please rename to eval_execution_id and runtime_execution_id for clarity? I think the lack of distinction is what led to the bug to begin with.

@Chibionos Chibionos merged commit 2a557c8 into main Jan 6, 2026
114 of 115 checks passed
@Chibionos Chibionos deleted the fix/eval-sequential-runtime-isolation branch January 6, 2026 02:57