50 changes: 0 additions & 50 deletions docs/openspec/changes/add-eval-generator/design.md

This file was deleted.

17 changes: 0 additions & 17 deletions docs/openspec/changes/add-eval-generator/proposal.md

This file was deleted.

35 changes: 0 additions & 35 deletions docs/openspec/changes/add-eval-generator/specs/cli/spec.md

This file was deleted.

8 changes: 0 additions & 8 deletions docs/openspec/changes/add-eval-generator/tasks.md

This file was deleted.

59 changes: 59 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/design.md
@@ -0,0 +1,59 @@
# Design: TypeScript Template Literals for Evaluator Prompts

## Architecture
The core idea is to use TypeScript's first-class functions in place of a custom string templating engine for code-defined evaluators.

### `PromptTemplate` Type
We will define a type alias for the prompt generation function:

```typescript
export type PromptTemplate = (context: EvaluationContext) => string;
```

### `LlmJudgeEvaluator` Updates
The `LlmJudgeEvaluator` currently holds an optional `evaluatorTemplate` string. We will expand this to a union type:

```typescript
export interface LlmJudgeEvaluatorOptions {
  // ...
  readonly evaluatorTemplate?: string | PromptTemplate;
}
```

In the `evaluateWithPrompt` method, we will check the type of `evaluatorTemplate`:
1. If it's a function, we call it with the current `EvaluationContext`.
2. If it's a string (or undefined, falling back to the default template), we proceed with the existing string-substitution logic.
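
A minimal sketch of that dispatch (names such as `renderTemplate` and `DEFAULT_EVALUATOR_TEMPLATE` are illustrative stand-ins for the existing string-substitution path, not the actual implementation):

```typescript
import type { EvaluationContext } from '@agentv/core';

type PromptTemplate = (context: EvaluationContext) => string;

// Hypothetical helpers standing in for the current string-substitution logic.
declare function renderTemplate(template: string, context: EvaluationContext): string;
declare const DEFAULT_EVALUATOR_TEMPLATE: string;

function buildEvaluatorPrompt(
  evaluatorTemplate: string | PromptTemplate | undefined,
  context: EvaluationContext,
): string {
  if (typeof evaluatorTemplate === 'function') {
    // Function templates receive the full context and return the prompt directly.
    return evaluatorTemplate(context);
  }
  // Strings (or undefined, which falls back to the default) keep the existing
  // {{placeholder}} substitution behaviour.
  return renderTemplate(evaluatorTemplate ?? DEFAULT_EVALUATOR_TEMPLATE, context);
}
```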

### Evaluator Loading
To support the "simplified DX" where users just export a function, we need to update `resolveCustomPrompt` (or the relevant loading logic) in `orchestrator.ts`.

We will use a library like `jiti` to dynamically import TypeScript files at runtime.

**User Contract:**
The user's TypeScript file should export a function named `prompt` or a default export that matches the `PromptTemplate` signature.

```typescript
// my-evaluator.ts
import type { EvaluationContext } from '@agentv/core';

export const prompt = (context: EvaluationContext) => `
Question: ${context.promptInputs.question}
...
`;
```

**Loader Logic:**
1. Check whether the `prompt` path ends in `.ts` or `.js`.
2. Use `jiti` (or dynamic import) to load the module.
3. Extract the `prompt` export or `default` export.
4. Validate it is a function.
5. Return it as the `evaluatorTemplateOverride`.
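
A rough sketch of that loader, assuming jiti v2's `createJiti`/`import` API (the exact call shape depends on the jiti version we pick; `loadPromptTemplate` is a hypothetical helper standing in for the hook in `resolveCustomPrompt`):

```typescript
import { createJiti } from 'jiti';
import type { EvaluationContext } from '@agentv/core';

type PromptTemplate = (context: EvaluationContext) => string;

async function loadPromptTemplate(promptPath: string): Promise<PromptTemplate> {
  // jiti transpiles the TypeScript module on the fly, so no pre-build is needed.
  const jiti = createJiti(import.meta.url);
  const mod = (await jiti.import(promptPath)) as Record<string, unknown>;

  // Prefer the named `prompt` export, fall back to the default export.
  const candidate = mod.prompt ?? mod.default;
  if (typeof candidate !== 'function') {
    throw new Error(`Expected ${promptPath} to export a prompt function`);
  }
  return candidate as PromptTemplate;
}
```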

## Trade-offs
- **Runtime Dependency**: `jiti` introduces a new runtime dependency, but it enables a seamless TypeScript experience.
- **Security**: Loading and executing user code implies trust. This is already the case with `code` evaluators, but now we are running it in-process (if using `jiti`). This is acceptable for a CLI tool intended to be run by developers on their own code.

## Alternatives
- **Nunjucks/Handlebars**: We could integrate a full templating engine, but that adds runtime weight and doesn't solve the type safety issue for code-defined evaluators.
- **JSX/Svelte**: We could use component-based rendering, but that requires a build step and is overkill for the current needs of the CLI agent.

21 changes: 21 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/proposal.md
@@ -0,0 +1,21 @@
# Adopt TypeScript Template Literals for Custom Evaluator Prompts

## Summary
Enable the use of native TypeScript template literals for defining custom evaluator prompts in code-based evaluators. This provides type safety and a better developer experience than string-based templating with placeholders, and removes the runtime placeholder-substitution step.

## Problem
Currently, `LlmJudgeEvaluator` relies on string templates with `{{variable}}` placeholders. This approach:
- Lacks type safety: There is no compile-time check that referenced variables exist in the context.
- Has limited logic: Conditionals and loops require a heavier template syntax or are impossible to express.
- Is error-prone: Typos in placeholders are only caught at runtime.

## Solution
1. Introduce a `PromptTemplate` function type that accepts `EvaluationContext` and returns a string.
2. Update `LlmJudgeEvaluator` to accept this function type.
3. Enhance the evaluator loader to support loading `.ts` files directly. Users can simply export a `prompt` function from a TypeScript file, and AgentV will load and use it as the evaluator template.

## Impact
- **Core**: `LlmJudgeEvaluator`, `orchestrator.ts`, and loader logic.
- **DX**: Developers can write custom evaluators as simple TypeScript functions without boilerplate.
- **Dependencies**: May require adding a library like `jiti` to support runtime TypeScript loading.
- **Backward Compatibility**: Existing string-based templates will continue to work.
@@ -0,0 +1,56 @@
# Spec: Custom Evaluator Prompts

## ADDED Requirements

### Requirement: Support Function-Based Prompt Templates
The `LlmJudgeEvaluator` MUST support `PromptTemplate` functions in addition to string templates.

#### Scenario: Using a function template
Given a custom evaluator defined in code
When I pass a function as `evaluatorTemplate` that uses template literals
Then the evaluator should use the output of that function as the prompt
And the function should receive the full `EvaluationContext`

```typescript
const myEvaluator = new LlmJudgeEvaluator({
  resolveJudgeProvider: myResolver,
  evaluatorTemplate: (context) => `
Analyze the following:
Question: ${context.promptInputs.question}
Answer: ${context.candidate}
`
});
```

#### Scenario: Backward compatibility with string templates
Given an existing evaluator configuration using a string template
When I run the evaluator
Then it should continue to function using the string substitution logic
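
For contrast, a string template under the existing substitution path might look like this (the placeholder names are illustrative, not a confirmed list of supported variables):

```typescript
const legacyEvaluator = new LlmJudgeEvaluator({
  resolveJudgeProvider: myResolver,
  // {{...}} placeholders are resolved by the existing substitution logic.
  evaluatorTemplate: 'Question: {{question}}\nAnswer: {{candidate}}\nScore the answer from 0 to 1.',
});
```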

### Requirement: Load Prompt from TypeScript File
The system MUST support loading a `PromptTemplate` function from a user-provided TypeScript file.

#### Scenario: Loading a named export
Given a TypeScript file `my-prompt.ts` that exports a `prompt` function
And an eval case configuration that points to this file
When the evaluator runs
Then it should load the file and use the exported `prompt` function as the template

#### Scenario: Loading a default export
Given a TypeScript file `my-prompt.ts` that has a default export of a function
And an eval case configuration that points to this file
When the evaluator runs
Then it should load the file and use the default export as the template
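
A minimal sketch of such a file:

```typescript
// my-prompt.ts
import type { EvaluationContext } from '@agentv/core';

export default (context: EvaluationContext) => `
Question: ${context.promptInputs.question}
Answer: ${context.candidate}
`;
```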

#### Scenario: Runtime TypeScript support
Given the agentv CLI running in a standard Node.js environment
When it loads a `.ts` prompt file
Then it should successfully compile/transpile and load the module without requiring the user to pre-compile it

### Requirement: Type Definitions
The `PromptTemplate` type MUST be exported and available for consumers.

#### Scenario: Type checking
Given a developer writing a custom prompt function
When they type the function argument
Then TypeScript should infer the `EvaluationContext` type and provide autocomplete for properties like `candidate`, `promptInputs`, etc.
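
For example, annotating the function with the exported type (assuming `PromptTemplate` is exported from `@agentv/core`) gives the parameter the `EvaluationContext` type:

```typescript
import type { PromptTemplate } from '@agentv/core';

// The annotation types `context`, so editors autocomplete properties such as
// `candidate` and `promptInputs`.
export const prompt: PromptTemplate = (context) => `
Question: ${context.promptInputs.question}
Candidate answer: ${context.candidate}
`;
```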
9 changes: 9 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/tasks.md
@@ -0,0 +1,9 @@
# Tasks: Adopt TypeScript Template Literals for Custom Evaluator Prompts

- [ ] Add `jiti` (or equivalent) dependency to `@agentv/core` <!-- id: add-dep -->
- [ ] Define `PromptTemplate` type in `packages/core/src/evaluation/types.ts` <!-- id: define-type -->
- [ ] Update `LlmJudgeEvaluatorOptions` in `packages/core/src/evaluation/evaluators.ts` to accept `PromptTemplate` <!-- id: update-options -->
- [ ] Update `LlmJudgeEvaluator` implementation to handle `PromptTemplate` functions <!-- id: update-impl -->
- [ ] Update `resolveCustomPrompt` in `orchestrator.ts` to load `.ts` files using `jiti` <!-- id: update-loader -->
- [ ] Add unit tests for function-based prompt templates and loading <!-- id: add-tests -->
- [ ] Create a sample custom evaluator using the new pattern <!-- id: create-sample -->
14 changes: 7 additions & 7 deletions docs/openspec/specs/eval-execution/spec.md
@@ -7,11 +7,11 @@ Define how eval execution formats `input_messages` into `raw_request` for candid

The system SHALL format `input_messages` to preserve conversation turn boundaries when there is actual conversational structure requiring role disambiguation.

#### Scenario: Single system and user message (backward compatibility)
#### Scenario: Single system and user message

- **WHEN** an eval case has one system message with text content and one user message in `input_messages`
- **THEN** the `raw_request.question` field contains the system text followed by the user message content, flattened without role markers
- **AND** maintains current backward-compatible behavior
- **THEN** the `raw_request.question` field contains the system text and user message formatted with role markers
- **AND** the role markers are `@[System]:` and `@[User]:`

#### Scenario: System file attachment with user message (no role markers needed)

@@ -24,7 +24,7 @@ The system SHALL format `input_messages` to preserve conversation turn boundarie

- **WHEN** an eval case has `input_messages` including one or more non-user messages (assistant, tool, etc.)
- **THEN** the `raw_request.question` field contains all messages formatted with clear turn boundaries
- **AND** each turn is prefixed with its role (`[System]:`, `[User]:`, `[Assistant]:`, `[Tool]:`) on its own line
- **AND** each turn is prefixed with its role (`@[System]:`, `@[User]:`, `@[Assistant]:`, `@[Tool]:`) on its own line
- **AND** file attachments in content blocks are embedded inline within their respective turn
- **AND** blank lines separate consecutive turns for readability

@@ -60,14 +60,14 @@ The system SHALL preserve conversation structure when building evaluator prompts
#### Scenario: Multi-turn conversation evaluation

- **WHEN** the evaluator judges a candidate answer from a multi-turn conversation
- **THEN** the `evaluator_raw_request.prompt` under `[[ ## question ## ]]` contains the formatted conversation with role markers
- **AND** the role markers match the format used in `raw_request.question` (e.g., `[System]:`, `[User]:`, `[Assistant]:`)
- **THEN** the `evaluator_provider_request.prompt` under `[[ ## question ## ]]` contains the formatted conversation with role markers
- **AND** the role markers match the format used in `raw_request.question` (e.g., `@[System]:`, `@[User]:`, `@[Assistant]:`)
- **AND** the evaluator can distinguish between different conversation turns

#### Scenario: Single-turn conversation evaluation

- **WHEN** the evaluator judges a candidate answer from a single-turn interaction
- **THEN** the `evaluator_raw_request.prompt` under `[[ ## question ## ]]` contains the flat-formatted question without role markers
- **THEN** the `evaluator_provider_request.prompt` under `[[ ## question ## ]]` contains the flat-formatted question without role markers
- **AND** maintains backward compatibility with existing evaluations

#### Scenario: Evaluator receives same context as candidate
17 changes: 8 additions & 9 deletions docs/openspec/specs/evaluation/spec.md
@@ -51,47 +51,46 @@ The system SHALL instruct LLM judge evaluators to emit a single JSON object and
#### Scenario: LLM judge evaluation failure

- **WHEN** an LLM judge evaluator fails to parse a valid score
- **THEN** the system logs a warning
- **AND** returns a score of 0 with the raw response for debugging
- **THEN** the system returns a score of 0

### Requirement: Provider Integration

The system SHALL support multiple LLM providers with environment-based configuration and optional retry settings.
The system SHALL support multiple LLM providers with configuration resolved from target definitions (typically referencing environment variables) and optional retry settings.

#### Scenario: Azure OpenAI provider

- **WHEN** a test case uses the "azure-openai" provider
- **THEN** the system reads `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_DEPLOYMENT_NAME` from environment
- **THEN** the system resolves `endpoint`, `api_key`, and `deployment` from the target configuration
- **AND** invokes Azure OpenAI with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: Anthropic provider

- **WHEN** a test case uses the "anthropic" provider
- **THEN** the system reads `ANTHROPIC_API_KEY` from environment
- **THEN** the system resolves `api_key` from the target configuration
- **AND** invokes Anthropic Claude with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: Google Gemini provider

- **WHEN** a test case uses the "gemini" provider
- **THEN** the system reads `GOOGLE_API_KEY` from environment
- **AND** optionally reads `GOOGLE_GEMINI_MODEL` to override the default model
- **THEN** the system resolves `api_key` from the target configuration
- **AND** optionally resolves `model` to override the default model
- **AND** invokes Google Gemini with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: VS Code Copilot provider

- **WHEN** a test case uses the "vscode-copilot" provider
- **THEN** the system generates a structured prompt file with preread block and SHA tokens
- **THEN** the system generates a structured prompt file with preread block
- **AND** invokes the subagent library to execute the prompt
- **AND** captures the Copilot response

#### Scenario: Codex CLI provider

- **WHEN** a test case uses the "codex" provider
- **THEN** the system locates the Codex CLI executable (default `codex`, overrideable via the target)
- **AND** it mirrors guideline and attachment files into a scratch workspace, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
- **AND** it generates a prompt file with references to the original guideline and attachment files, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
- **AND** it renders the eval prompt into a single string and launches `codex exec --json` plus any configured profile, model, approval preset, and working-directory overrides defined on the target
- **AND** it verifies the Codex executable is available while delegating profile/config resolution to the CLI itself
- **AND** it parses the emitted JSONL event stream to capture the final assistant message as the provider response, attaching stdout/stderr when the CLI exits non-zero or returns malformed JSON