50 changes: 0 additions & 50 deletions docs/openspec/changes/add-eval-generator/design.md

This file was deleted.

17 changes: 0 additions & 17 deletions docs/openspec/changes/add-eval-generator/proposal.md

This file was deleted.

35 changes: 0 additions & 35 deletions docs/openspec/changes/add-eval-generator/specs/cli/spec.md

This file was deleted.

8 changes: 0 additions & 8 deletions docs/openspec/changes/add-eval-generator/tasks.md

This file was deleted.

59 changes: 59 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/design.md
@@ -0,0 +1,59 @@
# Design: TypeScript Template Literals for Evaluator Prompts

## Architecture
The core idea is to use TypeScript's first-class functions in place of a custom string templating engine for code-defined evaluators.

### `PromptTemplate` Type
We will define a type alias for the prompt generation function:

```typescript
export type PromptTemplate = (context: EvaluationContext) => string;
```

### `LlmJudgeEvaluator` Updates
The `LlmJudgeEvaluator` currently holds an optional `evaluatorTemplate` string. We will expand this to a union type:

```typescript
export interface LlmJudgeEvaluatorOptions {
  // ...
  readonly evaluatorTemplate?: string | PromptTemplate;
}
```

In the `evaluateWithPrompt` method, we will check the type of `evaluatorTemplate`:
1. If it's a function, we call it with the current `EvaluationContext`.
2. If it's a string (or undefined, falling back to the default template), we proceed with the existing string-substitution logic.
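
A minimal sketch of that dispatch (names such as `renderTemplate` and `DEFAULT_EVALUATOR_TEMPLATE` are illustrative stand-ins for the existing string-substitution path, not the actual implementation):

```typescript
import type { EvaluationContext } from '@agentv/core';

type PromptTemplate = (context: EvaluationContext) => string;

// Hypothetical helpers standing in for the current string-substitution logic.
declare function renderTemplate(template: string, context: EvaluationContext): string;
declare const DEFAULT_EVALUATOR_TEMPLATE: string;

function buildEvaluatorPrompt(
  evaluatorTemplate: string | PromptTemplate | undefined,
  context: EvaluationContext,
): string {
  if (typeof evaluatorTemplate === 'function') {
    // Function templates receive the full context and return the prompt directly.
    return evaluatorTemplate(context);
  }
  // Strings (or undefined, which falls back to the default) keep the existing
  // {{placeholder}} substitution behaviour.
  return renderTemplate(evaluatorTemplate ?? DEFAULT_EVALUATOR_TEMPLATE, context);
}
```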

### Evaluator Loading
To support the "simplified DX" where users just export a function, we need to update `resolveCustomPrompt` (or the relevant loading logic) in `orchestrator.ts`.

We will use a library like `jiti` to dynamically import TypeScript files at runtime.

**User Contract:**
The user's TypeScript file should export a function named `prompt` or a default export that matches the `PromptTemplate` signature.

```typescript
// my-evaluator.ts
import type { EvaluationContext } from '@agentv/core';

export const prompt = (context: EvaluationContext) => `
Question: ${context.promptInputs.question}
...
`;
```

**Loader Logic:**
1. Check whether the `prompt` path ends in `.ts` or `.js`.
2. Use `jiti` (or dynamic import) to load the module.
3. Extract the `prompt` export or `default` export.
4. Validate it is a function.
5. Return it as the `evaluatorTemplateOverride`.
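
A rough sketch of that loader, assuming jiti v2's `createJiti`/`import` API (the exact call shape depends on the jiti version we pick; `loadPromptTemplate` is a hypothetical helper standing in for the hook in `resolveCustomPrompt`):

```typescript
import { createJiti } from 'jiti';
import type { EvaluationContext } from '@agentv/core';

type PromptTemplate = (context: EvaluationContext) => string;

async function loadPromptTemplate(promptPath: string): Promise<PromptTemplate> {
  // jiti transpiles the TypeScript module on the fly, so no pre-build is needed.
  const jiti = createJiti(import.meta.url);
  const mod = (await jiti.import(promptPath)) as Record<string, unknown>;

  // Prefer the named `prompt` export, fall back to the default export.
  const candidate = mod.prompt ?? mod.default;
  if (typeof candidate !== 'function') {
    throw new Error(`Expected ${promptPath} to export a prompt function`);
  }
  return candidate as PromptTemplate;
}
```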

## Trade-offs
- **Runtime Dependency**: `jiti` introduces a new runtime dependency, but it enables a seamless TypeScript experience.
- **Security**: Loading and executing user code implies trust. This is already the case with `code` evaluators, but now we are running it in-process (if using `jiti`). This is acceptable for a CLI tool intended to be run by developers on their own code.

## Alternatives
- **Nunjucks/Handlebars**: We could integrate a full templating engine, but that adds runtime weight and doesn't solve the type safety issue for code-defined evaluators.
- **JSX/Svelte**: We could use component-based rendering, but that requires a build step and is overkill for the current needs of the CLI agent.

21 changes: 21 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/proposal.md
@@ -0,0 +1,21 @@
# Adopt TypeScript Template Literals for Custom Evaluator Prompts

## Summary
Enable the use of native TypeScript template literals for defining custom evaluator prompts in code-based evaluators. This provides type safety and a better developer experience than string-based templating with placeholders, and removes the runtime placeholder-substitution step.

## Problem
Currently, `LlmJudgeEvaluator` relies on string templates with `{{variable}}` placeholders. This approach:
- Lacks type safety: There is no compile-time check that referenced variables exist in the context.
- Has limited logic: Conditionals and loops require a heavier template syntax or are impossible to express.
- Is error-prone: Typos in placeholders are only caught at runtime.

## Solution
1. Introduce a `PromptTemplate` function type that accepts `EvaluationContext` and returns a string.
2. Update `LlmJudgeEvaluator` to accept this function type.
3. Enhance the evaluator loader to support loading `.ts` files directly. Users can simply export a `prompt` function from a TypeScript file, and AgentV will load and use it as the evaluator template.

## Impact
- **Core**: `LlmJudgeEvaluator`, `orchestrator.ts`, and loader logic.
- **DX**: Developers can write custom evaluators as simple TypeScript functions without boilerplate.
- **Dependencies**: May require adding a library like `jiti` to support runtime TypeScript loading.
- **Backward Compatibility**: Existing string-based templates will continue to work.
@@ -0,0 +1,56 @@
# Spec: Custom Evaluator Prompts

## ADDED Requirements

### Requirement: Support Function-Based Prompt Templates
The `LlmJudgeEvaluator` MUST support `PromptTemplate` functions in addition to string templates.

#### Scenario: Using a function template
Given a custom evaluator defined in code
When I pass a function as `evaluatorTemplate` that uses template literals
Then the evaluator should use the output of that function as the prompt
And the function should receive the full `EvaluationContext`

```typescript
const myEvaluator = new LlmJudgeEvaluator({
  resolveJudgeProvider: myResolver,
  evaluatorTemplate: (context) => `
Analyze the following:
Question: ${context.promptInputs.question}
Answer: ${context.candidate}
`
});
```

#### Scenario: Backward compatibility with string templates
Given an existing evaluator configuration using a string template
When I run the evaluator
Then it should continue to function using the string substitution logic
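
For contrast, a string template under the existing substitution path might look like this (the placeholder names are illustrative, not a confirmed list of supported variables):

```typescript
const legacyEvaluator = new LlmJudgeEvaluator({
  resolveJudgeProvider: myResolver,
  // {{...}} placeholders are resolved by the existing substitution logic.
  evaluatorTemplate: 'Question: {{question}}\nAnswer: {{candidate}}\nScore the answer from 0 to 1.',
});
```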

### Requirement: Load Prompt from TypeScript File
The system MUST support loading a `PromptTemplate` function from a user-provided TypeScript file.

#### Scenario: Loading a named export
Given a TypeScript file `my-prompt.ts` that exports a `prompt` function
And an eval case configuration that points to this file
When the evaluator runs
Then it should load the file and use the exported `prompt` function as the template

#### Scenario: Loading a default export
Given a TypeScript file `my-prompt.ts` that has a default export of a function
And an eval case configuration that points to this file
When the evaluator runs
Then it should load the file and use the default export as the template
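
A minimal sketch of such a file:

```typescript
// my-prompt.ts
import type { EvaluationContext } from '@agentv/core';

export default (context: EvaluationContext) => `
Question: ${context.promptInputs.question}
Answer: ${context.candidate}
`;
```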

#### Scenario: Runtime TypeScript support
Given the agentv CLI running in a standard Node.js environment
When it loads a `.ts` prompt file
Then it should successfully compile/transpile and load the module without requiring the user to pre-compile it

### Requirement: Type Definitions
The `PromptTemplate` type MUST be exported and available for consumers.

#### Scenario: Type checking
Given a developer writing a custom prompt function
When they type the function argument
Then TypeScript should infer the `EvaluationContext` type and provide autocomplete for properties like `candidate`, `promptInputs`, etc.
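
For example, annotating the function with the exported type (assuming `PromptTemplate` is exported from `@agentv/core`) gives the parameter the `EvaluationContext` type:

```typescript
import type { PromptTemplate } from '@agentv/core';

// The annotation types `context`, so editors autocomplete properties such as
// `candidate` and `promptInputs`.
export const prompt: PromptTemplate = (context) => `
Question: ${context.promptInputs.question}
Candidate answer: ${context.candidate}
`;
```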
9 changes: 9 additions & 0 deletions docs/openspec/changes/adopt-ts-template-prompts/tasks.md
@@ -0,0 +1,9 @@
# Tasks: Adopt TypeScript Template Literals for Custom Evaluator Prompts

- [ ] Add `jiti` (or equivalent) dependency to `@agentv/core` <!-- id: add-dep -->
- [ ] Define `PromptTemplate` type in `packages/core/src/evaluation/types.ts` <!-- id: define-type -->
- [ ] Update `LlmJudgeEvaluatorOptions` in `packages/core/src/evaluation/evaluators.ts` to accept `PromptTemplate` <!-- id: update-options -->
- [ ] Update `LlmJudgeEvaluator` implementation to handle `PromptTemplate` functions <!-- id: update-impl -->
- [ ] Update `resolveCustomPrompt` in `orchestrator.ts` to load `.ts` files using `jiti` <!-- id: update-loader -->
- [ ] Add unit tests for function-based prompt templates and loading <!-- id: add-tests -->
- [ ] Create a sample custom evaluator using the new pattern <!-- id: create-sample -->
14 changes: 7 additions & 7 deletions docs/openspec/specs/eval-execution/spec.md
@@ -7,11 +7,11 @@ Define how eval execution formats `input_messages` into `raw_request` for candid

The system SHALL format `input_messages` to preserve conversation turn boundaries when there is actual conversational structure requiring role disambiguation.

#### Scenario: Single system and user message (backward compatibility)
#### Scenario: Single system and user message

- **WHEN** an eval case has one system message with text content and one user message in `input_messages`
- **THEN** the `raw_request.question` field contains the system text followed by the user message content, flattened without role markers
- **AND** maintains current backward-compatible behavior
- **THEN** the `raw_request.question` field contains the system text and user message formatted with role markers
- **AND** the role markers are `@[System]:` and `@[User]:`

#### Scenario: System file attachment with user message (no role markers needed)

@@ -24,7 +24,7 @@ The system SHALL format `input_messages` to preserve conversation turn boundarie

- **WHEN** an eval case has `input_messages` including one or more non-user messages (assistant, tool, etc.)
- **THEN** the `raw_request.question` field contains all messages formatted with clear turn boundaries
- **AND** each turn is prefixed with its role (`[System]:`, `[User]:`, `[Assistant]:`, `[Tool]:`) on its own line
- **AND** each turn is prefixed with its role (`@[System]:`, `@[User]:`, `@[Assistant]:`, `@[Tool]:`) on its own line
- **AND** file attachments in content blocks are embedded inline within their respective turn
- **AND** blank lines separate consecutive turns for readability

@@ -60,14 +60,14 @@ The system SHALL preserve conversation structure when building evaluator prompts
#### Scenario: Multi-turn conversation evaluation

- **WHEN** the evaluator judges a candidate answer from a multi-turn conversation
- **THEN** the `evaluator_raw_request.prompt` under `[[ ## question ## ]]` contains the formatted conversation with role markers
- **AND** the role markers match the format used in `raw_request.question` (e.g., `[System]:`, `[User]:`, `[Assistant]:`)
- **THEN** the `evaluator_provider_request.prompt` under `[[ ## question ## ]]` contains the formatted conversation with role markers
- **AND** the role markers match the format used in `raw_request.question` (e.g., `@[System]:`, `@[User]:`, `@[Assistant]:`)
- **AND** the evaluator can distinguish between different conversation turns

#### Scenario: Single-turn conversation evaluation

- **WHEN** the evaluator judges a candidate answer from a single-turn interaction
- **THEN** the `evaluator_raw_request.prompt` under `[[ ## question ## ]]` contains the flat-formatted question without role markers
- **THEN** the `evaluator_provider_request.prompt` under `[[ ## question ## ]]` contains the flat-formatted question without role markers
- **AND** maintains backward compatibility with existing evaluations

#### Scenario: Evaluator receives same context as candidate
17 changes: 8 additions & 9 deletions docs/openspec/specs/evaluation/spec.md
@@ -51,47 +51,46 @@ The system SHALL instruct LLM judge evaluators to emit a single JSON object and
#### Scenario: LLM judge evaluation failure

- **WHEN** an LLM judge evaluator fails to parse a valid score
- **THEN** the system logs a warning
- **AND** returns a score of 0 with the raw response for debugging
- **THEN** the system returns a score of 0

### Requirement: Provider Integration

The system SHALL support multiple LLM providers with environment-based configuration and optional retry settings.
The system SHALL support multiple LLM providers with configuration resolved from target definitions (typically referencing environment variables) and optional retry settings.

#### Scenario: Azure OpenAI provider

- **WHEN** a test case uses the "azure-openai" provider
- **THEN** the system reads `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_DEPLOYMENT_NAME` from environment
- **THEN** the system resolves `endpoint`, `api_key`, and `deployment` from the target configuration
- **AND** invokes Azure OpenAI with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: Anthropic provider

- **WHEN** a test case uses the "anthropic" provider
- **THEN** the system reads `ANTHROPIC_API_KEY` from environment
- **THEN** the system resolves `api_key` from the target configuration
- **AND** invokes Anthropic Claude with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: Google Gemini provider

- **WHEN** a test case uses the "gemini" provider
- **THEN** the system reads `GOOGLE_API_KEY` from environment
- **AND** optionally reads `GOOGLE_GEMINI_MODEL` to override the default model
- **THEN** the system resolves `api_key` from the target configuration
- **AND** optionally resolves `model` to override the default model
- **AND** invokes Google Gemini with the configured settings
- **AND** applies any retry configuration specified in the target definition

#### Scenario: VS Code Copilot provider

- **WHEN** a test case uses the "vscode-copilot" provider
- **THEN** the system generates a structured prompt file with preread block and SHA tokens
- **THEN** the system generates a structured prompt file with preread block
- **AND** invokes the subagent library to execute the prompt
- **AND** captures the Copilot response

#### Scenario: Codex CLI provider

- **WHEN** a test case uses the "codex" provider
- **THEN** the system locates the Codex CLI executable (default `codex`, overrideable via the target)
- **AND** it mirrors guideline and attachment files into a scratch workspace, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
- **AND** it generates a prompt file with references to the original guideline and attachment files, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
- **AND** it renders the eval prompt into a single string and launches `codex exec --json` plus any configured profile, model, approval preset, and working-directory overrides defined on the target
- **AND** it verifies the Codex executable is available while delegating profile/config resolution to the CLI itself
- **AND** it parses the emitted JSONL event stream to capture the final assistant message as the provider response, attaching stdout/stderr when the CLI exits non-zero or returns malformed JSON