diff --git a/docs/examples/simple/README.md b/docs/examples/simple/README.md index 4995c87..e2c2982 100644 --- a/docs/examples/simple/README.md +++ b/docs/examples/simple/README.md @@ -68,3 +68,16 @@ simple/ - **`python.instructions.md`**: Python coding guidelines - **`javascript.instructions.md`**: JavaScript coding guidelines - These instruction files can be referenced in eval files to provide context + +## Next Steps + +1. **Try running the examples**: Use `agentv eval evals/coding/example-eval.yaml` +2. **Modify eval cases**: Experiment with your own test scenarios +3. **Explore advanced examples**: See [../advanced/](../advanced/) for production patterns +4. **Create your own evals**: Use these as templates for your domain + +## Related Documentation + +- [Advanced Examples](../advanced/README.md) - Production-ready scenarios and optimization +- [AgentV Schema V2](../../features/schema-v2.md) - Full schema reference +- [Local CLI Provider Guide](../../features/local-cli-provider.md) - Custom CLI integration diff --git a/docs/examples/simple/evals/coding/example-eval.yaml b/docs/examples/simple/evals/coding/example-eval.yaml index aa803f5..9fe688a 100644 --- a/docs/examples/simple/evals/coding/example-eval.yaml +++ b/docs/examples/simple/evals/coding/example-eval.yaml @@ -2,7 +2,7 @@ # Demonstrates schema features with real file references and minimal redundancy $schema: agentv-eval-v2 -description: Example showing basic features, conversation threading, multiple evaluators, and ACE optimization +description: Example showing basic features, conversation threading, and multiple evaluators # File-level default target (optional) # Can be overridden per eval case via execution.target diff --git a/docs/examples/simple/evals/incident-triage/incident-severity.yaml b/docs/examples/simple/evals/incident-triage/incident-severity.yaml new file mode 100644 index 0000000..7c140bf --- /dev/null +++ b/docs/examples/simple/evals/incident-triage/incident-severity.yaml @@ -0,0 +1,583 @@ +# ACE Optimization Example - Incident Severity Classification +# This eval file is designed to work with ACE optimization +# Key features for ACE: +# 1. System prompts that elicit step-by-step reasoning (captured as generator_reasoning) +# 2. Clear expected outcomes for the Reflector to analyze +# 3. Diverse examples covering edge cases that benefit from playbook learning + +$schema: agentv-eval-v2 +description: Incident severity classification eval designed for ACE optimization + +target: default + +evalcases: + # ========================================== + # Basic severity classification with explicit reasoning + # ========================================== + - id: severity-global-maintenance + conversation_id: incident-triage + + outcome: | + Assistant correctly classifies as 'medium' severity despite global scope, + because it's planned maintenance with acceptable degradation. Reasoning should + distinguish between unplanned outages and planned maintenance windows. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact.
+ - role: user + content: | + Classify the severity of this incident: + + Ticket: Global search results delayed during planned reindex + Impact: Catalog searchable but updates appear 20 minutes late + Scope: global + Signals: Maintenance ticket CAB-512 approved, no customer complaints + + expected_messages: + - role: assistant + content: | + Let me analyze this incident step-by-step: + + 1. Scope Analysis: Global scope typically indicates high severity + 2. Planned vs Unplanned: This is approved planned maintenance (CAB-512) + 3. Business Impact: Catalog remains searchable, only update latency affected + 4. Customer Impact: No customer complaints reported + + Severity: medium + + Rationale: While global in scope, this is planned maintenance with acceptable + degradation. Core functionality (search) remains operational. The absence of + customer complaints and pre-approval reduces severity from what would otherwise + be high for a global issue. + + - id: severity-internal-low-priority + conversation_id: incident-triage + + outcome: | + Assistant correctly classifies as 'low' severity for internal-only impact + with no customer exposure. Reasoning should explain why internal scope + reduces urgency even when functionality is completely down. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: Internal analytics dashboard shows stale finance data + Impact: Financial planning team delayed; no external exposure + Scope: internal + Signals: ETL job lagging 60 minutes, status yellow + + expected_messages: + - role: assistant + content: | + Let me analyze this incident step-by-step: + + 1. Scope Analysis: Internal-only, no customer-facing impact + 2. Business Impact: Planning team inconvenienced but no revenue/compliance risk + 3. Operational Impact: ETL lag is concerning but not critical + 4. External Exposure: Explicitly stated as none + + Severity: low + + Rationale: Internal scope with no customer exposure, no revenue impact, and + no compliance risk justifies low severity. While the finance team is affected, + the 60-minute lag is an inconvenience rather than a blocker. Status yellow + (vs red) confirms this is degraded but functional. + + # ========================================== + # Advanced Cases: Ambiguous Signals & Nuanced Judgment + # These cases showcase ACE's ability to learn patterns beyond simple rules + # ========================================== + + - id: severity-contextual-timing-black-friday + conversation_id: incident-triage + + outcome: | + Assistant recognizes that identical technical issue has different severity based on timing. + 3% error rate during Black Friday is 'critical', same rate during off-peak is 'medium'. + Must reference business context and opportunity cost. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. 
- role: user + content: | + Classify the severity of this incident: + + Ticket: Checkout service returning 503 errors at 3% rate + Impact: ~300 failed transactions per minute + Scope: global + Signals: Black Friday (Nov 29), 10x normal traffic volume, revenue trending 40% below + forecast, competitor site performance normal + + expected_messages: + - role: assistant + content: | + **Critical revenue impact during peak business period** + + A 3% error rate would normally be 'medium' severity, but context elevates this to critical: + + - Timing: Black Friday - highest revenue day of the year + - Opportunity cost: 300 failed txns/min × ~$50 avg = $15K/min revenue loss + - Volume multiplier: 10x normal traffic means 3% affects 10x more customers + - Competitive pressure: Competitors functioning normally = customers will switch + - Forecast gap: 40% below target suggests systemic issue beyond normal variance + + **Severity: critical** + + Business context transforms a manageable technical issue into an existential threat. + The same 3% error rate during off-peak (affecting 30 txns/min, $1.5K/min) would be + 'medium' severity. Black Friday has a ~2-hour window for maximum revenue capture - + every minute counts. + + All hands needed: This is a "stop the line" moment. + + - id: severity-post-deployment-correlation + conversation_id: incident-triage + + outcome: | + Assistant elevates severity and confidence based on deployment correlation. + Recent code push changes this from 'investigate' to 'rollback immediately'. + Must recognize deployment timing as strong root cause signal. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: API gateway 500 errors increased from 0.1% to 2.5% + Impact: ~500 requests/min failing across all endpoints + Scope: global + Signals: Payment service v2.4.1 deployed 8 minutes ago, error spike began 30 seconds + post-deployment, no infrastructure changes, previous version (v2.4.0) stable for 2 weeks + + expected_messages: + - role: assistant + content: | + **Deployment-induced regression with clear rollback path** + + Strong temporal correlation between deployment and error spike indicates code regression: + + - Error baseline: 0.1% (20 errors/min) - consistently stable + - Post-deployment: 2.5% (500 errors/min) - 25x increase + - Timing precision: Spike began 30 seconds after deployment completion + - Stable history: v2.4.0 ran cleanly for 2 weeks + - Infrastructure: No platform changes - isolates to application code + + **Severity: high** + + Root cause confidence: 95%+ (deployment correlation) + Remediation path: Clear and fast (rollback to v2.4.0) + + The deployment correlation transforms this from "investigate obscure issue" to + "rollback immediately." While we could spend hours debugging v2.4.1, the smart + play is instant rollback to restore service, then debug offline. + + Recommended action: Initiate rollback to v2.4.0 now, conduct post-mortem on + v2.4.1 changes after service restoration. + + - id: severity-conflicting-signals-compliance + conversation_id: incident-triage + + outcome: | + Assistant navigates conflicting signals (internal scope vs regulatory risk). + Must recognize that regulatory/compliance issues override normal scope rules.
Single-user + internal + compliance = high severity. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: CFO unable to access SOX compliance dashboard for quarterly filing + Impact: Regulatory filing blocked, potential SEC penalties $500K-$5M for missed deadline + Scope: single-user, internal + Signals: Filing deadline in 4 hours, no manual workaround available, legal escalation + opened, dashboard shows "403 Forbidden" due to expired certificate + + expected_messages: + - role: assistant + content: | + **Regulatory compliance trumps scope considerations** + + Normal triage rules (single-user + internal = low) break down under regulatory pressure: + + Scope analysis: + - Technical: Single user (CFO only) + - Operational: Internal system (no customer impact) + + Risk analysis: + - Financial: $500K-$5M in SEC penalties + - Legal: Regulatory filing deadline in 4 hours + - Reputational: Public disclosure of compliance failure + - Organizational: C-suite blocked, legal escalation active + + **Severity: high** + + Compliance and regulatory issues create an exception to standard severity rubrics. + A single internal user blocked from a compliance-critical system carries greater + business risk than a regional customer-facing outage with workarounds. + + The expired certificate is trivial to fix (minutes), but the compliance deadline + is immovable. This is a "drop everything" scenario despite limited scope. + + - id: severity-gradual-degradation-memory-leak + conversation_id: incident-triage + + outcome: | + Assistant recognizes that a slow-moving incident requires immediate action despite + current functionality appearing normal. Must project future state based on trend. + Pattern: gradual degradation is harder to detect but equally dangerous. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: Order processing service memory usage increasing linearly + Impact: Currently processing normally, 99.99% success rate maintained + Scope: global + Signals: Memory usage 45% → 78% over past 6 hours (linear trend), projected OOM in + 90 minutes, service restarts clear memory but orders in-flight fail, peak shopping + hours starting in 60 minutes + + expected_messages: + - role: assistant + content: | + **Time-critical intervention needed despite current stability** + + Current state looks healthy (99.99% success), but trajectory analysis shows crisis ahead: + + Trend analysis: + - Memory growth: 45% → 78% in 6 hours = 5.5% per hour + - Projected OOM: 90 minutes (30 minutes into peak shopping hours) + - Peak traffic: Begins in 60 minutes (will accelerate memory growth) + + Risk factors: + - Restart = in-flight order failures (customer impact) + - Peak timing: Worst possible moment for service disruption + - Global scope: No failover region available + + **Severity: high** + + This is a "slow-motion incident" - the frog-boiling problem.
Current metrics don't + reflect urgency, but we have a 60-90 minute window to act before guaranteed outage + during peak hours. + + Strategy: Controlled intervention NOW (add capacity, investigate leak) beats emergency + restart during peak traffic. Similar to flight crews addressing gradual cabin pressure + loss before passengers notice symptoms. + + - id: severity-global-maintenance-noisy + conversation_id: incident-triage + + outcome: | + Assistant classifies as 'medium' despite user reports and noisy alerts. + Reasoning must filter signal from noise, linking latency to the active maintenance + window and confirming that core availability is preserved. Must identify red herrings. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: "Search is broken, spinning forever" - reported by 12 internal users and 3 customers + Impact: Search latency avg 4500ms (normal 200ms), zero 500 errors + Scope: global + Signals: + - Active Change Request: CHG-998 (Elasticsearch Re-indexing) - Status: In Progress + - Alert: "High CPU on ES-Nodes" (Silenced/Muted) + - Slack: Support team asking "Is search down?" + - Health Check: /health endpoint returning 200 OK + - Ancillary Alert: "Marketing-Site-Analytics" reporting data lag (unrelated) + + expected_messages: + - role: assistant + content: | + **Noise filtering required - separating expected degradation from real incidents** + + Signal analysis: + - User complaints: 15 reports of "broken" search (alarming language) + - Latency spike: 4500ms (22.5x degradation from 200ms baseline) + - Availability: 100% (zero errors, health check passing) + + Causality chain: + - Active CHG-998 (Elasticsearch Re-indexing) directly explains CPU spike and latency + - High CPU alert was pre-emptively silenced (change team expected this) + - Health endpoint green confirms service logic functioning correctly + + Red herrings identified: + - Marketing analytics lag is unrelated (separate data pipeline) + - Support team panic ("Is search down?") doesn't match technical reality (search is slow, not down) + + **Severity: medium** + + This is a known, planned degradation within an approved maintenance window. While + customer-facing and generating complaints, the service is available and functioning + correctly - just slowly. User reports reflect perception ("broken"), not technical + reality (degraded performance during maintenance). + + Action required: Communication to support team about maintenance window, not emergency + engineering response. Similar to planned database migration scenarios where performance + temporarily degrades but availability is maintained. + + - id: severity-data-integrity-split-brain + conversation_id: incident-triage + + outcome: | + Assistant identifies critical data integrity risk despite green availability dashboards. + Must classify as 'critical' due to silent data corruption and financial record + discrepancies. Reasoning must prioritize data correctness over availability metrics. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact.
- role: user + content: | + Classify the severity of this incident: + + Ticket: Finance reports discrepancies in daily ledger + Impact: "Orders Table" row count does not match "Payments Table" count for last 2 hours, Delta: 150 records + Scope: global (data layer) + Signals: + - API Availability: 100% (All Green) + - Latency: Normal (120ms) + - Database Logs: "Constraint Violation" warnings appearing on Write Replica 3 + - Replication Lag: Spiked to 5000ms on Replica 3 only + + expected_messages: + - role: assistant + content: | + **Critical data integrity failure hidden behind green dashboards** + + Symptom analysis: + - 150-record discrepancy between Orders and Payments tables + - Constraint violations on Write Replica 3 + - Isolated replication lag spike (5000ms) on single replica + + Hidden failure mode - "Split Brain" scenario: + - API reports 100% availability because writes to Primary succeed + - Reads may be hitting lagging/corrupted Replica 3 (depending on load balancer routing) + - Application sees inconsistent state: orders without payments, payments without orders + - Financial reconciliation impossible with desynchronized data + + Dashboard deception: + - Availability metrics (100%) measure "can we process requests?" not "is data correct?" + - Latency normal (120ms) because corrupt replica responds quickly with wrong data + - Traditional monitoring focused on uptime misses silent corruption + + **Severity: critical** + + Data integrity violations are the highest severity class - worse than downtime. The + system is "lying" (reporting green) while permanently losing/corrupting financial records. + 150 mismatched transactions represent real money and regulatory reporting obligations. + + Immediate actions required: + 1. Stop write traffic OR isolate Replica 3 from read pool + 2. Prevent further corruption before investigating root cause + 3. Begin data reconciliation process for affected 2-hour window + + This demonstrates why data correctness must override availability in severity calculations. + An honest outage (returning errors) is preferable to silent corruption (returning wrong data). + - id: severity-regional-api-sunset-migration + conversation_id: incident-triage + + outcome: | + Assistant classifies as 'medium' despite customer complaints and regional impact. + Must recognize this is planned API deprecation with advance notice, not an unplanned outage. + Challenging because customer distress signals conflict with planned migration context. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact.
- role: user + content: | + Classify the severity of this incident: + + Ticket: EU partners reporting "API authentication failed" errors on legacy v1 endpoints + Impact: 15 partners affected, migration to v2 endpoints incomplete + Scope: regional (EU) + Signals: Sunset notice sent 90 days ago (email confirmations logged), v1 deprecation + scheduled for today, v2 endpoints fully operational, migration guide published, + partners acknowledged sunset but delayed migration work + + expected_messages: + - role: assistant + content: | + **Planned deprecation with adequate notice - not an emergency incident** + + Initial assessment might suggest high severity (15 partners impacted, authentication failing), + but context reveals this is a managed sunset: + + Evidence of planned change: + - Sunset notice: 90 days advance warning (industry standard: 30-60 days) + - Partner acknowledgment: Email confirmations on record + - Alternative available: v2 endpoints fully operational + - Documentation: Migration guide published + + Partner responsibility assessment: + - Partners acknowledged sunset timeline + - Partners delayed migration work despite adequate notice + - This is a procurement/project management failure on partner side, not platform failure + + **Severity: medium** + + While 15 partners are affected, this is a planned and communicated change. The impact + is real but expected and avoidable. Severity reflects the need to support migration + (customer success concern) without treating it as an emergency platform incident. + + Response strategy: Accelerate partner migration support, not emergency rollback. Similar + to library deprecations where breaking changes are announced but adoption lags. + + - id: severity-weekend-batch-job-failure + conversation_id: incident-triage + + outcome: | + Assistant classifies as 'low' despite complete job failure and data staleness. + Must recognize that weekend batch jobs for Monday reports have a recovery window. + Challenging because "complete failure" sounds critical, but temporal context matters. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: Weekend sales aggregation batch job failed (Saturday 2AM run) + Impact: Executive dashboard showing Friday's data, Monday reports will be stale + Scope: internal (analytics/reporting only) + Signals: It's Saturday 8AM, business reopens Monday 9AM (49 hours available), + job typically takes 45 minutes to rerun, no customer-facing impact, + failure cause identified (disk space - easily fixable) + + expected_messages: + - role: assistant + content: | + **Temporal context creates recovery window - urgency misaligned with impact** + + Failure assessment: + - Technical severity: Complete batch job failure (sounds critical) + - Data impact: Monday reports will show stale Friday data (concerning) + + Temporal mitigation factors: + - Current time: Saturday 8AM + - Business hours resume: Monday 9AM (49 hours away) + - Job runtime: 45 minutes + - Recovery window: 49 hours to fix a 45-minute job = 65x safety margin + - Root cause: Disk space (fixable in minutes, not hours) + + **Severity: low** + + The complete failure description is misleading without time context. We have ample + recovery time before business impact occurs.
This is "fix Monday morning" not + "page the on-call engineer Saturday." The failure's timing (weekend) transforms + what could be a medium-severity weekday issue into low-severity with comfortable + recovery buffer. + + Appropriate response: Schedule fix for Monday morning standup, not weekend emergency work. + Similar to backup failures discovered Saturday for weekly retention - addressable + within business cycle. + + - id: severity-partial-region-payment-gateway + conversation_id: incident-triage + + outcome: | + Assistant classifies as 'critical' despite only 8% error rate and partial region impact. + Must recognize payment failures = revenue loss, and lack of fallback/retry makes this urgent. + Challenging because percentages seem low, but payment domain context elevates severity. + + input_messages: + - role: system + content: | + You are doing first-pass incident triage. + Think step-by-step and explain your reasoning before providing the severity classification. + Always justify your severity assessment based on scope and business impact. + - role: user + content: | + Classify the severity of this incident: + + Ticket: Payment gateway returning "processor timeout" for some APAC transactions + Impact: 8% of APAC payment attempts failing, customers abandoning carts + Scope: partial regional (APAC, intermittent) + Signals: No automatic retry mechanism, no fallback processor, error rate climbing + from 2% → 8% over last 20 minutes, trend continuing, payments are one-shot + (failed = lost sale), support tickets spiking, Twitter mentions starting + + expected_messages: + - role: assistant + content: | + **Payment failures with cart abandonment = direct revenue hemorrhage** + + Initial numeric assessment appears moderate: + - 8% error rate (92% success seems acceptable?) + - Partial regional scope (not global) + - Intermittent failures (not total outage) + + Payment domain context transforms severity: + - Payment failures are permanent: No retry, no fallback, no second chance + - Failed payment = abandoned cart = lost revenue (not degraded experience) + - Error trend: 2% → 8% in 20 minutes, continuing climb + - Projected trajectory: Could reach 15-20% in next hour + - Customer behavior: Payment failure triggers immediate competitor research + + Architectural vulnerability: + - Single processor (no redundancy) + - No automatic retry logic + - One-shot failure mode + + Brand damage signals: + - Support ticket spike (reactive complaints) + - Twitter mentions (proactive public complaints - brand risk) + + **Severity: critical** + + An 8% payment error rate is catastrophic in e-commerce, even if 92% succeed. Unlike + content delivery (where retry helps) or search (where degradation is tolerable), + payment failures are binary: success = revenue, failure = $0 + angry customer. + + The climbing trend (2% → 8%) indicates systemic issue, not random variance. Immediate + intervention required before error rate reaches double digits and social media + amplifies brand damage. This is "stop the revenue bleeding" severity. \ No newline at end of file diff --git a/docs/examples/simple/evals/psychology/three-layer-listening.yaml b/docs/examples/simple/evals/psychology/three-layer-listening.yaml new file mode 100644 index 0000000..491da71 --- /dev/null +++ b/docs/examples/simple/evals/psychology/three-layer-listening.yaml @@ -0,0 +1,208 @@ +$schema: agentv-eval-v2 +description: >- + Comprehensive Evaluation Suite for Psychology Listening Skills. + TIER 1: Basic capability (Explicit Content & Emotion). 
+ TIER 2: Advanced capability (Implicit Process, Cultural Subtext, & Resistance). +target: default + +evalcases: + # ============================================================================== + # TIER 1: BASIC BENCHMARKS (Explicit Text) + # Objective: Ensure model can identify clearly stated facts and emotions. + # Failure Mode to Watch: "Over-analysis" (Hallucinating subtext where none exists). + # ============================================================================== + + # Case 1: The Overwhelmed Wife (Adversarial: Passive Resignation) + - id: basic-overwhelmed-wife + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Resentment' or 'Despair' in emotion (not just anger). + Must identify 'Testing the relationship' or 'Silent withdrawal' in process. + Must identify 'Imbalance of labor' in content. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement using the 3-Layer Listening method. + + Client Statement: + "I used to fight with him about the chores. I used to scream when he played games while I cooked. Yesterday, I walked right past him. He was on his phone. The trash was overflowing. I didn't say a word. I just cooked dinner, ate with the kids, and went to bed. He didn't even look up. I wonder... if I stopped paying the internet bill, would he notice me then?" + expected_messages: + - role: assistant + content: |- + { + "content": "Client stopped verbally fighting about chores/neglect; observed husband's continued disengagement; considering passive-aggressive action (cutting internet).", + "emotion": "Resentment, Despair, Numbness/Detachment, Cold Anger.", + "process": "Shift from 'Pursuer' (screaming) to 'Distancer' (silence); Testing the relationship's viability; Using withdrawal as a final attempt to gain attention or preparing to detach completely." + } + + # Case 2: The Traditional Mother (Adversarial: Guilt & Reframing) + - id: basic-traditional-mother + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Financial exploitation' framed as 'Family values' in content. + Must identify 'Gaslighting' or 'Guilt-tripping' in process. + Must identify 'Daughter's boundaries' being violated in process. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement using the 3-Layer Listening method. + + Client Statement: + "My daughter is doing so well, she saves so much money. Her brother is just starting out, he's struggling. As a family, we help each other, right? I simply suggested she help him with the down payment. It's an investment for the family's future. But she stopped talking to me! I don't understand. Isn't family about sharing? Why is she being so calculated and cold with her own flesh and blood?" 
+ expected_messages: + - role: assistant + content: |- + { + "content": "Mother asked successful daughter to pay for struggling son's down payment; Daughter cut contact; Mother frames this as 'family help' and daughter as 'cold'.", + "emotion": "Confusion (feigned or genuine), Self-righteousness, Victimhood, Indignation.", + "process": "Manipulative Reframing: Defines 'financial exploitation' as 'family values'; Gaslighting: Labels daughter's boundary setting as 'being cold/calculated'; Enmeshment: Fails to see daughter as a separate entity with own rights." + } + + # Case 3: The Returning Mother (Explicit Distress) + - id: basic-returning-mother + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Motherhood to Workforce' in content. + Must identify 'Internalizing rejection as worthlessness' in process. + Must identify 'Validation and small wins' in process. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement using the 3-Layer Listening method. + + Client Statement: + "I have a 3-year-old daughter who started kindergarten. I used to watch the child at home and felt useless, so I wanted to find a job. My husband supports me, but I sent many resumes with no response. I finally got an interview but didn't pass. I can't find a suitable job. I doubt myself—am I unable to do anything? I am in pain, unhappy every day, and don't know what to do." + expected_messages: + - role: assistant + content: |- + { + "content": "Transitioning from full-time Motherhood to Workforce; loss of previous role clarity.", + "emotion": "Pain, Unhappiness, Self-doubt, Anxiety, Confusion.", + "process": "Overgeneralization and Internalization: Views job rejection as proof of total personal incompetence ('unable to do anything'). Immediate need: Normalization of job market difficulties; Validation of her worth outside of employment; Establishing small, achievable goals." + } + + # ============================================================================== + # TIER 2: ADVANCED BENCHMARKS (Implicit/Process) + # Objective: Test "Reasoning Gap." Model must infer unstated dynamics (Somatization, Triangulation, Resistance). + # Failure Mode to Watch: "Literalism" (Accepting face-value statements that contradict the subtext). + # ============================================================================== + + # Case 4: The Alexithymic Executive (Adversarial: Denial of Distress) + # Challenge: Identify "Alexithymia" where the client denies emotion but reports symptoms. + - id: adv-somatic-mask-worker + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Alexithymia' or 'Emotional detachment' in process. + Must identify 'Throat tightness' as 'Somatic Displacement' in process. + Must identify 'Denial of stress' in content. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement using the 3-Layer Listening method. + + Client Statement: + "My life is perfect. Work is booming, just got a promotion. I'm thriving on the pressure. 
I just need you to recommend a specialist for this weird tightness in my throat. It's like a lump that won't go away. My wife keeps asking if I'm 'anxious', which is ridiculous. I don't do anxiety. I just have a throat issue." + expected_messages: + - role: assistant + content: |- + { + "content": "Reports 'perfect life' and career success; Denies anxiety/stress explicitly; Seeks medical fix for 'throat lump' (Globus sensation).", + "emotion": "Disconnected/Numb (Alexithymia), Defensive pride, Underlying panic (displaced to throat).", + "process": "Alexithymia: Inability to identify/verbalize feelings. Somatic Displacement: Stress is converted into physical symptoms (throat tightness) because emotional expression is viewed as 'ridiculous' or weak." + } + + # Case 5: The Benevolent Gatekeeper (Adversarial: Dysfunctional Helping) + # Challenge: Identify that "Helping communicate" is actually "Preventing intimacy". + - id: adv-sandwich-husband + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Gatekeeping' or 'Blocking direct communication' in process. + Must identify 'Distortion of reality' (translating criticism as care) in content. + Must identify 'Fear of conflict' in emotion. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement using the 3-Layer Listening method. + + Client Statement: + "I'm the bridge in this family. When Mom says my wife's cooking is 'bland', I tell my wife that Mom is just worried about her salt intake. When my wife rolls her eyes at Mom's stories, I tell Mom she's just tired from work. I work so hard to translate for them so they don't get hurt. Why do I feel so exhausted if I'm doing such a good thing?" + expected_messages: + - role: assistant + content: |- + { + "content": "Client actively 'translates' (distorts) critical messages between wife and mother to prevent conflict; Reports exhaustion.", + "emotion": "Exhaustion, Anxiety, Fear of conflict, False pride in being the 'bridge'.", + "process": "Gatekeeping/Triangulation: By 'translating', the client prevents direct contact and authentic relationship (even conflict) between the women. He maintains the dysfunction by absorbing the tension, preventing resolution." + } + + # Case 6: The Intellectualizer (Adversarial: Rationalization) + # Challenge: Identify "Intellectualization" where logic is used to avoid feeling/acting. + - id: adv-face-saving-resistance + outcome: >- + ZERO SCORE if output is not valid JSON with keys 'content', 'emotion', 'process'. + If valid JSON: Must identify 'Intellectualization' or 'Rationalization' in process. + Must identify 'Avoidance of vulnerability' in emotion. + Must identify 'Theoretical agreement vs. Practical avoidance' in content. + execution: + evaluators: + - name: json_format_validator + type: code + script: uv run validate_json_format.py + cwd: ../../evaluators/scripts + - name: content_evaluator + type: llm_judge + input_messages: + - role: user + content: |- + Analyze the following client statement (Context: Therapist suggested expressing anger to a parent). + + Client Statement: + "I've read about this. It's the 'Empty Chair' technique, right? Gestalt therapy. I understand the theoretical mechanism—catharsis reduces cortisol levels. 
It's fascinating how the brain processes repressed anger. I'd love to discuss the efficacy rates of this method compared to CBT before we try it." + expected_messages: + - role: assistant + content: |- + { + "content": "Client discusses the theory/mechanism of the technique ('cortisol', 'Gestalt', 'efficacy rates') instead of doing it.", + "emotion": "Detachment, Superiority (intellectual), Fear of vulnerability/loss of control.", + "process": "Intellectualization: Moving the conversation to the 'head' (theory/data) to avoid the 'heart' (experiencing the anger). Using analysis as a shield against the therapeutic intervention." + } \ No newline at end of file diff --git a/docs/examples/simple/evaluators/scripts/validate_json_format.py b/docs/examples/simple/evaluators/scripts/validate_json_format.py new file mode 100644 index 0000000..026303f --- /dev/null +++ b/docs/examples/simple/evaluators/scripts/validate_json_format.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +""" +JSON Format Validator for AgentV +Validates that the candidate answer is strictly valid JSON with required keys. +Returns score 0.0 if not valid JSON, otherwise passes to next evaluator. +""" + +import json +import sys +from typing import Any + + +def validate_json_format(candidate_answer: str, required_keys: list[str]) -> dict[str, Any]: + """ + Validate that candidate_answer is valid JSON with required keys. + + Args: + candidate_answer: The response to validate + required_keys: List of required top-level keys (e.g., ['content', 'emotion', 'process']) + + Returns: + dict with 'passed', 'score', and 'reasoning' keys + """ + # Try to parse as JSON + try: + parsed = json.loads(candidate_answer.strip()) + except json.JSONDecodeError as e: + return { + "passed": False, + "score": 0.0, + "reasoning": f"Output is not valid JSON. Parse error: {str(e)}" + } + + # Check if it's a dict (object) + if not isinstance(parsed, dict): + return { + "passed": False, + "score": 0.0, + "reasoning": f"Output is valid JSON but not an object/dict. Got: {type(parsed).__name__}" + } + + # Check for required keys + missing_keys = [key for key in required_keys if key not in parsed] + if missing_keys: + return { + "passed": False, + "score": 0.0, + "reasoning": f"Valid JSON but missing required keys: {', '.join(missing_keys)}. 
Has keys: {', '.join(parsed.keys())}" + } + + # All checks passed + return { + "passed": True, + "score": 1.0, + "reasoning": f"Valid JSON with all required keys: {', '.join(required_keys)}" + } + + +def main(): + """Main entry point for AgentV code evaluator.""" + # AgentV passes eval data via stdin as JSON + try: + eval_data = json.load(sys.stdin) + except json.JSONDecodeError as e: + print(json.dumps({ + "passed": False, + "score": 0.0, + "reasoning": f"Failed to parse input JSON: {str(e)}" + })) + sys.exit(1) + + # Extract candidate answer + candidate_answer = eval_data.get("candidate_answer", "") + + # Default required keys for 3-layer listening + required_keys = ["content", "emotion", "process"] + + # Validate + result = validate_json_format(candidate_answer, required_keys) + + # Output result as JSON + print(json.dumps(result, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/docs/examples/simple/optimizers/ace-code-generation.yaml b/docs/examples/simple/optimizers/ace-incident-triage.yaml similarity index 78% rename from docs/examples/simple/optimizers/ace-code-generation.yaml rename to docs/examples/simple/optimizers/ace-incident-triage.yaml index 18e58f9..88e2d9f 100644 --- a/docs/examples/simple/optimizers/ace-code-generation.yaml +++ b/docs/examples/simple/optimizers/ace-incident-triage.yaml @@ -8,22 +8,17 @@ type: ace # Eval files to use for optimization # ACE will run these evals to measure prompt performance and guide improvements +# Use eval files designed with explicit reasoning prompts for best ACE results eval_files: - - ../evals/example-eval.yaml - # - ../evals/code-generation-edge-cases.test.yaml - # - ../evals/code-review-security.test.yaml + - ../evals/incident-triage/incident-severity.yaml # ACE playbook configuration # Defines the optimization strategy and constraints -playbook_path: ./playbooks/code-generation.json +playbook_path: ./playbooks/incident-triage.json # Maximum optimization iterations across ALL eval cases max_epochs: 5 -# Number of analysis rounds per failed eval case -# ACE analyzes failures and suggests prompt improvements -max_reflector_rounds: 3 - # Allow ACE to create new sections in the prompt # When true, ACE can add new instructions/guidelines # When false, ACE can only modify existing content diff --git a/docs/examples/simple/optimizers/playbooks/code-generation.json b/docs/examples/simple/optimizers/playbooks/code-generation.json deleted file mode 100644 index b886aa3..0000000 --- a/docs/examples/simple/optimizers/playbooks/code-generation.json +++ /dev/null @@ -1,167 +0,0 @@ -{ - "version": "1.0", - "created_at": "2025-11-15T10:30:00Z", - "last_updated": "2025-11-15T14:22:00Z", - "optimization_history": { - "epochs_completed": 3, - "total_reflections": 12, - "success_rate_progression": [0.45, 0.67, 0.83] - }, - "sections": { - "core_requirements": { - "description": "Fundamental requirements for code generation", - "bullets": [ - { - "id": "req-001", - "content": "Always include type hints for function parameters and return values using the typing module", - "tags": ["typing", "best-practices"], - "confidence": 0.95, - "added_epoch": 1 - }, - { - "id": "req-002", - "content": "Raise TypeError for invalid input types and ValueError for invalid values", - "tags": ["error-handling", "validation"], - "confidence": 0.92, - "added_epoch": 1 - }, - { - "id": "req-003", - "content": "Include comprehensive docstrings with Args, Returns, and Raises sections", - "tags": ["documentation", "best-practices"], - "confidence": 0.88, - 
"added_epoch": 2 - } - ] - }, - "edge_case_handling": { - "description": "Critical edge cases that must be handled", - "bullets": [ - { - "id": "edge-001", - "content": "Check for empty collections before accessing elements or performing operations", - "tags": ["validation", "edge-cases"], - "confidence": 0.93, - "added_epoch": 1 - }, - { - "id": "edge-002", - "content": "Handle single-element collections that may not have enough data for the operation", - "tags": ["edge-cases"], - "confidence": 0.87, - "added_epoch": 2 - }, - { - "id": "edge-003", - "content": "For finding unique values, use set() to eliminate duplicates before processing", - "tags": ["algorithms", "edge-cases"], - "confidence": 0.91, - "added_epoch": 2 - }, - { - "id": "edge-004", - "content": "Return None or appropriate sentinel value when operation cannot be completed (e.g., finding second largest in single-element list)", - "tags": ["edge-cases", "return-values"], - "confidence": 0.89, - "added_epoch": 3 - } - ] - }, - "code_quality": { - "description": "Code quality and maintainability guidelines", - "bullets": [ - { - "id": "qual-001", - "content": "Use isinstance() for type checking rather than type() comparison", - "tags": ["validation", "pythonic"], - "confidence": 0.94, - "added_epoch": 1 - }, - { - "id": "qual-002", - "content": "Validate all elements in a collection using generator expressions with all() or any()", - "tags": ["validation", "pythonic"], - "confidence": 0.86, - "added_epoch": 2 - }, - { - "id": "qual-003", - "content": "Prefer explicit variable names over abbreviations (e.g., 'unique_numbers' not 'uniq_nums')", - "tags": ["readability"], - "confidence": 0.82, - "added_epoch": 3 - } - ] - }, - "common_pitfalls": { - "description": "Mistakes to avoid based on past failures", - "bullets": [ - { - "id": "pit-001", - "content": "Never assume list indices are valid without checking bounds or catching IndexError", - "tags": ["safety", "edge-cases"], - "confidence": 0.96, - "added_epoch": 1, - "reflection_summary": "Failed case involved accessing items[i] in loop with wrong boundary condition" - }, - { - "id": "pit-002", - "content": "When sorting is needed, create a sorted copy using sorted() rather than mutating input with .sort()", - "tags": ["immutability", "best-practices"], - "confidence": 0.84, - "added_epoch": 2, - "reflection_summary": "User expected original list to remain unchanged" - }, - { - "id": "pit-003", - "content": "For numeric operations, explicitly check for the correct numeric type (int, float) before processing", - "tags": ["validation", "type-checking"], - "confidence": 0.88, - "added_epoch": 3, - "reflection_summary": "Function failed when list contained numeric strings instead of integers" - } - ] - }, - "performance_considerations": { - "description": "Performance optimizations learned through evaluation", - "bullets": [ - { - "id": "perf-001", - "content": "Use set operations for deduplication - O(n) instead of nested loops O(n²)", - "tags": ["performance", "algorithms"], - "confidence": 0.79, - "added_epoch": 3 - }, - { - "id": "perf-002", - "content": "For finding kth largest/smallest, consider using heapq.nlargest/nsmallest for better performance on large datasets", - "tags": ["performance", "algorithms", "advanced"], - "confidence": 0.75, - "added_epoch": 3, - "note": "Applicable for lists > 100 elements" - } - ] - } - }, - "metadata": { - "total_bullets": 18, - "sections_count": 5, - "avg_confidence": 0.87, - "tags_frequency": { - "edge-cases": 6, - "validation": 5, - 
"best-practices": 4, - "type-checking": 3, - "pythonic": 2, - "performance": 2, - "algorithms": 3, - "error-handling": 1, - "documentation": 1, - "readability": 1, - "safety": 1, - "immutability": 1, - "return-values": 1, - "advanced": 1 - } - } -} diff --git a/docs/examples/simple/optimizers/playbooks/incident-triage.json b/docs/examples/simple/optimizers/playbooks/incident-triage.json new file mode 100644 index 0000000..5d05299 --- /dev/null +++ b/docs/examples/simple/optimizers/playbooks/incident-triage.json @@ -0,0 +1,236 @@ +{ + "version": "1.0", + "created_at": "2025-11-29T08:00:00Z", + "last_updated": "2025-11-29T08:00:00Z", + "optimization_history": { + "epochs_completed": 0, + "total_reflections": 0, + "success_rate_progression": [] + }, + "sections": { + "severity_classification_principles": { + "description": "Core principles for incident severity assessment", + "bullets": [ + { + "id": "sev-001", + "content": "Always think step-by-step and explain reasoning before providing severity classification", + "tags": ["reasoning", "process"], + "confidence": 1.0, + "added_epoch": 0 + }, + { + "id": "sev-002", + "content": "Justify severity assessment based on scope, business impact, and customer exposure", + "tags": ["assessment", "business-impact"], + "confidence": 1.0, + "added_epoch": 0 + }, + { + "id": "sev-003", + "content": "Distinguish between unplanned outages and planned maintenance windows when assessing severity", + "tags": ["planned-vs-unplanned", "context"], + "confidence": 0.95, + "added_epoch": 0 + } + ] + }, + "scope_analysis": { + "description": "How to analyze incident scope and its impact on severity", + "bullets": [ + { + "id": "scope-001", + "content": "Internal-only scope with no customer exposure typically reduces severity to low, even when functionality is down", + "tags": ["scope", "internal"], + "confidence": 0.92, + "added_epoch": 0 + }, + { + "id": "scope-002", + "content": "Global scope typically indicates high severity, but planned maintenance or acceptable degradation can reduce it to medium", + "tags": ["scope", "global", "context"], + "confidence": 0.90, + "added_epoch": 0 + }, + { + "id": "scope-003", + "content": "Regulatory and compliance issues create exceptions to standard severity rubrics - override normal scope considerations", + "tags": ["compliance", "regulatory", "exception"], + "confidence": 0.93, + "added_epoch": 0 + } + ] + }, + "contextual_factors": { + "description": "Critical contextual factors that modify severity assessment", + "bullets": [ + { + "id": "ctx-001", + "content": "Business timing context (Black Friday, peak hours) can elevate identical technical issues from medium to critical", + "tags": ["timing", "business-context", "revenue"], + "confidence": 0.94, + "added_epoch": 0 + }, + { + "id": "ctx-002", + "content": "Deployment correlation (error spike within minutes of deployment) indicates high-confidence rollback path and justifies higher severity", + "tags": ["deployment", "correlation", "root-cause"], + "confidence": 0.91, + "added_epoch": 0 + }, + { + "id": "ctx-003", + "content": "Gradual degradation requires projecting future state based on trend - act before crisis even if current metrics look healthy", + "tags": ["trends", "prediction", "proactive"], + "confidence": 0.88, + "added_epoch": 0 + }, + { + "id": "ctx-004", + "content": "Recovery window (time until business impact) reduces urgency - weekend batch job failures with Monday deadline are low severity", + "tags": ["timing", "recovery-window"], + "confidence": 
0.85, + "added_epoch": 0 + } + ] + }, + "signal_filtering": { + "description": "How to separate signal from noise in incident reports", + "bullets": [ + { + "id": "sig-001", + "content": "Filter user perception ('broken') from technical reality (slow but functional) - distinguish degradation from outage", + "tags": ["noise-filtering", "perception-vs-reality"], + "confidence": 0.89, + "added_epoch": 0 + }, + { + "id": "sig-002", + "content": "Identify red herrings - unrelated alerts or symptoms that don't share root cause with primary incident", + "tags": ["noise-filtering", "causality"], + "confidence": 0.87, + "added_epoch": 0 + }, + { + "id": "sig-003", + "content": "Link symptoms to active maintenance windows or change requests - expected degradation is not an emergency", + "tags": ["maintenance", "planned-change"], + "confidence": 0.90, + "added_epoch": 0 + }, + { + "id": "sig-004", + "content": "Green availability dashboards can hide critical data integrity failures - prioritize data correctness over uptime metrics", + "tags": ["data-integrity", "hidden-failures"], + "confidence": 0.92, + "added_epoch": 0 + } + ] + }, + "domain_specific_patterns": { + "description": "Domain-specific severity escalation patterns", + "bullets": [ + { + "id": "dom-001", + "content": "Payment failures are permanent (no retry, no fallback) - even low error rates (8%) are critical due to revenue hemorrhage", + "tags": ["payments", "revenue", "permanence"], + "confidence": 0.95, + "added_epoch": 0 + }, + { + "id": "dom-002", + "content": "Data integrity violations are worse than downtime - silent corruption returning wrong data is more dangerous than honest errors", + "tags": ["data-integrity", "corruption"], + "confidence": 0.96, + "added_epoch": 0 + }, + { + "id": "dom-003", + "content": "Planned API deprecation with adequate notice (90 days) is medium severity despite customer complaints - partner responsibility issue", + "tags": ["deprecation", "sunset", "partner-responsibility"], + "confidence": 0.84, + "added_epoch": 0 + } + ] + }, + "common_pitfalls": { + "description": "Mistakes to avoid in severity assessment", + "bullets": [ + { + "id": "pit-001", + "content": "Don't let alarming language in user reports ('broken', 'down') override technical analysis of actual functionality", + "tags": ["perception", "analysis"], + "confidence": 0.88, + "added_epoch": 0 + }, + { + "id": "pit-002", + "content": "Don't assume low percentages (8% error rate) are acceptable without domain context - in payments, this is catastrophic", + "tags": ["percentages", "domain-context"], + "confidence": 0.90, + "added_epoch": 0 + }, + { + "id": "pit-003", + "content": "Don't ignore climbing error trends (2% → 8%) - trajectory matters as much as current state", + "tags": ["trends", "trajectory"], + "confidence": 0.87, + "added_epoch": 0 + }, + { + "id": "pit-004", + "content": "Don't confuse single-user scope with low severity when regulatory/compliance issues are involved", + "tags": ["compliance", "scope-exception"], + "confidence": 0.91, + "added_epoch": 0 + } + ] + } + }, + "metadata": { + "total_bullets": 21, + "sections_count": 6, + "avg_confidence": 0.91, + "tags_frequency": { + "scope": 2, + "context": 2, + "timing": 2, + "business-context": 1, + "revenue": 2, + "compliance": 2, + "regulatory": 1, + "deployment": 1, + "correlation": 1, + "root-cause": 1, + "trends": 2, + "prediction": 1, + "proactive": 1, + "recovery-window": 1, + "noise-filtering": 2, + "perception-vs-reality": 1, + "causality": 1, + "maintenance": 1,
"planned-change": 1, + "data-integrity": 2, + "hidden-failures": 1, + "payments": 1, + "permanence": 1, + "corruption": 1, + "deprecation": 1, + "sunset": 1, + "partner-responsibility": 1, + "perception": 1, + "analysis": 1, + "domain-context": 1, + "trajectory": 1, + "scope-exception": 1, + "reasoning": 1, + "process": 1, + "assessment": 1, + "business-impact": 1, + "planned-vs-unplanned": 1, + "internal": 1, + "global": 1, + "exception": 1 + } + } +} diff --git a/docs/openspec/changes/implement-ace-optimizer/design.md b/docs/openspec/changes/implement-ace-optimizer/design.md new file mode 100644 index 0000000..1330e6b --- /dev/null +++ b/docs/openspec/changes/implement-ace-optimizer/design.md @@ -0,0 +1,116 @@ +# Design: ACE Optimizer Integration + +## Architecture + +The optimization feature will be implemented as a new module in `@agentv/core` and exposed via the CLI. + +### 1. Configuration Schema +We will introduce a new configuration type for Optimizers. +File extension: `.yaml` (e.g., `optimizers/ace-code-generation.yaml`). + +```yaml +description: string +type: "ace" +eval_files: string[] # Paths to eval files +playbook_path: string # Path to save/load the playbook +max_epochs: number # Number of optimization iterations +allow_dynamic_sections: boolean # Allow ACE to create new playbook sections +``` + +**Metric Scoring**: The initial version supports a single numeric metric in the 0-1 range. This can be a simple pass/fail (0 or 1) or a weighted composite score (e.g., 0.5 for correct output + 0.15 for good reasoning). The metric is computed by evaluating the optimized prompt against the test cases in `eval_files`. + +**Playbook Structure**: Playbooks are lightweight JSON files following the Ax ACE specification: +```json +{ + "version": 1, + "sections": { + "Core Principles": [ + { + "id": "core-prin-a3f2b891", + "section": "Core Principles", + "content": "Always validate input types", + "helpfulCount": 0, + "harmfulCount": 0, + "createdAt": "2025-01-15T10:30:00.000Z", + "updatedAt": "2025-01-15T10:30:00.000Z", + "tags": ["validation"] + } + ], + "Edge Cases": [ + { + "id": "edge-cas-7d4e1c92", + "section": "Edge Cases", + "content": "Handle null/undefined gracefully", + "helpfulCount": 2, + "harmfulCount": 0, + "createdAt": "2025-01-15T10:32:00.000Z", + "updatedAt": "2025-01-15T11:45:00.000Z" + } + ] + }, + "stats": { + "bulletCount": 2, + "helpfulCount": 2, + "harmfulCount": 0, + "tokenEstimate": 45 + }, + "updatedAt": "2025-01-15T11:45:00.000Z", + "description": "Learned optimization insights" +} +``` + +Bullet IDs are auto-generated by Ax in the format `{section-prefix}-{randomHex}` (e.g., `core-prin-a3f2b891`). The playbook tracks feedback counters (`helpfulCount`, `harmfulCount`) used for pruning low-value bullets when sections grow too large. + +### 2. Core Implementation (`packages/core`) + +**New Module: `optimization`** +- `OptimizerConfig`: Zod schema for the configuration. +- `Optimizer`: Interface for optimization strategies. +- `AceOptimizer`: Implementation using `@ax-llm/ax`. + +**Integration with Evaluation** +- The `AceOptimizer` needs to run evaluations to measure performance. +- We will reuse the existing `EvaluationEngine` (or equivalent) to run the test cases defined in `eval_files`. +- The optimizer will iterate: + 1. Ax generates a candidate (or updates the playbook). + 2. AgentV runs the evals using the candidate. + 3. AgentV computes a single numeric score (0-1 range) per eval case. This can be a simple binary score or a weighted composite. + 4. 
+ +**Generator Output Requirements** +- ACE's Reflector requires access to the generator's reasoning trace to analyze which playbook bullets helped or hurt. +- AgentV must capture a `thought` or `reasoning` field from the generator output (if present). +- Implementation approach: + - Extend `EvaluationResult` type to include optional `reasoning?: string` field (already exists in current schema). + - Extract reasoning from the candidate answer if it contains a structured `thought` field. + - Pass this reasoning to ACE as `generator_reasoning` input for the Reflector. +- For eval cases where the LLM naturally outputs reasoning (e.g., Chain-of-Thought prompts), this will be captured automatically. +- For cases without explicit reasoning output, the Reflector will work with just the answer and feedback signals. + +### 3. CLI Implementation (`apps/cli`) + +**New Command: `optimize`** +- Usage: `agentv optimize <config-path>` +- Responsibilities: + - Load and validate the optimizer config. + - Resolve paths (eval files, playbook). + - Instantiate the optimizer. + - Run the optimization loop. + - Report progress to the console. + +### 4. Ax Integration +- We will add `@ax-llm/ax` as a dependency to `@agentv/core`. +- We will map AgentV's `EvalResult` to the feedback format expected by Ax's ACE. + +## Data Flow + +1. **User** runs `agentv optimize my-config.yaml`. +2. **CLI** reads `my-config.yaml`. +3. **CLI** initializes `AceOptimizer` with the config. +4. **AceOptimizer** loads the `playbook` (if it exists). +5. **AceOptimizer** starts the optimization loop. +6. **AceOptimizer** requests evaluation of the current state. +7. **EvaluationRunner** (from core) runs the specified `eval_files`. +8. **EvaluationRunner** returns metrics (e.g., correctness score). +9. **AceOptimizer** updates the `playbook` based on metrics and reflection. +10. **AceOptimizer** saves the `playbook` to disk. diff --git a/docs/openspec/changes/implement-ace-optimizer/proposal.md b/docs/openspec/changes/implement-ace-optimizer/proposal.md new file mode 100644 index 0000000..18f673e --- /dev/null +++ b/docs/openspec/changes/implement-ace-optimizer/proposal.md @@ -0,0 +1,20 @@ +# Implement ACE Optimizer + +## Summary +Implement automatic prompt optimization in AgentV using the Ax library's ACE (Agentic Context Engineering) algorithm. This allows users to automatically improve their prompts based on evaluation results defined in AgentV eval files. + +## Problem +Currently, AgentV provides evaluation capabilities, allowing users to score their agents/prompts against test cases. However, improving the prompts based on these scores is a manual and iterative process. Users have to manually analyze failures, tweak the prompt, and re-run evals. + +## Solution +Integrate the Ax library's ACE optimization algorithm into AgentV. +- Introduce a new `optimize` command in the CLI. +- Define a YAML configuration schema for optimizers (minimal viable set: type, eval_files, playbook_path, max_epochs, allow_dynamic_sections). +- Implement an optimization engine in `@agentv/core` that bridges AgentV's evaluation system with Ax's optimization loop. +- Support the generation and updating of "playbooks" (lightweight JSON files containing structured optimization insights as tagged bullets organized into sections). +- Initial version supports a single numeric objective (0-1 range), which can be a weighted combination of existing eval metrics (see the sketch below).
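+
+For illustration, a minimal sketch of how such a composite might be computed per eval case. The result shape, field names, and weights are hypothetical, not the final `@agentv/core` API; the only hard requirement is that the final value stays in the 0-1 range:
+
+```typescript
+// Hypothetical per-case result; fields are illustrative only.
+interface EvalCaseResult {
+  passed: boolean; // hard pass/fail from code evaluators
+  judgeScore: number; // 0-1 content score from an LLM judge
+  reasoningQuality?: number; // optional 0-1 score for the reasoning trace
+}
+
+// Collapse multiple signals into the single 0-1 objective the optimizer consumes.
+// Example weights sum to 1 so the composite stays in range without clamping.
+function compositeScore(r: EvalCaseResult): number {
+  const correctness = r.passed ? 0.5 : 0;
+  const content = 0.35 * r.judgeScore;
+  const reasoning = 0.15 * (r.reasoningQuality ?? 0);
+  return correctness + content + reasoning;
+}
+
+// A passing case with a strong judge score and decent reasoning → 0.92
+console.log(compositeScore({ passed: true, judgeScore: 0.9, reasoningQuality: 0.7 }));
+```
+
+The pass/fail gate in this sketch means an incorrect answer cannot exceed 0.5 on content and reasoning quality alone.
+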
## Risks +- **Dependency Complexity**: Integrating `@ax-llm/ax` might introduce complex dependencies or version conflicts. +- **Performance**: Optimization loops can be slow and costly (LLM calls). We need to ensure the user has control over costs (e.g., max epochs). +- **Integration**: Mapping AgentV's evaluation results, which can be complex multi-objective scores, onto the single scalar objective the optimizer consumes needs careful design (though ACE can also use richer feedback). diff --git a/docs/openspec/changes/implement-ace-optimizer/specs/ace-optimization/spec.md b/docs/openspec/changes/implement-ace-optimizer/specs/ace-optimization/spec.md new file mode 100644 index 0000000..9c301f7 --- /dev/null +++ b/docs/openspec/changes/implement-ace-optimizer/specs/ace-optimization/spec.md @@ -0,0 +1,50 @@ +# Spec: ACE Optimization + +## ADDED Requirements + +### CLI Command: `optimize` + +The CLI shall support an `optimize` command to trigger the optimization process. + +#### Scenario: Run optimization +Given a valid optimizer configuration file `ace-config.yaml` +When I run `agentv optimize ace-config.yaml` +Then the optimization process starts +And progress is displayed in the terminal +And the playbook is updated upon completion + +### Configuration: ACE Type + +The system shall support parsing and validating optimizer configurations with `type: ace`. + +#### Scenario: Valid configuration +Given a YAML file with: +```yaml +type: ace +eval_files: ["./evals/test.yaml"] +playbook_path: "./playbooks/code.json" +max_epochs: 5 +``` +When the configuration is loaded +Then it is recognized as an ACE optimizer configuration +And all fields are correctly parsed + +### Execution: Run Evals + +The optimizer shall execute the specified evaluation files to measure performance. + +#### Scenario: Evaluation during optimization +Given an optimizer running an epoch +When it needs to evaluate the current performance +Then it runs the test cases defined in `eval_files` +And aggregates the results into a score + +### Output: Playbook + +The optimizer shall save the learned optimization data to the specified `playbook_path`. + +#### Scenario: Save playbook +Given an optimization run completes successfully +When the process finishes +Then a JSON file is created or updated at `playbook_path` +And it contains the learned playbook data diff --git a/docs/openspec/changes/implement-ace-optimizer/tasks.md b/docs/openspec/changes/implement-ace-optimizer/tasks.md new file mode 100644 index 0000000..795d56d --- /dev/null +++ b/docs/openspec/changes/implement-ace-optimizer/tasks.md @@ -0,0 +1,10 @@ +# Tasks: Implement ACE Optimizer + +- [ ] Add `@ax-llm/ax` dependency to `@agentv/core` +- [ ] Define `OptimizerConfig` Zod schema in `@agentv/core` (see the sketch after this list) +- [ ] Create `Optimizer` interface in `@agentv/core` +- [ ] Implement `AceOptimizer` class in `@agentv/core` using Ax +- [ ] Implement `optimize` command in `@agentv/cli` +- [ ] Integrate evaluation runner with `AceOptimizer` +- [ ] Add unit tests for `AceOptimizer` configuration parsing +- [ ] Add integration test for `optimize` command (mocking Ax if necessary)
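+
+As a concrete starting point for the `OptimizerConfig` task above, here is a minimal sketch using the fields from design.md. Field names mirror the optimizer YAML; the optionality and defaults are assumptions, not the final schema:
+
+```typescript
+import { z } from "zod";
+
+// Sketch only: mirrors the optimizer YAML described in design.md.
+export const OptimizerConfigSchema = z.object({
+  description: z.string().optional(),
+  type: z.literal("ace"),
+  eval_files: z.array(z.string()).min(1),
+  playbook_path: z.string(),
+  max_epochs: z.number().int().positive(), // assumed required; no default specified yet
+  allow_dynamic_sections: z.boolean().default(false), // assumed default
+});
+
+export type OptimizerConfig = z.infer<typeof OptimizerConfigSchema>;
+
+// Example: validate a parsed optimizer YAML before running optimization.
+const config: OptimizerConfig = OptimizerConfigSchema.parse({
+  type: "ace",
+  eval_files: ["../evals/incident-triage/incident-severity.yaml"],
+  playbook_path: "./playbooks/incident-triage.json",
+  max_epochs: 5,
+});
+console.log(config.allow_dynamic_sections); // → false (filled in by the default)
+```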