
Description
Problem
Black-box agents (e.g., GitHub Copilot, legacy Codex) cannot be optimized directly, yet users want DSPy-like tuning of their outputs (e.g., refining generated code against coding guidelines).
Proposed Solution
Introduce --mode external-wrapper in bbeval opt:
- Workflow: Mock the external agent's output → DSPy post-processor (e.g., a ChainOfThought verifier plus refiner) optimized on the testset (see the post-processor sketch after this list).
- Input: An API mock of the Copilot output (e.g., via a VS Code extension hook), plus the attached guidelines.
- Optimization: Tune the refiner prompt within a small budget (≤5 trials) for tasks like "Fix Copilot's SQL per the rules" (see the optimization sketch below).
- Output: A JSON wrapper script (e.g., a Python callable) plus the optimized refine prompt (see the artifact sketch below).
- Validation: Use code_execution to test the refined outputs (the metric in the optimization sketch below stands in for it).
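
A minimal sketch of the DSPy post-processor this mode would wrap around the agent, assuming DSPy's standard Signature/Module API; the signature fields and the `ExternalWrapper` name are illustrative, not part of bbeval:

```python
import dspy

class VerifyAgainstGuidelines(dspy.Signature):
    """Check the external agent's output against the attached guidelines."""
    guidelines = dspy.InputField(desc="guidelines to enforce")
    agent_output = dspy.InputField(desc="raw output from the black-box agent (e.g., Copilot)")
    violations = dspy.OutputField(desc="guideline violations found, or 'none'")

class RefineOutput(dspy.Signature):
    """Rewrite the agent's output so it satisfies the guidelines."""
    guidelines = dspy.InputField()
    agent_output = dspy.InputField()
    violations = dspy.InputField()
    refined_output = dspy.OutputField(desc="corrected code")

class ExternalWrapper(dspy.Module):
    """Post-processor for a black-box agent: verify, then refine."""
    def __init__(self):
        super().__init__()
        self.verify = dspy.ChainOfThought(VerifyAgainstGuidelines)
        self.refine = dspy.ChainOfThought(RefineOutput)

    def forward(self, guidelines, agent_output):
        check = self.verify(guidelines=guidelines, agent_output=agent_output)
        fix = self.refine(guidelines=guidelines, agent_output=agent_output,
                          violations=check.violations)
        return dspy.Prediction(refined_output=fix.refined_output,
                               violations=check.violations)
```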
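A sketch of the optimization step, assuming a recent DSPy release (top-level `dspy.LM` and `dspy.MIPROv2`); `run_code_execution` is a toy stand-in for bbeval's code_execution check, whose real interface isn't specified here:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # example model choice

def run_code_execution(code: str, guidelines: str) -> bool:
    """Toy stand-in for the code_execution check: here we only assert
    the refined SQL avoids `SELECT *`."""
    return "select *" not in code.lower()

def guideline_metric(example, prediction, trace=None):
    # Score 1.0 when the refined output passes the execution/lint check.
    return float(run_code_execution(prediction.refined_output, example.guidelines))

# Mocked Copilot outputs paired with guidelines; a real testset would be larger.
trainset = [
    dspy.Example(
        guidelines="Use parameterized queries; never SELECT *.",
        agent_output="SELECT * FROM users WHERE id = %s;",
    ).with_inputs("guidelines", "agent_output"),
    dspy.Example(
        guidelines="Use parameterized queries; never SELECT *.",
        agent_output="SELECT * FROM orders WHERE status = 'open';",
    ).with_inputs("guidelines", "agent_output"),
]

# auto="light" keeps the run small; a hard <=5-trial cap could instead be set
# via the optimizer's trial budget (an assumption about MIPROv2's knobs).
optimizer = dspy.MIPROv2(metric=guideline_metric, auto="light")
tuned = optimizer.compile(ExternalWrapper(), trainset=trainset)
```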
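And one plausible shape for the emitted artifact, assuming the tuned module's state (optimized refine prompt, demos) is persisted with DSPy's save/load; the `wrap_copilot` callable is hypothetical:

```python
# Persist the tuned post-processor's state as the JSON artifact the proposal mentions.
tuned.save("copilot_wrapper.json")

def wrap_copilot(agent_output: str, guidelines: str) -> str:
    """Hypothetical Python callable bbeval could emit alongside the JSON:
    reload the tuned wrapper and refine a fresh Copilot output."""
    wrapper = ExternalWrapper()
    wrapper.load("copilot_wrapper.json")
    return wrapper(guidelines=guidelines, agent_output=agent_output).refined_output
```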