Description:
Currently, our evaluation framework primarily supports single-turn or simple interaction tasks. However, many real-world agent scenarios, such as those in the GAIA benchmark, involve multi-turn dialogues and complex, sequential decision-making processes.
To better support these advanced use cases, we need to enhance the framework to:
- Support multi-turn task evaluations
- Track agent behavior across multiple steps
- Provide metrics for task completion, dialogue flow, and agent performance in multi-step scenarios
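As a rough illustration of the tracking requirement above, a per-turn log could be as simple as a trajectory object that records each step and accumulates reward. This is only a sketch; the names (`Turn`, `Trajectory`, `log_turn`) are hypothetical and not part of the existing framework:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    # One agent step: the input the agent saw and the response it produced.
    user_input: str
    agent_response: str
    reward: float = 0.0

@dataclass
class Trajectory:
    # Full multi-turn episode for a single task.
    task_id: str
    turns: list[Turn] = field(default_factory=list)
    completed: bool = False  # set True when the task is judged solved

    def log_turn(self, user_input: str, agent_response: str, reward: float = 0.0) -> None:
        # Append one step so later analysis can replay the whole dialogue.
        self.turns.append(Turn(user_input, agent_response, reward))

    @property
    def total_reward(self) -> float:
        return sum(t.reward for t in self.turns)
```

Keeping the raw turn sequence (rather than only final answers) is what makes step-level metrics like dialogue flow possible later.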
Proposed Features:
- Support for logging and analyzing agent responses across turns
- Integration with existing evaluation metrics (e.g., accuracy, reward, task success)
- Example implementations for benchmarks like GAIA
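To show how per-task results might feed the existing metrics (accuracy, reward, task success), here is a minimal aggregation sketch. The result-dict schema and the `summarize` helper are assumptions for illustration, not the framework's actual API:

```python
def summarize(results: list[dict]) -> dict:
    # results: one dict per evaluated task, e.g.
    # {"task_id": "gaia-001", "num_turns": 4, "success": True, "reward": 1.0}
    n = len(results)
    return {
        "success_rate": sum(r["success"] for r in results) / n,
        "avg_turns": sum(r["num_turns"] for r in results) / n,
        "avg_reward": sum(r["reward"] for r in results) / n,
    }
```

Averaging turn counts alongside success rate helps distinguish agents that solve tasks efficiently from ones that succeed only after many redundant steps.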