Skip to content

Support Multi-step Agent Tasks in Evaluation Framework (e.g., GAIA's Multi-turn Tasks) #55

@RuishanFang

Description

@RuishanFang

Description:
Currently, our evaluation framework primarily supports single-turn or simple interaction tasks. However, many real-world agent scenarios, such as those in the GAIA benchmark , involve multi-turn dialogues and complex, sequential decision-making processes.

To better support these advanced use cases, we need to enhance the framework to:

  • Support multi-turn task evaluations
  • Track agent behavior across multiple steps
  • Provide metrics for task completion, dialogue flow, and agent performance in multi-step scenarios

Proposed Features:

  • Support for logging and analyzing agent responses across turns
  • Integration with existing evaluation metrics (e.g., accuracy, reward, task success)
  • Example implementations for benchmarks like GAIA

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions