Open
2 of 2 issues completed
Benchmark Comparison: MASArena Implementations vs Original Implementations of Multi-Agent Systems #14
Description:
To evaluate the performance and consistency of the multi-agent systems implemented in MASArena, we need to compare their benchmark results with those reported in the original papers or produced by the original source code. This comparison will help us validate the accuracy of our implementations and analyze any discrepancies between MASArena and the original implementations.
The following is the list of multi-agent systems to be compared:
- AgentVerse
- ChatEval
- EvoAgent
- Jarvis
- MetaGPT
- Swarm
- LLM-Debate
- MAD
- EvoMAC
- ChatDev
- CAMEL
- AutoGen
- AFlow
- ADAS
Goals:
- Evaluate each system's performance on multiple benchmarks within MASArena.
- Compare the results of MASArena implementations with those from the original implementations (papers or source code).
- Analyze potential discrepancies and identify their causes (e.g., implementation details, parameter settings, environment configurations).
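Since parameter settings and environment configurations are among the likely causes of discrepancies, it may help to store them next to every benchmark result. The following is a minimal sketch, not part of MASArena: the function name `capture_run_config`, the parameter keys, and the package names are all hypothetical and chosen only for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def capture_run_config(params, packages):
    """Collect settings that commonly explain score differences between runs."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "params": dict(params),   # e.g. model name, temperature, seed
        "packages": versions,     # installed version per package, or None
    }


# Hypothetical parameters and package names, for illustration only.
config = capture_run_config(
    {"model": "example-model", "temperature": 0.0, "seed": 0},
    ["pip", "definitely-not-installed-package-xyz"],
)
print(json.dumps(config, indent=2))
```

Saving this dictionary alongside each result makes it possible to check later whether two runs actually differed in their hyperparameters or library versions.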
Specific Tasks:
- Benchmark Selection:
  - Identify a suitable set of benchmarks that comprehensively cover the capabilities of different multi-agent systems.
- Data Collection:
  - Gather performance data from the original papers or source code implementations (e.g., experimental results from papers or benchmark tests from open-source code).
  - Run the same benchmarks in MASArena and record the results.
- Comparison Analysis:
  - Compare performance metrics (e.g., accuracy, response time, task success rate) between MASArena implementations and the original implementations.
  - Document and analyze significant differences, discussing possible causes (e.g., algorithmic differences, hyperparameter settings, library versions).
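The comparison step could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not MASArena's actual tooling: the `Comparison` class, the `flag_discrepancies` helper, the 5% relative tolerance, and the system and benchmark names are hypothetical, and the scores are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    system: str
    benchmark: str
    reported: float  # score from the original paper or source code
    measured: float  # score obtained by running the benchmark in MASArena

    @property
    def abs_diff(self) -> float:
        return self.measured - self.reported

    @property
    def rel_diff(self) -> float:
        # Relative difference with respect to the reported score.
        if self.reported == 0:
            return 0.0 if self.measured == 0 else float("inf")
        return self.abs_diff / self.reported


def flag_discrepancies(rows, tolerance=0.05):
    """Return the rows whose relative difference exceeds the tolerance."""
    return [r for r in rows if abs(r.rel_diff) > tolerance]


# Placeholder values only; these are not real benchmark results.
rows = [
    Comparison("ExampleSystemA", "example_benchmark", reported=80.0, measured=79.0),
    Comparison("ExampleSystemB", "example_benchmark", reported=60.0, measured=50.0),
]

for row in flag_discrepancies(rows):
    print(f"{row.system} on {row.benchmark}: {row.abs_diff:+.1f} ({row.rel_diff:+.1%})")
```

A relative tolerance is only one possible choice; for metrics with high run-to-run variance, comparing against the spread over several seeds may be more appropriate.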
Implementation Considerations:
- Ensure that MASArena's implementations reproduce the core logic of the original implementations as faithfully as possible.
- If a system's original implementation is unavailable (e.g., not open-sourced), implement it in MASArena based on reasonable assumptions derived from the paper's description.