Benchmark Comparison: MASArena Implementations vs Original Implementations of Multi-Agent Systems

**Description:**
To validate the multi-agent systems implemented in **MASArena**, we need to compare their benchmark results with those reported in the original papers or produced by the original source code. This comparison will show whether our implementations are accurate and help explain any discrepancies between **MASArena** and the original implementations.

The following is the list of multi-agent systems to be compared:
- [ ] AgentVerse
- [ ] ChatEval
- [ ] EvoAgent
- [ ] Jarvis
- [ ] MetaGPT
- [ ] Swarm
- [ ] LLM-Debate
- [ ] MAD
- [ ] EvoMAC
- [ ] ChatDev
- [ ] CAMEL
- [ ] AutoGen
- [ ] AFlow
- [ ] ADAS

**Goals:**
1. Evaluate each system's performance on multiple benchmarks within **MASArena**.
2. Compare the results of **MASArena** implementations with those from the original implementations (papers or source code).
3. Analyze potential discrepancies and identify their causes (e.g., implementation details, parameter settings, environment configurations).

**Specific Tasks:**
1. **Benchmark Selection**:
   - Identify a suitable set of benchmarks that comprehensively cover the capabilities of different multi-agent systems.

2. **Data Collection**:
   - Gather performance data from the original papers or source code implementations (e.g., experimental results from papers or benchmark tests from open-source code).
   - Run the same benchmarks in **MASArena** and record the results.

3. **Comparison Analysis**:
   - Compare performance metrics (e.g., accuracy, response time, task success rate) between **MASArena** implementations and the original implementations.
   - Document and analyze significant differences, discussing possible causes (e.g., algorithmic differences, hyperparameter settings, library versions).
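The comparison step could be sketched roughly as follows. This is a minimal illustration, not MASArena code: the `Result` record, the `compare` helper, the 2-point tolerance, and the `SystemA`/`BenchX` names and numbers are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One metric for one system on one benchmark (hypothetical schema)."""
    system: str
    benchmark: str
    metric: str
    reported: float    # value from the original paper or source code
    reproduced: float  # value measured in MASArena


def compare(results, abs_tol=2.0):
    """Return one row per result with the delta and a discrepancy flag.

    abs_tol is an assumed threshold in metric units (e.g. accuracy points);
    a suitable value depends on the benchmark's run-to-run variance.
    """
    rows = []
    for r in results:
        delta = r.reproduced - r.reported
        rows.append({
            "system": r.system,
            "benchmark": r.benchmark,
            "metric": r.metric,
            "delta": delta,
            "flagged": abs(delta) > abs_tol,
        })
    return rows


# Placeholder numbers only; these are not real benchmark results.
example = [
    Result("SystemA", "BenchX", "accuracy", reported=80.0, reproduced=75.5),
    Result("SystemB", "BenchX", "accuracy", reported=60.0, reproduced=61.0),
]
rows = compare(example)
for row in rows:
    status = "CHECK" if row["flagged"] else "ok"
    print(f'{row["system"]:8} {row["benchmark"]:8} {row["delta"]:+.1f} {status}')
```

Flagged rows are the ones that would need a written explanation under the analysis task above.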

**Implementation Considerations:**
- Ensure that **MASArena**'s implementations reproduce the core logic of the original implementations as faithfully as possible.
- If the original implementation of a system is unavailable (e.g., not open-sourced), implement it in **MASArena** based on reasonable assumptions derived from the paper's description.
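
Since discrepancies may come from parameter settings, environment configurations, or library versions, it may help to store a small manifest next to each benchmark run. The sketch below is one possible shape for such a record, using only the Python standard library; the `build_manifest` function, its field names, and the example arguments are assumptions for illustration and not part of MASArena.

```python
import json
import platform
from importlib import metadata


def build_manifest(system, benchmark, params, packages=()):
    """Collect the settings and environment details for one benchmark run.

    `params` holds whatever hyperparameters the run used (hypothetical keys);
    `packages` lists distribution names whose installed versions to record.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "system": system,
        "benchmark": benchmark,
        "params": dict(params),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Placeholder names and values, for illustration only.
manifest = build_manifest(
    "SystemA",
    "BenchX",
    {"temperature": 0.0, "max_rounds": 3},
    packages=["pip"],
)
print(json.dumps(manifest, indent=2))
```

Comparing two manifests side by side is a cheap first check when a reproduced score differs from the reported one.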

Benchmark Comparison: MASArena Implementations vs Original Implementations of Multi-Agent Systems #14

Sub-issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Benchmark Comparison: MASArena Implementations vs Original Implementations of Multi-Agent Systems #14

Description

Sub-issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions