AI Agent Benchmark

AI Agent Benchmark is designed to measure the performance of various LLM-backed autonomous agents on common tasks such as information retrieval and file manipulation. The goal is to create a standard set of tasks that can be used to compare agents and help developers improve them. The emphasis is on studying how different planning strategies, task-decomposition prompts, and other factors affect agent performance.

Leaderboard

Agent	LLM	Composite Score
AutoGPT	GPT-3.5	X
AutoGPT	GPT-4	X
babyAGI	GPT-3.5	X
babyAGI	GPT-4	X
AgentGPT	GPT-3.5	X
AgentGPT	GPT-4	X
superAGI	GPT-3.5	X
superAGI	GPT-4	X

Methodology

Each agent + LLM pair is given 5 runs per task. Each task has a prompt, a success condition, and a maximum number of steps to prevent the agent from running forever. Each run is timed, and the number of steps used to complete the task is recorded. If the maximum number of steps is reached, the run is considered a failure.
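The run loop described above could be sketched roughly as follows. This is an illustrative outline, not the benchmark's actual harness: `Task`, `RunResult`, `run_once`, `run_task`, and the `agent_step` callback are all hypothetical names. It assumes an agent can be driven one step at a time and signals completion by returning `True`, and that a run only counts as hitting the step limit when the budget is exhausted without the agent halting.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str
    success_condition: Callable[[], bool]  # hypothetical: inspects the environment after a run
    max_steps: int


@dataclass
class RunResult:
    succeeded: bool
    steps_used: int
    seconds: float
    hit_step_limit: bool


def run_once(agent_step: Callable[[str, int], bool], task: Task) -> RunResult:
    """Drive one run. agent_step(prompt, step_number) returns True when the agent halts."""
    start = time.perf_counter()
    steps = 0
    halted = False
    while steps < task.max_steps:
        steps += 1
        if agent_step(task.prompt, steps):
            halted = True
            break
    elapsed = time.perf_counter() - start
    # Assumption: exhausting the step budget without halting is a failure,
    # regardless of what the environment looks like afterwards.
    succeeded = halted and task.success_condition()
    return RunResult(succeeded, steps, elapsed, hit_step_limit=not halted)


def run_task(agent_step: Callable[[str, int], bool], task: Task, runs: int = 5) -> List[RunResult]:
    """Repeat the task for the given number of runs (5 per the methodology)."""
    return [run_once(agent_step, task) for _ in range(runs)]
```

A real adapter for AutoGPT, babyAGI, and the other agents would replace `agent_step` with whatever interface each agent exposes; the step-by-step callback is only a convenient simplification for the sketch.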

Agent Benchmark's goal is to measure the planning and execution performance of agents, not their raw intelligence or creativity, which are mostly a function of the underlying LLM. For example, a bad test would be to ask an agent to write a poem and score it on how good the poem is. A good test would be to ask an agent to write 3 blog posts of 100 words each and measure how many steps it took to complete the task.

Scoring

1 point for each successful run
0.5 points for halting before max number of steps and successfully completing the task
0 points for each run that failed due to max number of steps reached
0 points for each run that failed due to error
0 points for each run that failed due to incorrect output
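The scoring rules above might translate into code along these lines. The rules do not say how the 1 point and 0.5 point awards combine, so this sketch assumes the 0.5 is a bonus added on top of the 1 point for a successful run that halted before the step limit. That interpretation, along with the names `Outcome`, `score_run`, and `total_score`, is an assumption rather than something the benchmark defines.

```python
from enum import Enum
from typing import Iterable, Tuple


class Outcome(Enum):
    SUCCESS = "success"
    MAX_STEPS = "max_steps_reached"
    ERROR = "error"
    INCORRECT_OUTPUT = "incorrect_output"


SUCCESS_POINTS = 1.0
# Assumption: awarded in addition to SUCCESS_POINTS, not instead of it.
EARLY_HALT_POINTS = 0.5


def score_run(outcome: Outcome, halted_early: bool = False) -> float:
    """Score a single run; every kind of failure is worth 0 points."""
    if outcome is not Outcome.SUCCESS:
        return 0.0
    return SUCCESS_POINTS + (EARLY_HALT_POINTS if halted_early else 0.0)


def total_score(runs: Iterable[Tuple[Outcome, bool]]) -> float:
    """Sum the scores of (outcome, halted_early) pairs, e.g. the 5 runs of one task."""
    return sum(score_run(outcome, halted_early) for outcome, halted_early in runs)
```

If the two awards are meant to be alternatives rather than cumulative, only the return expression in `score_run` needs to change.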

Tasks

File Manipulation

Information retrieval

Googling a term
Searching a term on Wikipedia

Iterative Tasks

Creating 10 files
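A success condition for a task like this one could be checked along the following lines. This is a minimal sketch with a hypothetical helper name, `created_n_files`, and it assumes the agent works in a directory that is empty before the run, so every regular file found afterwards was created by the agent.

```python
from pathlib import Path


def created_n_files(workdir: Path, expected: int = 10) -> bool:
    """Return True if workdir contains exactly `expected` regular files."""
    # Assumption: the directory was empty before the run; subdirectories are ignored.
    files = [p for p in workdir.iterdir() if p.is_file()]
    return len(files) == expected
```

Requiring an exact count rather than "at least 10" is a design choice in this sketch: it also catches agents that loop and keep creating files after the task is done.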

Composite tasks

Looking up a fact on Wikipedia and writing it to a file

Code manipulation

WIP

Index of Autonomous Agents

AutoGPT babyAGI AgentGPT superAGI

romanzubenko/agent-leaderboard

Folders and files

Latest commit

History

Repository files navigation

AI Agent Benchmark

Leaderboard

Methodology

Scoring

Tasks

File Manipulation

Information retrival

Iterative Tasks

Composite tasks

Code manipulation

Index of Autonomous Agents

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages