Constantly Improving Image Models Need Constantly Improving Benchmarks

Project Page | Paper | Data

This repository contains the code accompanying the paper:

Constantly Improving Image Models Need Constantly Improving Benchmarks
Jiaxin Ge*, Grace Luo*, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, and David M. Chan
UC Berkeley

For any questions or inquiries, please contact us at echo-bench@googlegroups.com.

About the Dataset

ECHO stands for Extracting Community Hatched Observations. ECHO is a framework for constructing benchmarks directly from social media posts, which showcase novel prompts and qualitative user judgements. As a case study, we apply ECHO to discussion of GPT-4o Image Gen on Twitter/X. Below we describe the data provided in this initial release.

The dataset is hosted in the Hugging Face repo echo-bench/echo2025 and contains the following splits:

Split	Size	Description
`analysis`	29.3k	Moderate-quality data suitable for large-scale analysis.
`text_to_image`	848	High-quality data with prompt-only inputs for benchmarking.
`image_to_image`	710	High-quality data with prompt and image inputs for benchmarking.

Setup

conda env create -f environment.yaml
conda activate echo

Quickstart

Load the academic version of the dataset:

from datasets import load_dataset

ds = load_dataset(
    "echo-bench/echo2025",
    name="text_to_image",  # one of: "analysis", "text_to_image", "image_to_image"
    split="test",
)

See echo_bench/load_dataset.ipynb for a more in-depth walkthrough of loading the dataset.

Evaluation

All evaluation-related code is located in the eval folder.

AutoEval

The folder eval/autoeval contains code for evaluating model outputs with an ensemble of VLM-as-a-judge models. We follow the "single answer grading" setup from MT-Bench: each output is assigned a score, the scores are converted into pseudo pairwise comparisons, and per-model win rates are computed from those comparisons.
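The score-to-win-rate step can be illustrated with a minimal sketch. This is not the code in eval/autoeval: the function name `win_rates`, the input layout (one already-aggregated score per model and sample), and the choice to count a tie as half a win for each side are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def win_rates(scores):
    """Turn per-sample scores into per-model win rates.

    scores: {model_name: {sample_id: score}} (hypothetical layout).
    Every pair of models is compared on the samples they share.
    Assumption: a tie counts as half a win for each model.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b in combinations(sorted(scores), 2):
        shared = scores[model_a].keys() & scores[model_b].keys()
        for sample_id in shared:
            score_a = scores[model_a][sample_id]
            score_b = scores[model_b][sample_id]
            if score_a > score_b:
                wins[model_a] += 1.0
            elif score_b > score_a:
                wins[model_b] += 1.0
            else:
                wins[model_a] += 0.5
                wins[model_b] += 0.5
            games[model_a] += 1
            games[model_b] += 1
    return {m: (wins[m] / games[m] if games[m] else 0.0) for m in scores}


example_scores = {
    "A": {"s1": 8, "s2": 6},
    "B": {"s1": 7, "s2": 6},
    "C": {"s1": 5, "s2": 9},
}
print(win_rates(example_scores))  # {'A': 0.625, 'B': 0.375, 'C': 0.5}
```

Because every pseudo comparison hands out exactly one win in total, the win rates average to 0.5 across models when all models are scored on the same samples.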

Community-Driven Metrics

The folder eval/community_driven_metrics contains implementations of four metrics derived from community feedback. Each metric lives in its own subfolder:

color_shift: Computes color shift magnitude as the average difference between the color histograms of the input and output images.
face_identity: Computes face embedding similarity using AuraFace.
spatial_preservation: Computes structure distance using the Frobenius norm of Gram matrices derived from DINO features, implemented using the code from Splice.
text_rendering: Computes text rendering accuracy using VLM-as-a-judge.
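Two of the metrics above reduce to short numerical computations, sketched here with NumPy only. This is an illustrative approximation, not the repository's implementation: the function names, the 32-bin per-channel histograms, the mean absolute difference, and the L2 normalization of features before building the Gram matrix are all assumptions. In the real spatial_preservation metric the features come from DINO via the Splice code; the sketch accepts any (num_tokens, dim) array in their place.

```python
import numpy as np


def color_histogram(image, bins=32):
    """Per-channel histograms of an HxWx3 uint8 image, each normalized to sum to 1."""
    hists = []
    for channel in range(3):
        counts, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        hists.append(counts / max(counts.sum(), 1))
    return np.stack(hists)  # shape: (3, bins)


def color_shift(input_image, output_image, bins=32):
    """Mean absolute difference between the two images' color histograms."""
    diff = color_histogram(input_image, bins) - color_histogram(output_image, bins)
    return float(np.abs(diff).mean())


def gram_matrix(features):
    """Gram matrix of (num_tokens, dim) features.

    Assumption: each token is L2-normalized first, so entries are cosine similarities.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    return unit @ unit.T  # shape: (num_tokens, num_tokens)


def structure_distance(input_features, output_features):
    """Frobenius norm of the difference between the two Gram matrices."""
    delta = gram_matrix(input_features) - gram_matrix(output_features)
    return float(np.linalg.norm(delta, ord="fro"))


black = np.zeros((8, 8, 3), dtype=np.uint8)
white = np.full((8, 8, 3), 255, dtype=np.uint8)
print(color_shift(black, black))  # 0.0
print(color_shift(black, white))  # 0.0625
```

A Gram matrix depends only on how tokens relate to one another, so the structure distance ignores any rotation of the feature space and responds only to changes in the pairwise similarity pattern, which is what makes it a plausible proxy for layout preservation.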

The get_metrics_assignment folder contains code for classifying which metrics apply to each sample.

Data Collection Pipeline

Coming soon.

BibTeX

@article{ge2025echo,
  title={Constantly Improving Image Models Need Constantly Improving Benchmarks},
  author={Jiaxin Ge and Grace Luo and Heekyung Lee and Nishant Malpani and Long Lian and XuDong Wang and Aleksander Holynski and Trevor Darrell and Sewon Min and David M. Chan},
  journal={arXiv preprint arXiv:2510.15021},
  year={2025}
}
