para-lost/ECHO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This repository contains the code accompanying the paper:

Constantly Improving Image Models Need Constantly Improving Benchmarks
Jiaxin Ge*, Grace Luo*, Heekyung Lee, Nishant Malpani, Long Lian, XuDong Wang, Aleksander Holynski, Trevor Darrell, Sewon Min, and David M. Chan
UC Berkeley

For any questions or inquiries, please contact us at echo-bench@googlegroups.com.


About the Dataset

ECHO stands for Extracting Community Hatched Observations. ECHO is a framework for constructing benchmarks directly from social media posts, which showcase novel prompts and qualitative user judgements. As a case study, we apply ECHO to discussions of GPT-4o Image Gen on Twitter/X. Below we describe the data provided in this initial release.

We provide the dataset in the following HuggingFace repo: echo-bench/echo2025. The dataset contains the following splits:

Split            Size    Description
analysis         29.3k   Moderate-quality data suitable for large-scale analysis.
text_to_image    848     High-quality data with prompt-only inputs for benchmarking.
image_to_image   710     High-quality data with prompt and image inputs for benchmarking.

Setup

conda env create -f environment.yaml
conda activate echo

Quickstart

Load the academic version of the dataset:

from datasets import load_dataset

ds = load_dataset(
    "echo-bench/echo2025",
    name="text_to_image",  # one of: "analysis", "text_to_image", "image_to_image"
    split="test",
)

See echo_bench/load_dataset.ipynb for a more in-depth walkthrough of loading the dataset.

Evaluation

All evaluation-related code is located in the eval folder.

AutoEval

The folder eval/autoeval contains code for evaluating model outputs with an ensemble of VLM judges. We follow the "single answer grading" setup from MT-Bench: each judge assigns a score to each output, the scores are converted into pseudo pairwise comparisons, and per-model win rates are computed from those comparisons.
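
As a minimal sketch of this score-to-win-rate conversion: the function below turns per-model judge scores into pseudo pairwise comparisons. The function name, data layout, and tie handling (ties are simply skipped) are illustrative assumptions, not the repository's actual API.

from itertools import combinations

def pairwise_win_rates(scores):
    # scores: dict mapping model name -> list of judge scores, where index i
    # refers to the same prompt for every model (an assumed data layout).
    wins = {model: 0 for model in scores}
    comparisons = {model: 0 for model in scores}
    for model_a, model_b in combinations(scores, 2):
        for score_a, score_b in zip(scores[model_a], scores[model_b]):
            if score_a == score_b:  # ties count for neither model in this sketch
                continue
            wins[model_a if score_a > score_b else model_b] += 1
            comparisons[model_a] += 1
            comparisons[model_b] += 1
    return {m: wins[m] / comparisons[m] if comparisons[m] else 0.0 for m in scores}

# Example: three models judged on the same four prompts.
print(pairwise_win_rates({
    "model_a": [7, 8, 6, 9],
    "model_b": [6, 8, 7, 5],
    "model_c": [5, 6, 6, 7],
}))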

Community-Driven Metrics

The folder eval/community_driven_metrics contains implementations of four metrics derived from community feedback, one per subfolder:

  • color_shift: Computes color shift magnitude as the average difference between the color histograms of the input and output images (see the sketch after this list).
  • face_identity: Computes face embedding similarity using AuraFace.
  • spatial_preservation: Computes structure distance as the Frobenius norm of the difference between Gram matrices derived from DINO features, implemented using the code from Splice (see the sketch after this list).
  • text_rendering: Computes text rendering accuracy using VLM-as-a-judge.
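
Two of these computations are simple enough to sketch here. First, the color shift metric, assuming 32-bin per-channel RGB histograms normalized to sum to one; the repository's exact histogram settings and function names may differ.

import numpy as np
from PIL import Image

def channel_histograms(path, bins=32):
    # Normalized per-channel RGB histograms of an image.
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0] for c in range(3)]
    hists = np.stack(hists).astype(float)
    return hists / hists.sum(axis=1, keepdims=True)

def color_shift(input_path, output_path, bins=32):
    # Color shift magnitude: mean absolute difference between the
    # input and output images' color histograms.
    return float(np.mean(np.abs(channel_histograms(input_path, bins)
                                - channel_histograms(output_path, bins))))

Second, the structure distance, assuming DINO patch features (shape [num_patches, dim], with the same number of patches for both images) have already been extracted. In this sketch the Gram matrix is the cosine self-similarity of the patch features; all names are illustrative rather than the actual Splice-based implementation.

import numpy as np

def structure_distance(feats_a, feats_b):
    # Frobenius norm of the difference between the two images'
    # self-similarity (Gram) matrices of unit-normalized patch features.
    def gram(feats):
        f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        return f @ f.T
    return float(np.linalg.norm(gram(feats_a) - gram(feats_b), ord="fro"))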

The get_metrics_assignment folder contains code for classifying which metrics apply to each sample.

Data Collection Pipeline

Coming soon.

BibTeX

@article{ge2025echo,
  title={Constantly Improving Image Models Need Constantly Improving Benchmarks},
  author={Jiaxin Ge and Grace Luo and Heekyung Lee and Nishant Malpani and Long Lian and XuDong Wang and Aleksander Holynski and Trevor Darrell and Sewon Min and David M. Chan},
  journal={arXiv preprint arXiv:2510.15021},
  year={2025}
}
