
ManyICLBench

📄 Accepted to ACL 2025 (Main Conference)
🔗 Paper | 🏆 Leaderboard | 📊 Dataset


📌 Overview

ManyICLBench is a benchmark designed to evaluate long-context language models (LCLMs) via many-shot in-context learning (ICL). We investigate whether performance improves with additional demonstrations and introduce a new metric, Sample Learning Ratio (SLR), to characterize task types:

  • SSL (Similar-Sample Learning): Tasks where models benefit from retrieving similar demonstrations.
  • ASL (All-Sample Learning): Tasks where models need to understand all demonstrations.
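
As a rough intuition only (an illustrative sketch, not the paper's exact SLR definition), SLR can be pictured as the fraction of the full many-shot gain that is recovered when the model sees only a smaller set of retrieved, similar demonstrations: values near 1 suggest SSL-style tasks, values near 0 suggest ASL-style tasks. All numbers below are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's exact SLR formula.
def sample_learning_ratio(acc_retrieved: float,
                          acc_full: float,
                          acc_baseline: float) -> float:
    """Fraction of the full many-shot gain recovered when the model
    sees only retrieved (similar) demonstrations instead of all of them."""
    gain_full = acc_full - acc_baseline
    if gain_full <= 0:
        return float("nan")  # no many-shot gain to attribute
    return (acc_retrieved - acc_baseline) / gain_full

# Hypothetical accuracies:
print(sample_learning_ratio(0.72, 0.75, 0.50))  # ~0.88 -> SSL-like task
print(sample_learning_ratio(0.55, 0.75, 0.50))  # ~0.20 -> ASL-like task
```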

⚙️ Getting Started

1. Install dependencies

pip install -r requirements.txt

2. Run evaluation on a model

bash start_vllm_serve.sh

Wait until the server is up before launching the evaluation.
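
To check readiness programmatically, here is a minimal sketch that polls vLLM's OpenAI-compatible /v1/models endpoint; the default localhost:8000 address is an assumption, so adjust it to match start_vllm_serve.sh:

```python
# Readiness check for a vLLM OpenAI-compatible server.
# The base URL is an assumption; match it to start_vllm_serve.sh.
import time
import requests

def wait_for_server(base_url: str = "http://localhost:8000/v1",
                    timeout_s: int = 600) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/models", timeout=5)
            if r.ok:
                ids = [m["id"] for m in r.json()["data"]]
                print("Server ready; serving:", ids)
                return
        except requests.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(5)
    raise TimeoutError("vLLM server did not become ready in time")

if __name__ == "__main__":
    wait_for_server()
```

Once the server responds, run the evaluation: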

bash evaluate.sh

3. Generate your final results

python create_csv.py
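
For reference, this step reduces the per-task scores into a single table. The sketch below is hypothetical and not the actual create_csv.py logic; it assumes each file in results/ is a JSON object with task, num_shots, and score fields:

```python
# Hypothetical aggregation sketch -- not the actual create_csv.py.
# Assumes results/*.json each hold {"task": ..., "num_shots": ..., "score": ...}.
import csv
import glob
import json

rows = [json.load(open(p)) for p in glob.glob("results/*.json")]
rows.sort(key=lambda r: (r["task"], r["num_shots"]))

with open("final_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["task", "num_shots", "score"])
    writer.writeheader()
    writer.writerows(rows)
```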

🚀 Leaderboard

Submit your model results at:
📍 https://huggingface.co/spaces/launch/ManyICLBench_Leaderboard

📑 Citation

If you use our benchmark or results in your work, please cite us:

@article{zou2025manyshotincontextlearninglongcontext,
  title={On Many-Shot In-Context Learning for Long-Context Evaluation}, 
  author={Kaijian Zou and Muhammad Khalifa and Lu Wang},
  journal={arXiv preprint arXiv:2411.07130},
  year={2025}
}

📬 Contact

🙏 Acknowledgements

Part of the codebase is based on RULER.

We thank the reviewers at ICLR 2025 and ACL 2025 for their insightful feedback. We also appreciate the Hugging Face and vLLM communities for their tools and infrastructure, which greatly supported this project.
