📄 Accepted to ACL 2025 (Main Conference)
🔗 Paper | 🏆 Leaderboard | 📊 Dataset
ManyICLBench is a benchmark designed to evaluate long-context language models (LCLMs) via many-shot in-context learning (ICL). We investigate whether performance improves with additional demonstrations and introduce a new metric, Sample Learning Ratio (SLR), to characterize task types:
- SSL (Similar-Sample Learning): Tasks where models benefit from retrieving similar demonstrations.
- ASL (All-Sample Learning): Tasks where models need to understand all demonstrations.
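The SSL/ASL distinction can be illustrated with a minimal, hypothetical sketch (this is not the benchmark's implementation; the Jaccard similarity, demo pool, and function names below are illustrative assumptions):

```python
# Illustrative only: contrasts SSL-style prompt construction (retrieve the
# demonstrations most similar to the query) with ASL-style construction
# (include every demonstration). Not the benchmark's actual code.

def jaccard(a: str, b: str) -> float:
    """Simple token-overlap similarity between two strings (illustrative)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def ssl_demos(demos, query, k=2):
    """SSL-style: keep only the k demonstrations most similar to the query."""
    ranked = sorted(demos, key=lambda d: jaccard(d["input"], query), reverse=True)
    return ranked[:k]

def asl_demos(demos, query):
    """ASL-style: the model must see all demonstrations to solve the task."""
    return list(demos)

demos = [
    {"input": "translate cat to french", "output": "chat"},
    {"input": "translate dog to french", "output": "chien"},
    {"input": "sum 2 and 3", "output": "5"},
]
query = "translate bird to french"

print(len(ssl_demos(demos, query)))  # 2 (only the similar translation demos)
print(len(asl_demos(demos, query)))  # 3 (the full demonstration pool)
```

On SSL tasks, a few well-chosen neighbors already carry most of the signal, so retrieval-style selection suffices; ASL tasks are the ones where performance keeps improving as the full many-shot context is consumed.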
1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Start the vLLM server and wait until it is ready:
   ```bash
   bash start_vllm_serve.sh
   ```
3. Run the evaluation:
   ```bash
   bash evaluate.sh
   ```
4. Collect the results into a CSV:
   ```bash
   python create_csv.py
   ```

Submit your model results at:
📍 https://huggingface.co/spaces/launch/ManyICLBench_Leaderboard
If you use our benchmark or results in your work, please cite us:
```bibtex
@article{zou2025manyshotincontextlearninglongcontext,
  title={On Many-Shot In-Context Learning for Long-Context Evaluation},
  author={Kaijian Zou and Muhammad Khalifa and Lu Wang},
  journal={arXiv preprint arXiv:2411.07130},
  year={2025}
}
```

- 🧑‍💻 Lead author: Kaijian Zou (zkjzou@umich.edu)
- ❓ For questions or bugs: please open an issue
Part of our codebase is adapted from RULER.
We thank the reviewers at ICLR 2025 and ACL 2025 for their insightful feedback. We also appreciate the Hugging Face and vLLM communities for their tools and infrastructure, which greatly supported this project.