
Scale Efficient Training for Large Datasets

CVPR 2025 | [Paper] | [Code]

Authors: Qing Zhou, Junyu Gao, Qi Wang

SeTa is a plug-and-play efficient training tool for large and diverse datasets, requiring only a three-line code change.

Supported pruning methods: Random, InfoBatch, and SeTa.

[Figure: SeTa overview]

Installation

pip install git+https://github.com/mrazhou/SeTa

Or you can clone this repo and install it locally.

git clone https://github.com/mrazhou/SeTa
cd SeTa
pip install -e .
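
After either install, a quick import check confirms the package is available (prune is the only name the usage below relies on):

python -c "from seta import prune"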

Usage

To adapt your code to SeTa/InfoBatch, change just the following three lines:

from torch.utils.data import DataLoader

from seta import prune

# 1. Wrap the dataset (args carries prune_type, epochs, ...)
train_data = prune(train_data, args)

# 2. Pass the wrapper's sampler to the DataLoader
train_loader = DataLoader(train_data, sampler=train_data.sampler)

for epoch in range(args.epochs):
    for batch in train_loader:
        ...  # forward pass and per-sample loss as usual
        # 3. Let the wrapper rescale the loss and record sample scores
        loss = train_data.update(loss)
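
For context, here is a minimal end-to-end sketch of those three changes inside a standard PyTorch training loop. Only prune, .sampler, and .update come from this repository; the dataset, model, optimizer, and the reduction="none" criterion (the per-sample-loss convention documented by InfoBatch) are illustrative assumptions:

import argparse

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

from seta import prune

# Hypothetical argument values; see the repo's scripts for the exact
# names and defaults that prune() expects.
args = argparse.Namespace(prune_type="SeTa", epochs=10)

train_data = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
train_data = prune(train_data, args)                   # 1. wrap dataset
train_loader = DataLoader(train_data, batch_size=128,
                          sampler=train_data.sampler)  # 2. pass sampler

model = models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss(reduction="none")  # per-sample losses

for epoch in range(args.epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss = train_data.update(loss)                 # 3. update loss
        loss.backward()
        optimizer.step()

As in InfoBatch, update consumes the per-sample loss vector and returns a scalar ready for backward(), so nothing else in the loop needs to change.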

Experiments

This repository provides two examples demonstrating SeTa: CIFAR10/CIFAR100 (supporting ResNet-18/50/101) and ImageNet (supporting various CNNs, Transformers, and Mamba).

  • CIFAR10/CIFAR100 (for exploratory research)
# For lossless pruning with SeTa
bash scripts/cifar.sh SeTa 0.1 5 0.9

# For InfoBatch
bash scripts/cifar.sh InfoBatch 0.5

# For Static Random
bash scripts/cifar.sh Static

# For Dynamic Random
bash scripts/cifar.sh SeTa 0 1 1

  • ImageNet (for large-scale, cross-architecture validation)
# For lossless pruning with SeTa
bash scripts/imagenet.sh

# For CNNs
bash scripts/imagenet.sh SeTa mobilenetv3_small_050

# For Transformers
bash scripts/imagenet.sh SeTa vit_tiny_patch16_224

# For Vim
# refer to https://github.com/hustvl/Vim for more details
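
Both ImageNet examples pass what appear to be timm model names after the pruning method, so other architectures can presumably be swapped in the same way (hypothetical invocation; check scripts/imagenet.sh for the exact arguments):

bash scripts/imagenet.sh SeTa resnet50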

Results

[Figures: SeTa results]

Citation

If you find this repository helpful, please consider citing our paper:

@inproceedings{zhou2025scale,
  title={Scale Efficient Training for Large Datasets},
  author={Zhou, Qing and Gao, Junyu and Wang, Qi},
  booktitle={CVPR},
  year={2025}
}

and the original InfoBatch paper:

@inproceedings{qin2024infobatch,
  title={InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning},
  author={Qin, Ziheng and Wang, Kai and Zheng, Zangwei and Gu, Jianyang and Peng, Xiangyu and Xu, Zhaopan and Zhou, Daquan and Shang, Lei and Sun, Baigui and Xie, Xuansong and You, Yang},
  booktitle={ICLR},
  year={2024}
}

Acknowledgments

Many thanks to InfoBatch for providing the basic three-line framework for dynamic data pruning.
