
@JorritvanGils JorritvanGils commented Oct 7, 2025

This PR replaces the HuggingFace data loader with a Cloudflare R2–based implementation.

Changes

  • New DatasetLoader for reading sharded Parquet datasets from R2.
  • Adds S3-compatible access via s3fs.
  • Full multi-GPU / multi-rank support.
  • The new data loader replaces the previous one in several scripts (see table below).
Script          UID input         Varying rank data   Data source   Buffer reduction
miner.py        uid + rank        yes                 random        no
reward.py       miner_uid + rank  yes                 miner_data    yes
avg_handler.py  uid               no                  random        yes
eval_loss.py    uid               no                  random        no
eval_loss.py uid no random no

Loader options:

  • randomness: randomize start config/shard/row_group/row.
  • max_*: limit configs/shards/row_groups/rows.
  • debugging
  • reduce_buffer_size(method): truncate or sample.
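To illustrate the difference between the two reduction methods named above, here is a minimal sketch. The function `reduce_buffer`, its parameters, and the choice to keep sampled items in their original order are assumptions made for illustration; the PR's actual `reduce_buffer_size` may behave differently.

```python
import random
from typing import List, Sequence


def reduce_buffer(
    buffer: Sequence[int],
    target_size: int,
    method: str = "truncate",
    seed: int = 0,
) -> List[int]:
    """Hypothetical helper: shrink a token buffer to at most target_size items."""
    if target_size < 0:
        raise ValueError("target_size must be non-negative")
    if target_size >= len(buffer):
        # Nothing to reduce; return a copy so callers can mutate it safely.
        return list(buffer)
    if method == "truncate":
        # Keep the first target_size items and drop the tail.
        return list(buffer[:target_size])
    if method == "sample":
        # Pick target_size positions with a seeded RNG, then restore
        # their original order so the result is a subsequence.
        rng = random.Random(seed)
        keep = sorted(rng.sample(range(len(buffer)), target_size))
        return [buffer[i] for i in keep]
    raise ValueError(f"unknown method: {method!r}")
```

Truncation is cheap and deterministic but always favors the start of the buffer; seeded sampling spreads the kept items over the whole buffer while staying reproducible for a given seed.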

Reproducibility is maintained using a seeded RNG derived from block_number, uid, and, where data varies per rank (miner.py and reward.py in the table above), rank.
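One way such a seed could be derived is sketched below. The helper names `derive_seed` and `pick_start`, the use of SHA-256, and the 64-bit seed width are all assumptions for illustration, not the PR's implementation.

```python
import hashlib
import random
from typing import Optional


def derive_seed(block_number: int, uid: int, rank: Optional[int] = None) -> int:
    """Hypothetical helper: hash the inputs into a stable 64-bit seed."""
    parts = [str(block_number), str(uid)]
    if rank is not None:
        # Including rank gives each GPU rank its own data stream.
        parts.append(str(rank))
    digest = hashlib.sha256(":".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_start(seed: int, n_shards: int) -> int:
    """Hypothetical helper: choose a starting shard index from a seed."""
    rng = random.Random(seed)
    return rng.randrange(n_shards)
```

With this scheme, any party that knows block_number, uid, and rank can recompute the same starting position, which is what would let one script re-read the data another script used.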

Try it out:

  1. Migrate edu-dataset bucket to your R2 account.

  2. Check out the PR and install:

git clone https://github.com/dstrbtd/DistributedTraining.git
cd DistributedTraining

git fetch origin pull/81/head:r2-dataset-integration
git checkout r2-dataset-integration

pip install -e .
  3. Create a .env file at the project root and export your R2 credentials:
cat > .env << 'EOF'
export R2_ACCOUNT_ID="..."
export R2_BUCKET_NAME="edu-dataset"
export R2_READ_ACCESS_KEY_ID="..."
export R2_READ_SECRET_ACCESS_KEY="..."
export R2_WRITE_ACCESS_KEY_ID="..."
export R2_WRITE_SECRET_ACCESS_KEY="..."
export R2_ADMIN_ACCESS_KEY_ID="..."
export R2_ADMIN_SECRET_ACCESS_KEY="..."
EOF

set -a
source .env
set +a
  4. Run the miner/validator as usual.
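As a quick check that the exported read credentials work, the following sketch builds an s3fs filesystem pointed at R2. The helper names `r2_read_config` and `open_r2_filesystem` are hypothetical; the environment variable names come from the .env file above, and the endpoint follows Cloudflare's `https://<account_id>.r2.cloudflarestorage.com` S3 API format. How the PR's DatasetLoader actually constructs its filesystem is not shown here.

```python
import os
from typing import Any, Dict, Mapping


def r2_read_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Hypothetical helper: build s3fs keyword arguments from the R2 variables."""
    account_id = env["R2_ACCOUNT_ID"]
    return {
        "key": env["R2_READ_ACCESS_KEY_ID"],
        "secret": env["R2_READ_SECRET_ACCESS_KEY"],
        # R2 exposes an S3-compatible API at an account-specific endpoint.
        "client_kwargs": {
            "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com"
        },
    }


def open_r2_filesystem(env: Mapping[str, str] = os.environ):
    """Hypothetical helper: open a read-only view of the bucket via s3fs."""
    import s3fs  # imported lazily so the config helper works without s3fs installed

    return s3fs.S3FileSystem(**r2_read_config(env))
```

After `source .env`, calling `open_r2_filesystem().ls(os.environ["R2_BUCKET_NAME"])` should list the top-level entries of the bucket if the credentials and bucket name are correct.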

See also my fork: r2-dataset-integration branch

@JorritvanGils JorritvanGils force-pushed the r2-dataset-integration branch 7 times, most recently from 9726b1e to 6c7f699 Compare October 14, 2025 15:10
@JorritvanGils JorritvanGils force-pushed the r2-dataset-integration branch 7 times, most recently from fbd0a55 to 5d8f9d7 Compare December 10, 2025 17:33
@JorritvanGils JorritvanGils force-pushed the r2-dataset-integration branch 2 times, most recently from 291bddc to d44b481 Compare December 18, 2025 10:08