@calm329 calm329 commented Dec 22, 2025

Summary

  • Add an R2DatasetLoader class that fetches training data from Parquet files in Cloudflare R2 instead of the HuggingFace API
  • Add a --neuron.data_source config option to switch between huggingface (default) and r2
  • Add a get_dataset_loader() factory function that returns the loader matching the configured data source
  • Update the miner, validator, reward, avg_handler, and eval scripts to use the configurable loader
  • Keep the same deterministic seeding mechanism (block + UID) for data selection
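
For reviewers who want a picture of the loader selection, here is a minimal sketch of how a factory keyed on the --neuron.data_source value could look. It is not the code in this PR: the two loader classes are stand-ins, and the seed parameter, the name field, and the _LOADERS table are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HuggingFaceLoader:
    """Stand-in for the existing HuggingFace-backed loader (hypothetical shape)."""
    seed: int
    name: str = "huggingface"


@dataclass
class R2DatasetLoader:
    """Stand-in for the R2-backed loader; the real constructor may differ."""
    seed: int
    name: str = "r2"


# Maps each accepted --neuron.data_source value to a loader class (assumed layout).
_LOADERS = {"huggingface": HuggingFaceLoader, "r2": R2DatasetLoader}


def get_dataset_loader(data_source: str = "huggingface", *, seed: int = 0):
    """Return a loader instance for the configured data source."""
    try:
        loader_cls = _LOADERS[data_source]
    except KeyError:
        # Fail early on a typo instead of silently using the default source.
        raise ValueError(
            f"unknown data_source {data_source!r}; expected one of {sorted(_LOADERS)}"
        ) from None
    return loader_cls(seed=seed)
```

With a table like this, call sites in the miner or validator only need the config value, and adding another source means adding one entry.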

Test plan

  • Set the R2 dataset environment variables (R2_DATASET_ACCOUNT_ID, R2_DATASET_BUCKET_NAME, R2_DATASET_ACCESS_KEY_ID, R2_DATASET_SECRET_ACCESS_KEY)
  • Run the miner with --neuron.data_source r2 and verify that data loads from R2
  • Run the miner with --neuron.data_source huggingface (or the default) and verify that existing behavior is unchanged
  • Verify that deterministic page selection produces the same results for the same block + UID
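
The first and last steps of the test plan could be scripted roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: missing_r2_vars and select_pages are hypothetical helpers, and the way block and UID are combined into a seed here (hashing the string "block:uid") is an illustrative choice that may differ from the real mechanism. Only the four environment variable names come from this PR.

```python
import hashlib
import os
import random

# Environment variable names listed in the test plan.
REQUIRED_R2_VARS = (
    "R2_DATASET_ACCOUNT_ID",
    "R2_DATASET_BUCKET_NAME",
    "R2_DATASET_ACCESS_KEY_ID",
    "R2_DATASET_SECRET_ACCESS_KEY",
)


def missing_r2_vars(env=None):
    """Return the required R2 variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_R2_VARS if not env.get(name)]


def select_pages(block: int, uid: int, n_pages: int, total_pages: int):
    """Pick page indices reproducibly from block + UID.

    Assumption: the seed is derived by hashing "block:uid"; the real loader
    may combine the two values differently.
    """
    digest = hashlib.sha256(f"{block}:{uid}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.randrange(total_pages) for _ in range(n_pages)]


if __name__ == "__main__":
    missing = missing_r2_vars()
    if missing:
        print("missing R2 variables:", ", ".join(missing))
    # Same block + UID must give the same pages on every call.
    assert select_pages(100, 7, 5, 1000) == select_pages(100, 7, 5, 1000)
```

Using a private random.Random instance keeps the selection independent of any global random state elsewhere in the process.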

Closes #49


Contribution by Gittensor, learn more at https://gittensor.io/

Development

Successfully merging this pull request may close these issues.

Switch DataLoader from HuggingFace to R2
