Hi DNABERT2 authors and maintainers,
First, thank you for the great work on DNABERT2 and for making the pre-training data publicly available on GitHub.
I have a question regarding the nucleotide count statistics presented in the paper compared to the actual data provided.
According to the file in the link, the pre-training dataset (train.txt) contains 32,387,832 sequences, each 1,000 nucleotides long. This implies a total nucleotide count of 32,387,832,000 (approximately 32,388 M nucleotides).
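For reference, here is the back-of-the-envelope arithmetic behind that total (the sequence count and length are taken from the released train.txt as described above):

```python
# Quick sanity check of the totals quoted above.
num_sequences = 32_387_832   # sequences reported in train.txt
seq_length = 1_000           # nucleotides per sequence

total_nt = num_sequences * seq_length
print(total_nt)              # 32387832000
print(total_nt / 1e6)        # ~32387.8 M nucleotides
```

So the GitHub data comes out to roughly 32.4 B nucleotides, which is the figure I am comparing against the Table 11 column sum.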
However, looking at Table 11 in the paper, the sum of the "Num. of Nucleotides (M)" column for the listed datasets appears to be significantly larger than 32,387 M nucleotides.
Could you help clarify this apparent discrepancy? I want to make sure I'm interpreting the data correctly. Some possibilities I considered, but am unsure about, include:
- Data Reuse: Is the same sequence data used in multiple training examples or counted multiple times across the listed datasets in Table 11?
- Different Datasets: Is the dataset on GitHub a subset or a processed version distinct from the combined datasets summarized in Table 11?