Hi DNABERT2 authors and maintainers,
First, thank you for the great work on DNABERT2 and for making the pre-training data publicly available on GitHub.
I have a question regarding the nucleotide count statistics presented in the paper compared to the actual data provided.
According to the file in the link, the pre-training dataset (train.txt) contains 32,387,832 sequences, each 1,000 nucleotides long. This implies a total nucleotide count of 32,387,832,000 (approximately 32,388 M nucleotides).
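For reference, here is the back-of-the-envelope arithmetic behind that total (the sequence count and length are taken from the released train.txt as described above):

```python
# Quick sanity check of the totals quoted above.
num_sequences = 32_387_832   # sequences reported in train.txt
seq_length = 1_000           # nucleotides per sequence

total_nt = num_sequences * seq_length
print(total_nt)              # 32387832000
print(total_nt / 1e6)        # ~32387.8 M nucleotides
```

So the GitHub data comes out to roughly 32.4 B nucleotides, which is the figure I am comparing against the Table 11 column sum.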
However, looking at Table 11 in the paper, the sum of the "Num. of Nucleotides (M)" column for the listed datasets appears to be significantly larger than 32,387 M nucleotides.
Could you help clarify this apparent discrepancy? I want to make sure I'm interpreting the data correctly. Some possibilities I considered, but am unsure about, include:
- Data Reuse: Is the same sequence data used in multiple training examples or counted multiple times across the listed datasets in Table 11?
- Different Datasets: Is the dataset on GitHub a subset or a processed version distinct from the combined datasets summarized in Table 11?