Clarification Requested on Nucleotide Count in DNABERT2 Pre-training Data (Table 11 vs. Provided Files) #149

@KehanChen510

Hi DNABERT2 authors and maintainers,

First, thank you for the great work on DNABERT2 and for making the pre-training data publicly available on GitHub.

I have a question regarding the nucleotide count statistics presented in the paper compared to the actual data provided.

According to the file in the link, the pre-training dataset (train.txt) contains 32,387,832 sequences, each 1,000 nucleotides long. This implies a total nucleotide count of 32,387,832,000 (approximately 32,387 M nucleotides).
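For reference, here is a minimal Python sketch of how the figure above can be reproduced. The helper name `count_nucleotides` is hypothetical, and the code assumes train.txt stores one sequence per line with no header, which is an assumption about the file layout and not something confirmed by the authors.

```python
def count_nucleotides(path):
    """Return (num_sequences, total_nucleotides) for a text file.

    Assumes one sequence per line; blank lines are skipped.
    """
    num_sequences = 0
    total_nucleotides = 0
    with open(path, "r") as handle:
        for line in handle:
            seq = line.strip()
            if not seq:
                continue
            num_sequences += 1
            total_nucleotides += len(seq)
    return num_sequences, total_nucleotides


# Arithmetic using the figures quoted in this issue.
NUM_SEQUENCES = 32_387_832
SEQUENCE_LENGTH = 1_000
TOTAL = NUM_SEQUENCES * SEQUENCE_LENGTH

print(TOTAL)              # 32387832000
print(TOTAL / 1_000_000)  # 32387.832 (in millions, the unit used in Table 11)
```

Calling `count_nucleotides("train.txt")` on the downloaded file should return the sequence count and total directly, without relying on every sequence being exactly 1,000 nucleotides long.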

However, looking at Table 11 in the paper, the sum of the "Num. of Nucleotides (M)" column for the listed datasets appears to be significantly larger than 32,387 M nucleotides.

Could you help clarify this apparent discrepancy? I want to ensure I'm interpreting the data correctly. Some possibilities I considered, but am unsure about, include:

  • Data Reuse: Is the same sequence data used in multiple training examples or counted multiple times across the listed datasets in Table 11?
  • Different Datasets: Is the dataset from GitHub a subset or a processed version distinct from the combined datasets summarized in Table 11?
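To partly probe the first possibility, a rough sketch like the one below could count how many sequences in the provided file are exact duplicates. The function name `duplicate_stats` is hypothetical, and it again assumes one sequence per line. It only detects exact repeats within the file, so it cannot tell whether Table 11 counts the same genomes more than once.

```python
import hashlib


def duplicate_stats(lines):
    """Return (total_sequences, unique_sequences) for an iterable of lines.

    Each sequence is reduced to a 16-byte BLAKE2b digest so that the set of
    seen sequences stays small compared with storing the full strings.
    """
    seen = set()
    total = 0
    for line in lines:
        seq = line.strip()
        if not seq:
            continue
        total += 1
        seen.add(hashlib.blake2b(seq.encode("ascii"), digest_size=16).digest())
    return total, len(seen)


# Small illustrative input; for the real file, pass an open file handle.
example = ["ACGT\n", "ACGT\n", "TTTT\n", "\n"]
print(duplicate_stats(example))  # (3, 2)
```

For the real data this would be called as `duplicate_stats(open("train.txt"))`. If the two returned numbers are equal, exact duplication inside the file is ruled out as an explanation.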
