Question about SNP sites retained during abundance estimation #2

@cenyao-x

Hi:
While examining the abundance estimation step in compute_abundances_all.py, I noticed that SNP sites are filtered so that only positions with observed ALT reads in the sample are retained:

var_reads = pd.merge(df_read_counts, df_AF,
                        left_on=['position', 'ref', 'base', 'chrom'],
                        right_on=['POS', 'REF', 'ALT', 'CHROM'],
                        how='inner')

ref_reads = pd.merge(df_read_counts, df_AF,
                        left_on=['position', 'ref', 'base', 'chrom'],
                        right_on=['POS', 'REF', 'REF', 'CHROM'],
                        how='inner')

merged_ref_var = pd.merge(ref_reads.iloc[:, :5], var_reads.iloc[:, :5], on=['chrom','position'], how='inner')

However, every SNP site covered in the sample can be informative, whether it shows only REF reads or also includes ALT reads. In particular, a site with only REF reads can still be informative about other strains that carry the ALT allele at that position (for example, as evidence that those strains are absent or at low abundance).
Is this filtering intentional, or is it a potential bug?
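
To make the concern concrete, here is a minimal sketch on toy data. It is not the repository's code: the `count` and `AF_strainA` columns and all values are made up for illustration, and the REF reads are selected by filtering `base == ref` before the merge. It shows that an inner join between the REF and ALT tables drops a REF-only site, while a left join keeps it with an ALT count of zero.

```python
import pandas as pd

# Toy per-base read counts; the 'count' column name is an assumption.
df_read_counts = pd.DataFrame({
    'chrom':    ['chr1', 'chr1', 'chr1'],
    'position': [100, 100, 200],
    'ref':      ['A', 'A', 'C'],
    'base':     ['A', 'G', 'C'],   # position 200 has REF reads only
    'count':    [8, 2, 10],
})

# Toy SNP table; 'AF_strainA' is a hypothetical allele-frequency column.
df_AF = pd.DataFrame({
    'CHROM': ['chr1', 'chr1'],
    'POS':   [100, 200],
    'REF':   ['A', 'C'],
    'ALT':   ['G', 'T'],
    'AF_strainA': [0.5, 1.0],
})

# Reads whose base matches the reference allele at a known SNP site.
ref_counts = df_read_counts[df_read_counts['base'] == df_read_counts['ref']]
ref_reads = pd.merge(ref_counts, df_AF,
                     left_on=['chrom', 'position', 'ref'],
                     right_on=['CHROM', 'POS', 'REF'],
                     how='inner')

# Reads whose base matches the alternate allele at a known SNP site.
var_reads = pd.merge(df_read_counts, df_AF,
                     left_on=['chrom', 'position', 'ref', 'base'],
                     right_on=['CHROM', 'POS', 'REF', 'ALT'],
                     how='inner')

key = ['chrom', 'position']

# Inner join: only sites with both REF and ALT reads survive.
inner = pd.merge(ref_reads[key + ['count']], var_reads[key + ['count']],
                 on=key, how='inner', suffixes=('_ref', '_alt'))

# Left join: REF-only sites are kept, with the missing ALT count set to 0.
left = pd.merge(ref_reads[key + ['count']], var_reads[key + ['count']],
                on=key, how='left', suffixes=('_ref', '_alt'))
left['count_alt'] = left['count_alt'].fillna(0).astype(int)

print(inner)
print(left)
```

In this toy case the inner join keeps only position 100, while the left join also keeps position 200 with zero ALT reads. Whether a left join is the right fix for the actual pipeline depends on how the downstream abundance model treats zero ALT counts, which I have not checked.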
