Skip to content

Conversation

@ghuls
Copy link
Contributor

@ghuls ghuls commented Jan 20, 2025

Fix off by one bug when converting BAM files to d4 files.

Fixes: #75

Fix off by one bug when converting BAM files to d4 files.

Fixes: 38#75
@ghuls
Copy link
Contributor Author

ghuls commented Jan 20, 2025

A previous pull request semi worked (although fixing ref_end instead of ref_len feels cleaner) : #76 , except for reads that start at position 0.

The new implementation is tested against samtools depth and probably should be converted to a test that can be run that d4tools always works correctly:

# Samtools depth parameters:
$ samtools depth
Usage: samtools depth [options] in.bam [in.bam ...]

Options:
  -a           Output all positions (including zero depth)
  -a -a, -aa   Output absolutely all positions, including unused ref seqs
  -r REG       Specify a region in chr or chr:from-to syntax
  -b FILE      Use bed FILE for list of regions
  -f FILE      Specify list of input BAM/SAM/CRAM filenames
  -X           Use custom index files (in -X *.bam *.bam.bai order)
  -g INT       Remove specified flags from default filter-out flag list
  -G, --excl-flags FLAGS
               Add specified flags to the  default filter-out flag list
               [UNMAP,SECONDARY,QCFAIL,DUP]
      --incl-flags FLAGS
               Only include records with at least one the FLAGs present [0]
      --require-flags FLAGS
               Only include records with all of the FLAGs present [0]
  -H           Print a file header line
  -l INT       Minimum read length [0]
  -o FILE      Write output to FILE [stdout]
  -q, --min-BQ INT
               Filter bases with base quality smaller than INT [0]
  -Q, --min-MQ INT
               Filter alignments with mapping quality smaller than INT [0]
  -J           Include reads with deletions in depth computation
  -s           Do not count overlapping reads within a template
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

# Run SAMtools depth:
#   -a: Output all positions (0-depth)
#   -Q 10: Filter reads with mapping quality smaller than 10.
#       Reads with UNMAP,SECONDARY,QCFAIL,DUP are filtered by default too (see -G option)
#   -J: Count "D" in CIGAR string as coverage
#       (d4tools create should not count positions in reads that are "D" in the CIGAR string
#       but currently does this anyway).
# Convert SAMtools depth output to bedGraph.
# Compact adjacent bedGraph entries if they have the same value with bedGraphPack of Kent tools
samtools depth -@ 4 -aa -Q 10 -J test.bam \
  | awk -F '\t' '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3 }' \
  | bedGraphPack /dev/stdin test.samtools_depth_MQ10_count_deletions.bedGraph

# Run SAMtools flags to get integer flag value to use with d4tools to filter out the same reads:
$ samtools flags UNMAP,SECONDARY,QCFAIL,DUP
0x704   1796    UNMAP,SECONDARY,QCFAIL,DUP


# Run d4tools create:
#   -q 10: Filter reads with mapping quality smaller than 10.
#   -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags to make
#            sure that d4tools uses the same reads as samtools depth
#   -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4
 

# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4 \
  > test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph


# No differences:
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph

@ghuls
Copy link
Contributor Author

ghuls commented Jan 20, 2025

Unrelated to this patch, but it seems to me that d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data:

# Run d4tools create:
#   -q 10: Filter reads with mapping quality smaller than 10.
#   -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
#   -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry of a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compresssed.d4
 

# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.d4 \
  > test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph


# Run d4tools create:
#   -q 10: Filter reads with mapping quality smaller than 10.
#   -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
#   without compression: sometimes writes a wrong depth value as last entry for a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4

# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4 \
  > test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
# With compressed output samtools depth and d4tools give the same.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph

# With uncompressed output samtools depth and d4tools give the same.
# Only for 3 chromosomes out of 1870 chromosomes this problem appears.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
--- test.samtools_depth_MQ10_count_deletions.bedGraph   2025-01-20 11:37:33.000000000 +0100
+++ test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph      2025-01-20 17:47:31.000000000 +0100
@@ -15016909,7 +15016909,8 @@
 211000022280772        13654   13701   1
 211000022280772        13701   13713   0
 211000022280772        13713   13762   1
-211000022280772        13762   15417   0
+211000022280772        13762   13763   0
+211000022280772        13763   15417   2
 211000022280659        0       594     0
 211000022280659        594     642     1
 211000022280659        642     703     0
@@ -15017355,7 +15017356,8 @@
 211000022279098        12261   12270   3
 211000022279098        12270   12289   2
 211000022279098        12289   12310   1
-211000022279098        12310   12368   0
+211000022279098        12310   12311   0
+211000022279098        12311   12368   2
 211000022280761        0       75      0
 211000022280761        75      123     1
 211000022280761        123     1433    0
@@ -15030169,7 +15030171,8 @@
 211000022280548        1426    1434    1
 211000022280548        1434    1474    2
 211000022280548        1474    1483    1
-211000022280548        1483    2618    0
+211000022280548        1483    1484    0
+211000022280548        1484    2618    2
 211000022280553        0       2561    0
 211000022280554        0       228     0
 211000022280554        228     277     1
@@ -15043475,7 +15043478,8 @@
 211000022279504        1843    1850    15
 211000022279504        1850    1854    14
 211000022279504        1854    1855    12
-211000022279504        1855    2317    0
+211000022279504        1855    1856    0
+211000022279504        1856    2317    2
 211000022279505        0       44      0
 211000022279505        44      92      1
 211000022279505        92      114     0

When running d4tools create only for those chromosomes, it does not seem to happen.

When adding a bunch of other chromosomes, sometimes it gives the correct output and sometimes the wrong output, so I assume it has something to do with a reuse of a buffer that does not get cleared properly (and due to threading, it does not always contain the same value):

uncompressed_output_instability () {
    /staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools create -q 10 -F '~1796' test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4  -f '^2R|^211000022[0-9]{6}' > /dev/null

    /staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 \
        | rg -A 5 211000022280548 \
        | tail
}

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2  ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1265    1426    0
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    2618    0 ==> correct
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2 ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    1484    0
211000022280548 1484    2618    2 ==> wrong
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

$ uncompressed_output_instability
211000022280548 1265    1426    0
211000022280548 1426    1434    1
211000022280548 1434    1474    2
211000022280548 1474    1483    1
211000022280548 1483    2618    0 ==> correct
211000022280553 0       2561    0
211000022280554 0       228     0
211000022280554 228     277     1
211000022280554 277     330     0
211000022280554 330     377     1

@ghuls
Copy link
Contributor Author

ghuls commented Jan 21, 2025

@brentp Can you take a look?

@brentp
Copy link
Collaborator

brentp commented Jan 21, 2025

thank you! this looks good to me
@cademirch @Jakob37 , care to take a look?
would be nice to add a test here.
can you open a new issue for the end-of-chromosome?

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 22, 2025

Nice @ghuls ! I'll give this a test run

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 22, 2025

Works well in my hands for the example in the issue you provided.

Setting up the test data - genome 30 bp and read 21 bp.

$ cat genome/genome_30.fa
>chr1
ACGTAACCGGTTAAACCCGGGTTTAAAACC
$ grep -v "^>" genome/genome_30.fa | tr -d "\n" |  wc -c
30
$ cat fake_read.fa
>read1
AACCGGTTAAACCCGGGTTTA
$ grep -v "^>" fake_read.fa | tr -d "\n" |  wc -c
21

Aligning and indexing.

$ bwa index genome/genome_30.fa
$ bwa mem -k 2 -T 0 genome/genome.fa reads.fa | samtools sort -o out.bam -
$ samtools index out.bam

Checking samtools depth.

$ samtools depth -a out.bam
chr1    1       0
chr1    2       0
chr1    3       0
chr1    4       0
chr1    5       1
chr1    6       1
chr1    7       1
chr1    8       1
chr1    9       1
chr1    10      1
chr1    11      1
chr1    12      1
chr1    13      1
chr1    14      1
chr1    15      1
chr1    16      1
chr1    17      1
chr1    18      1
chr1    19      1
chr1    20      1
chr1    21      1
chr1    22      1
chr1    23      1
chr1    24      1
chr1    25      1
chr1    26      0
chr1    27      0
chr1    28      0
chr1    29      0
chr1    30      0

Creating d4tools using the current version and the fixed version.

$ d4tools create -q0 out.bam out_current.d4
$ /home/jakob/src/d4-format-ghuls/target/debug/d4tools create -q0 out.bam out_ghul.d4

Checking the depths

$ d4tools show out_current.d4
chr1    0       4       0
chr1    4       26      1
chr1    26      30      0
$ d4tools show out_ghul.d4
chr1    0       4       0
chr1    4       25      1
chr1    25      30      0

Nice 👌

I agree that it would be great to add a unit test for this!

@ghuls
Copy link
Contributor Author

ghuls commented Jan 22, 2025

Also before (or at least with #76), if your read
It is also important to have reads that start or end at the first/last position of the chromosome and to have multiple of them.
It took a bit of time to write this patch as in a previous iteration when there was more than one read that started at the first position,
for chr 0 1 I would only get 1 as count and for further positions only it was having the correct depth.

$ cat genome_mini.fa
>chr1
ACGTAACCGGTTAAACCCGGGTTTAAAACC
>chr2
GCGCAGCTAAATTGCCCCAGGGCTCGAGGGATACCATATT

$ cat fake_reads.fa 
>read_full_chr1
ACGTAACCGGTTAAACCCGGGTTTAAAACC
>read_start_chr1
ACGTAACCGGTTAA
>read_end_chr1
GGGTTTAAAACC
>read_middle_chr1
AACCGGTTAAACCCGGGTTTAA
>read_start_chr2
GCGCAGCTAAATTG
>read_start2_chr2
GCGCAGCTAAATTGC
>read_start2_chr2
GCGCAGCTAAATTGC
>read_end1_chr2
CTCGAGGGATACCATATT
>read_end2_chr2
GGGCTCGAGGGATACCATATT
>read_end3_chr2
TCGAGGGATACCATAT
>read_end4_chr2
GAGGGATACCATA
$ bwa mem -k 2 -T 0  genome_mini.fa fake_reads.fa | samtools sort -o out.bam -

$ samtools index out.bam

$ d4tools create -q 0 out.bam out.d4tools_current.d4 
$ d4tools create -q 0 out.bam out.d4tools_ghuls.d4

$ d4tools view out.d4tools_current.d4 out.d4tools_current.bdg
$ d4tools view out.d4tools_ghuls.d4 out.d4tools_ghuls.bdg

$ samtools depth -aa out.bam | awk '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3}' | bedGraphPack /dev/stdin out.samtools_depth.bdg

$ head -n 50 out.d4tools_current.bdg out.d4tools_ghuls.bdg out.samtools_depth.bdg                                                                                                                              
==> out.d4tools_current.bdg <==                                                                                                                                                             
chr1    0       1       0
chr1    1       4       2
chr1    4       15      3
chr1    15      18      2                           
chr1    18      27      3                      
chr1    27      30      2      
chr2    0       1       0
chr2    1       15      3
chr2    15      16      2
chr2    16      19      0
chr2    19      22      1
chr2    22      23      2
chr2    23      25      3
chr2    25      39      4
chr2    39      40      3
                                               
==> out.d4tools_ghuls.bdg <==
chr1    0       4       2
chr1    4       14      3
chr1    14      18      2                           
chr1    18      26      3                      
chr1    26      30      2 
chr2    0       14      3
chr2    14      15      2
chr2    15      19      0
chr2    19      22      1
chr2    22      23      2
chr2    23      25      3
chr2    25      38      4
chr2    38      39      3
chr2    39      40      2
                                                    
==> out.samtools_depth.bdg <==                          
chr1    0       4       2                           
chr1    4       14      3                           
chr1    14      18      2                           
chr1    18      26      3                                           
chr1    26      30      2                                                                                                                                                                                          
chr2    0       14      3                           
chr2    14      15      2                           
chr2    15      19      0                                           
chr2    19      22      1                                           
chr2    22      23      2                                           
chr2    23      25      3                                           
chr2    25      38      4                                           
chr2    38      39      3                                           
chr2    39      40      2

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 22, 2025

It is also important to have reads that start or end at the first/last position of the chromosome and to have multiple of them.

OK, well spotted! And that looks like a nice test case.

Would you be able to add this as a test to this PR @ghuls ? I think brentp would be happy to merge it after getting it tested.

I.e. adding to https://github.com/38/d4-format/tree/master/d4tools/test/create

Just let me know if you want help with this. This bug impacts us as well. Seems like the d4tools-devs are gone / missing, so it is kind of a community effort to keep this going.

@ghuls
Copy link
Contributor Author

ghuls commented Jan 22, 2025

Help with adding the test would be great.

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 22, 2025

Help with adding the test would be great.

OK, I'll look into it 👍

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 23, 2025

I put up a PR towards this PR with the two additional unit tests outlined above over at your repo: ghuls#1

The existing test bam naturally needed updating.

There are still differences to the samtools depth -J output. Not sure if expected or not.

If we resolve / understand the reasons for them appearing, then I think this PR would be good to go.

@ghuls
Copy link
Contributor Author

ghuls commented Jan 23, 2025

There was a discrepancy in filtering of the BAM reads between SAMtools and d4tools. The SAMtools default makes more sense to me.

@cademirch
Copy link
Contributor

Seems like yall have this handled - ty @ghuls and @Jakob37! Happy to help where needed, just lmk.

@Jakob37
Copy link
Contributor

Jakob37 commented Jan 24, 2025

There was a discrepancy in filtering of the BAM reads between SAMtools and d4tools. The SAMtools default makes more sense to me.

I see!

I guess for this PR the current d4tools output is what we'll use in the tests. Good that you investigated it.

I don't have write rights neither here nor to your PR. But if you think my PR looks OK over at your repo - merge it into this PR and then we'll forward it to brentp.

Just let me know if you want further help.

@ghuls
Copy link
Contributor Author

ghuls commented Feb 4, 2025

@Jakob37 Merged your tests.

@brentp
Copy link
Collaborator

brentp commented Feb 4, 2025

so this now matches samtools output?

@ghuls
Copy link
Contributor Author

ghuls commented Feb 4, 2025

Yes, at least if you use the correct flagsfor both d4 and samtools depth as in : #97 (comment)

@brentp brentp merged commit 729ed9d into 38:master Feb 4, 2025
1 check passed
@ghuls ghuls mentioned this pull request Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: off-by-one bug on bam reads

4 participants