-
Notifications
You must be signed in to change notification settings - Fork 20
Fix off by one bug when converting BAM files to d4 files. #97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fix off by one bug when converting BAM files to d4 files. Fixes: 38#75
|
A previous pull request semi worked (although fixing The new implementation is tested against # Samtools depth parameters:
$ samtools depth
Usage: samtools depth [options] in.bam [in.bam ...]
Options:
-a Output all positions (including zero depth)
-a -a, -aa Output absolutely all positions, including unused ref seqs
-r REG Specify a region in chr or chr:from-to syntax
-b FILE Use bed FILE for list of regions
-f FILE Specify list of input BAM/SAM/CRAM filenames
-X Use custom index files (in -X *.bam *.bam.bai order)
-g INT Remove specified flags from default filter-out flag list
-G, --excl-flags FLAGS
Add specified flags to the default filter-out flag list
[UNMAP,SECONDARY,QCFAIL,DUP]
--incl-flags FLAGS
Only include records with at least one the FLAGs present [0]
--require-flags FLAGS
Only include records with all of the FLAGs present [0]
-H Print a file header line
-l INT Minimum read length [0]
-o FILE Write output to FILE [stdout]
-q, --min-BQ INT
Filter bases with base quality smaller than INT [0]
-Q, --min-MQ INT
Filter alignments with mapping quality smaller than INT [0]
-J Include reads with deletions in depth computation
-s Do not count overlapping reads within a template
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--verbosity INT
Set level of verbosity
# Run SAMtools depth:
# -a: Output all positions (0-depth)
# -Q 10: Filter reads with mapping quality smaller than 10.
# Reads with UNMAP,SECONDARY,QCFAIL,DUP are filtered by default too (see -G option)
# -J: Count "D" in CIGAR string as coverage
# (d4tools create should not count positions in reads that are "D" in the CIGAR string
# but currently does this anyway).
# Convert SAMtools depth output to bedGraph.
# Compact adjacent bedGraph entries if they have the same value with bedGraphPack of Kent tools
samtools depth -@ 4 -aa -Q 10 -J test.bam \
| awk -F '\t' '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3 }' \
| bedGraphPack /dev/stdin test.samtools_depth_MQ10_count_deletions.bedGraph
# Run SAMtools flags to get integer flag value to use with d4tools to filter out the same reads:
$ samtools flags UNMAP,SECONDARY,QCFAIL,DUP
0x704 1796 UNMAP,SECONDARY,QCFAIL,DUP
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags to make
# sure that d4tools uses the same reads as samtools depth
# -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph
# No differences:
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph |
|
Unrelated to this patch, but it seems to me that d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data: # Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry of a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compresssed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# without compression: sometimes writes a wrong depth value as last entry for a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph# With compressed output samtools depth and d4tools give the same.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# With uncompressed output samtools depth and d4tools give the same.
# Only for 3 chromosomes out of 1870 chromosomes this problem appears.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
--- test.samtools_depth_MQ10_count_deletions.bedGraph 2025-01-20 11:37:33.000000000 +0100
+++ test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph 2025-01-20 17:47:31.000000000 +0100
@@ -15016909,7 +15016909,8 @@
211000022280772 13654 13701 1
211000022280772 13701 13713 0
211000022280772 13713 13762 1
-211000022280772 13762 15417 0
+211000022280772 13762 13763 0
+211000022280772 13763 15417 2
211000022280659 0 594 0
211000022280659 594 642 1
211000022280659 642 703 0
@@ -15017355,7 +15017356,8 @@
211000022279098 12261 12270 3
211000022279098 12270 12289 2
211000022279098 12289 12310 1
-211000022279098 12310 12368 0
+211000022279098 12310 12311 0
+211000022279098 12311 12368 2
211000022280761 0 75 0
211000022280761 75 123 1
211000022280761 123 1433 0
@@ -15030169,7 +15030171,8 @@
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
-211000022280548 1483 2618 0
+211000022280548 1483 1484 0
+211000022280548 1484 2618 2
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
@@ -15043475,7 +15043478,8 @@
211000022279504 1843 1850 15
211000022279504 1850 1854 14
211000022279504 1854 1855 12
-211000022279504 1855 2317 0
+211000022279504 1855 1856 0
+211000022279504 1856 2317 2
211000022279505 0 44 0
211000022279505 44 92 1
211000022279505 92 114 0
When running d4tools create only for those chromosomes, it does not seem to happen. When adding a bunch of other chromosomes, sometimes it gives the correct output and sometimes the wrong output, so I assume it has something to do with a reuse of a buffer that does not get cleared properly (and due to threading, it does not always contain the same value): uncompressed_output_instability () {
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools create -q 10 -F '~1796' test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 -f '^2R|^211000022[0-9]{6}' > /dev/null
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 \
| rg -A 5 211000022280548 \
| tail
} |
|
@brentp Can you take a look? |
|
thank you! this looks good to me |
|
Nice @ghuls ! I'll give this a test run |
|
Works well in my hands for the example in the issue you provided. Setting up the test data - genome 30 bp and read 21 bp. Aligning and indexing. Checking samtools depth. Creating d4tools using the current version and the fixed version. Checking the depths Nice 👌 I agree that it would be great to add a unit test for this! |
|
Also before (or at least with #76), if your read $ bwa mem -k 2 -T 0 genome_mini.fa fake_reads.fa | samtools sort -o out.bam -
$ samtools index out.bam
$ d4tools create -q 0 out.bam out.d4tools_current.d4
$ d4tools create -q 0 out.bam out.d4tools_ghuls.d4
$ d4tools view out.d4tools_current.d4 out.d4tools_current.bdg
$ d4tools view out.d4tools_ghuls.d4 out.d4tools_ghuls.bdg
$ samtools depth -aa out.bam | awk '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3}' | bedGraphPack /dev/stdin out.samtools_depth.bdg
$ head -n 50 out.d4tools_current.bdg out.d4tools_ghuls.bdg out.samtools_depth.bdg
==> out.d4tools_current.bdg <==
chr1 0 1 0
chr1 1 4 2
chr1 4 15 3
chr1 15 18 2
chr1 18 27 3
chr1 27 30 2
chr2 0 1 0
chr2 1 15 3
chr2 15 16 2
chr2 16 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 39 4
chr2 39 40 3
==> out.d4tools_ghuls.bdg <==
chr1 0 4 2
chr1 4 14 3
chr1 14 18 2
chr1 18 26 3
chr1 26 30 2
chr2 0 14 3
chr2 14 15 2
chr2 15 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 38 4
chr2 38 39 3
chr2 39 40 2
==> out.samtools_depth.bdg <==
chr1 0 4 2
chr1 4 14 3
chr1 14 18 2
chr1 18 26 3
chr1 26 30 2
chr2 0 14 3
chr2 14 15 2
chr2 15 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 38 4
chr2 38 39 3
chr2 39 40 2 |
OK, well spotted! And that looks like a nice test case. Would you be able to add this as a test to this PR @ghuls ? I think brentp would be happy to merge it after getting it tested. I.e. adding to https://github.com/38/d4-format/tree/master/d4tools/test/create Just let me know if you want help with this. This bug impacts us as well. Seems like the d4tools-devs are gone / missing, so it is kind of a community effort to keep this going. |
|
Help with adding the test would be great. |
OK, I'll look into it 👍 |
|
I put up a PR towards this PR with the two additional unit tests outlined above over at your repo: ghuls#1 The existing test bam naturally needed updating. There are still differences to the If we resolve / understand the reasons for them appearing, then I think this PR would be good to go. |
|
There was a discrepancy in filtering of the BAM reads between SAMtools and d4tools. The SAMtools default makes more sense to me. |
I see! I guess for this PR the current d4tools output is what we'll use in the tests. Good that you investigated it. I don't have write rights neither here nor to your PR. But if you think my PR looks OK over at your repo - merge it into this PR and then we'll forward it to brentp. Just let me know if you want further help. |
Fix off by one bam add tests
|
@Jakob37 Merged your tests. |
|
so this now matches samtools output? |
|
Yes, at least if you use the correct flagsfor both d4 and samtools depth as in : #97 (comment) |
Fix off by one bug when converting BAM files to d4 files.
Fixes: #75