Add support for read group and PCR-free optical-duplicate-only filtering #53
I've added a --opticalOnly option that marks only optical duplicates in PCR-free data. It uses a shortcut: reads on the same tile are considered duplicates, rather than measuring the distance between reads on the same tile. I also added a --optPlusExAmp flag that marks reads in the same lane as duplicates (this should capture both optical and ExAmp duplicates, which can occupy positions within the same lane). The read group and tile/lane support uses khash tables to keep track of a 20-bit iterator that identifies unique RG/tile number/lane number combinations. You should be able to add UMI support by adding a single method that pulls values out of the SAM attributes, similar to how the RG value is extracted here. Additional memory use is limited to the size of the new khash that tracks the RG/tile number/lane number combinations. The effect on runtime is negligible.
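To illustrate the grouping idea for reviewers, here is a minimal Python sketch, not the code in this PR (which uses khash). It assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`; the names `GroupIndex`, `lane_and_tile`, and `group_key` are hypothetical and exist only for this example. Each distinct RG/lane/tile combination gets a small integer id, capped at 20 bits, and the `lane_only` switch mimics the coarser grouping described for --optPlusExAmp.

```python
MAX_IDS = 1 << 20  # 20-bit id space, matching the description above


class GroupIndex:
    """Hands out a compact integer id per distinct key, in first-seen order."""

    def __init__(self, max_ids=MAX_IDS):
        self._ids = {}
        self._max_ids = max_ids

    def id_for(self, key):
        existing = self._ids.get(key)
        if existing is not None:
            return existing
        if len(self._ids) >= self._max_ids:
            raise OverflowError("more distinct groups than the id space allows")
        new_id = len(self._ids)
        self._ids[key] = new_id
        return new_id


def lane_and_tile(read_name):
    """Extract lane and tile, assuming instrument:run:flowcell:lane:tile:x:y."""
    fields = read_name.split(":")
    if len(fields) < 7:
        raise ValueError("unexpected read name format: " + read_name)
    return fields[3], fields[4]


def group_key(read_name, read_group, lane_only=False):
    """Build the grouping key; lane_only drops the tile for per-lane grouping."""
    lane, tile = lane_and_tile(read_name)
    if lane_only:
        return (read_group, lane)
    return (read_group, lane, tile)
```

In this sketch, two reads that share a read group, lane, and tile map to the same id, so a duplicate check only needs to compare ids rather than x/y coordinates. The extra memory is the table itself, one entry per distinct combination.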