Column with annotated mismatched in parse/parse2 #134
…s described in the main page
Wow, this is really great!!! This feature is indeed essential for many protocols - some even organize haplotype-resolved Hi-C based on such a mapping approach. And the amount of new code is surprisingly tiny. Super nice!!
Thanks! What do you think of the mismatches format? The current one is rather lengthy but seems comprehensive: one mismatch is "{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}", and multiple mismatches are reported as a comma-separated list.
I will merge it for now because it would be great to start seeing the docs updates. If there are suggestions on how to improve mutation reporting, it would be great to have them submitted separately!
This PR introduces several substantial changes that improve the usability of pairtools.
1. Mismatches reporting
I used the "MD" field of the SAM file and added an option to extract mismatches from the alignment pairs. They are reported as an additional column, "mismatches", by parse/parse2. With the help of Anton's code for scsHi-C and the pysam engine for parsing mismatches, this turned out to be rather simple.
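As a rough illustration of how such a column can be derived (this is not the code in this PR), here is a minimal sketch. It assumes the input has the shape returned by pysam's `AlignedSegment.get_aligned_pairs(with_seq=True)`, which needs the MD tag and marks substitutions with a lower-case reference base. The function name `format_mismatches`, the 0-based positions, and the integer phred values are assumptions made for the example.

```python
def format_mismatches(aligned_pairs, query_sequence, query_qualities):
    """Build a comma-separated mismatch string from pysam-style aligned pairs.

    aligned_pairs: iterable of (read_pos, ref_pos, ref_base) tuples, shaped like
    the output of pysam's AlignedSegment.get_aligned_pairs(with_seq=True),
    where a lower-case ref_base marks a substitution.
    Positions are kept 0-based here (an assumption of this sketch).
    """
    records = []
    for read_pos, ref_pos, ref_base in aligned_pairs:
        # Insertions, deletions and clipped bases have None on one side: skip them.
        if read_pos is None or ref_pos is None or ref_base is None:
            continue
        if ref_base.islower():
            records.append(
                f"{ref_base.upper()}:{query_sequence[read_pos]}:"
                f"{query_qualities[read_pos]}:{ref_pos}:{read_pos}"
            )
    return ",".join(records)


# Toy input written by hand; with pysam it would come from an AlignedSegment.
pairs = [(0, 100, "A"), (1, 101, "c"), (2, None, None), (3, 102, "G"), (4, 103, "t")]
seq = "ATAGC"
quals = [30, 35, 20, 40, 12]
example = format_mismatches(pairs, seq, quals)
print(example)  # C:T:35:101:1,T:C:12:103:4
```

With a real alignment, the three arguments would be `read.get_aligned_pairs(with_seq=True)`, `read.query_sequence` and `read.query_qualities`.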
For now, the user can request that mismatches be stored as a separate column of the .pairs file in a comprehensive format: "{ref_letter}:{mutated_letter}:{phred}:{ref_position}:{read_position}" (all mutations are listed, separated by commas).
This column can, in principle, be converted into three important types of data: 1. number of converted pairs per alignment/pair (needed for scsHi-C); 2. nucleotide variants in your Hi-C genome; 3. mutated positions in the read (might be useful for Methyl-Hi-C and related protocols).
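To make the three uses listed above concrete, here is a hedged sketch of parsing one value of the "mismatches" column downstream. The helper names (`Mismatch`, `parse_mismatches`, `count_substitutions`) are hypothetical and not part of pairtools. The sketch assumes that phred is written as an integer and that an alignment without mismatches has an empty field.

```python
from collections import Counter
from typing import NamedTuple


class Mismatch(NamedTuple):
    ref_letter: str
    mut_letter: str
    phred: int
    ref_position: int
    read_position: int


def parse_mismatches(field):
    """Split a 'ref:mut:phred:ref_pos:read_pos,...' string into Mismatch records."""
    if not field:  # assumption: no mismatches -> empty field
        return []
    records = []
    for item in field.split(","):
        ref, mut, phred, ref_pos, read_pos = item.split(":")
        records.append(Mismatch(ref, mut, int(phred), int(ref_pos), int(read_pos)))
    return records


def count_substitutions(mismatches, ref_letter, mut_letter, min_phred=0):
    """Count substitutions of one type, optionally filtered by base quality."""
    return sum(
        1
        for m in mismatches
        if m.ref_letter == ref_letter
        and m.mut_letter == mut_letter
        and m.phred >= min_phred
    )


field = "C:T:35:101:1,T:C:12:103:4,C:T:38:150:50"
parsed = parse_mismatches(field)

# 1. number of conversions of a given type in this alignment
n_c_to_t = count_substitutions(parsed, "C", "T", min_phred=20)
# 2. candidate variants keyed by reference position
variants = {m.ref_position: (m.ref_letter, m.mut_letter) for m in parsed}
# 3. mutated positions within the read
read_positions = [m.read_position for m in parsed]
# Summary of substitution types
substitution_types = Counter((m.ref_letter, m.mut_letter) for m in parsed)
```

Keeping the parsed records as named tuples makes it cheap to add other filters later, for example by read position.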
Example output:
Although this feature does not perform any specific analysis itself, it is potentially very powerful. The column with mutations can be used in downstream analysis as is, although we may want to design more specific functions for pairtools in the future.
As you can see, the code needed for this feature is tiny and easy to maintain.
2. Docs improvements
There was no description of the additional columns produced by various modules of pairtools. I added a summary table of extra columns to the formats docs.
More cross-references between docs and tutorials
3. Python 3.10 support by tests
Previously, the tests did not work with Python 3.10 because neither pysam nor bioframe supported some packages from conda's Python 3.10. A workaround is to install them separately through pip, which does not have these requirements.
4. parse2