@agalitsyna
Member

@agalitsyna agalitsyna commented Jul 1, 2022

This PR introduces several substantial changes that improve the usability of pairtools.

1. Mismatches reporting

I used the "MD" field of the SAM file and added an option to extract mismatches from the alignment pairs. They are reported as an additional column, "mismatches", in parse/parse2. With the help of Anton's scsHi-C code and the pysam engine for parsing mismatches, this turned out to be rather simple.
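For readers unfamiliar with the MD tag, here is a minimal pure-Python sketch of how mismatch positions can be recovered from it. This is not the code in this PR (which relies on the pysam engine); the function name `mismatches_from_md` is hypothetical, and the sketch assumes an alignment without insertions or soft clipping.

```python
import re

# One MD token is either a run of matches (digits), a deletion from the
# reference (^ followed by bases), or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def mismatches_from_md(md, read_seq, ref_start=0):
    """Return (ref_base, read_base, ref_pos, read_pos) for each mismatch.

    Assumption: the aligned read has no insertions or soft clips, so the
    read offset advances together with the reference except over deletions.
    """
    mismatches = []
    ref_pos = ref_start
    read_pos = 0
    for match_len, deletion, ref_base in MD_TOKEN.findall(md):
        if match_len:
            # Matching bases: advance both coordinates.
            ref_pos += int(match_len)
            read_pos += int(match_len)
        elif deletion:
            # Bases present only in the reference: advance the reference.
            ref_pos += len(deletion)
        else:
            # A single mismatch: record it and advance both coordinates.
            mismatches.append(
                (ref_base.upper(), read_seq[read_pos], ref_pos, read_pos)
            )
            ref_pos += 1
            read_pos += 1
    return mismatches
```

Handling insertions and clipping requires the CIGAR string as well, which is one reason to delegate this to pysam in practice.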

For now, the user can request that mismatches be stored as a separate column of the .pairs file in a comprehensive format: "{ref_letter}:{mutated_letter}:{phred}:{ref_position}:{read_position}" (multiple mismatches are listed separated by commas).
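A value in this format can be split back into structured records with a few lines of Python. The sketch below is illustrative only: `Mismatch` and `parse_mismatches` are hypothetical names, and it assumes that phred and both positions are integers and that an empty string means "no mismatches".

```python
from typing import List, NamedTuple


class Mismatch(NamedTuple):
    ref_letter: str
    mut_letter: str
    phred: int          # assumed to be an integer quality score
    ref_position: int   # assumed to be an integer coordinate
    read_position: int  # assumed to be an integer offset in the read


def parse_mismatches(field: str) -> List[Mismatch]:
    """Parse one value of the mismatches column into a list of records."""
    if not field:
        return []
    records = []
    for item in field.split(","):
        ref, mut, phred, ref_pos, read_pos = item.split(":")
        records.append(
            Mismatch(ref, mut, int(phred), int(ref_pos), int(read_pos))
        )
    return records


# Made-up example value, not taken from real pairtools output.
example = "A:G:37:1052:14,C:T:40:1077:39"
for record in parse_mismatches(example):
    print(record)
```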

This column can, in principle, be converted into three important types of data: 1. the number of converted pairs per alignment/pair (needed for scsHi-C); 2. nucleotide variants in your Hi-C genome; 3. mutated positions in the read (might be useful for Methyl-Hi-C and related methods).
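As a rough illustration of the first kind of derived data, the sketch below counts substitution types in a single column value, optionally ignoring low-quality bases. The function name `count_substitutions`, the `min_phred` threshold, and the example value are assumptions made for illustration; which substitution type matters depends on the protocol.

```python
from collections import Counter


def count_substitutions(field, min_phred=0):
    """Count (ref_letter, mut_letter) occurrences in one mismatches value.

    Assumption: phred is an integer and an empty value means no mismatches.
    """
    counts = Counter()
    for item in filter(None, field.split(",")):
        ref, mut, phred, _ref_pos, _read_pos = item.split(":")
        if int(phred) >= min_phred:
            counts[(ref, mut)] += 1
    return counts


# Made-up example value with two G>A substitutions and one C>T.
field = "G:A:30:100:5,G:A:12:140:45,C:T:38:200:10"
print(count_substitutions(field))                # all mismatches
print(count_substitutions(field, min_phred=20))  # high-quality ones only
```

Applied per row of a .pairs file, such a counter gives a per-pair conversion count; aggregated by reference position instead, the same records could feed a simple variant tally.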

Example output:

image

This feature, although it does not produce any specific analysis, is potentially very powerful. The column with mutations can be used in downstream analysis as is, although we may want to design more specific functions for pairtools in the future.

As you can see, the code supporting this feature is small and easy to maintain.

2. Docs improvements

  • There was no description of the additional columns produced by various pairtools modules. I added a summary table of extra columns to the formats docs.

  • More cross-references between docs and tutorials

3. Python 3.10 support in tests

  • Tests work with Python 3.10 now

4. parse2

  • Flipping is off by default for parse2; we've added an explanation of this decision.

Previously, the tests did not work with Python 3.10 because neither pysam nor bioframe supported some packages from conda's Python 3.10. A workaround is to install them separately through pip, which does not have these requirements.

@agalitsyna agalitsyna requested review from Phlya and golobor July 3, 2022 19:13
@golobor
Member

golobor commented Jul 4, 2022

Wow, this is really great!!! This feature is indeed essential for many protocols - some even organize haplotype-resolved Hi-C based on such a mapping approach. And the amount of new code is surprisingly tiny. Super nice!!
A couple of questions:
(a) if I understand correctly, currently parse always executes get_mismatches_c? This may potentially be a bit costly, right? One alternative would be to only calculate it when users specify --add-cols mismatches.
(b) is this feature available both in parse and parse2?..

@agalitsyna
Member Author

agalitsyna commented Jul 4, 2022

Thanks!
(a) Good catch, it makes sense to run it only if the additional column with mismatches is requested.
(b) Yes, it's available for both, although for parse2 only mutations from the left-sided alignment will be reported (for the case of readthrough, see these lines). This is the same way it works for SAM tags and other alignment properties reported for complex walks. We have not decided on any voting scheme for readthroughs, and I'm not sure it needs to be addressed in more detail for now.

What do you think of the mismatch format? The current one is rather lengthy, but seems to be comprehensive: one mismatch is "{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}", and multiple mismatches are reported as a comma-separated list.

@agalitsyna
Member Author

I will merge it for now because it would be great to start seeing the docs updates. If there are suggestions on how to improve mutation reporting, it would be great to have them submitted separately!

@agalitsyna agalitsyna merged commit 1ad161f into master Jul 11, 2022
@agalitsyna agalitsyna deleted the detect_mutations branch June 16, 2025 19:32
@agalitsyna agalitsyna restored the detect_mutations branch June 16, 2025 19:32
@agalitsyna agalitsyna deleted the detect_mutations branch June 16, 2025 19:35