Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,6 @@ jobs:
hash -r
conda config --set always_yes yes --set changeps1 no
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels bioconda
conda config --get

Expand All @@ -48,8 +47,11 @@ jobs:
conda info -a

# Create test environment and install deps
conda create -q -n test-environment python=${{ matrix.python-version }} setuptools pip cython numpy pandas nose samtools pysam scipy pyyaml bioframe
conda create -q -n test-environment -c anaconda python=${{ matrix.python-version }} setuptools pip numpy scipy pandas nose
source activate test-environment
conda install -c conda-forge cython pyyaml
conda install -c bioconda samtools
pip install pysam bioframe # workaround: incompatible with python3.10 if installed by conda
pip install click
python setup.py build_ext -i

Expand Down
45 changes: 36 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

[![Documentation Status](https://readthedocs.org/projects/pairtools/badge/?version=latest)](http://pairtools.readthedocs.org/en/latest/)
[![Build Status](https://travis-ci.org/mirnylab/pairtools.svg?branch=master)](https://travis-ci.org/mirnylab/pairtools)
[![Join the chat at https://gitter.im/mirnylab/distiller](https://badges.gitter.im/mirnylab/distiller.svg)](https://gitter.im/mirnylab/distiller?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Join the chat at Open2C Slack](https://join.slack.com/t/open2c/shared_invite/zt-1buzt2uyt-FLGp8~~MvbTCig5sQZTSNQ)](https://join.slack.com/t/open2c/shared_invite/zt-1buzt2uyt-FLGp8~~MvbTCig5sQZTSNQ)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1490831.svg)](https://doi.org/10.5281/zenodo.1490831)

## Process Hi-C pairs with pairtools
Expand All @@ -19,9 +19,12 @@ operations:
- generate extensive statistics of Hi-C datasets
- select Hi-C pairs given flexibly defined criteria
- restore .sam alignments from Hi-C pairs
- annotate restriction digestion sites
- get the mutated positions in Hi-C pairs

To get started:
- Take a look at a [quick example](https://github.com/open2c/pairtools#quick-example)
- Visit [pairtools tutorials](https://pairtools.readthedocs.io/en/latest/examples/pairtools_walkthrough.html),
- Take a look at a [quick example](https://github.com/open2c/pairtools#quick-example),
- Check out the detailed [documentation](http://pairtools.readthedocs.io).

## Data formats
Expand All @@ -32,18 +35,23 @@ format defined by the [4D Nucleome Consortium](https://www.4dnucleome.org/). All
pairtools properly manage file headers and keep track of the data
processing history.

Additionally, `pairtools` define the .pairsam format, an extension of .pairs that includes the SAM alignments
Additionally, `pairtools` define the [.pairsam format](https://pairtools.readthedocs.io/en/latest/formats.html#pairsam), an extension of .pairs that includes the SAM alignments
of a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that
operates on .pairs files.

`pairtools` produces a set of additional extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks.
The full list of possible extra columns is provided in the [`pairtools` format specification](https://pairtools.readthedocs.io/en/latest/formats.html#extra-columns).

## Installation

Requirements:

- Python 3.x
- Python packages `cython`, `numpy` and `click`.
- Python packages `cython`, `pysam`, `bioframe`, `pyyaml`, `numpy`, `scipy`, `pandas` and `click`.
- Command-line utilities `sort` (the Unix version), `bgzip` (shipped with `samtools`) and `samtools`. If available, `pairtools` can compress outputs with `pbgzip` and `lz4`.

For the full list of recommended versions, see [requirements in the the GitHub repo](https://github.com/open2c/pairtools/blob/detect_mutations/requirements.txt).

We highly recommend using the `conda` package manager to install `pairtools` together with all its dependencies. To get it, you can either install the full [Anaconda](https://www.continuum.io/downloads) Python distribution or just the standalone [conda](http://conda.pydata.org/miniconda.html) package manager.

With `conda`, you can install `pairtools` and all of its dependencies from the [bioconda](https://bioconda.github.io/index.html) channel.
Expand Down Expand Up @@ -91,21 +99,27 @@ Hi-C data analysis workflow, based on `pairtools` and [nextflow](https://www.nex

## Tools

- `parse`: read .sam files produced by bwa and form Hi-C pairs
- `parse`: read .sam/.bam files produced by bwa and form Hi-C pairs
- form Hi-C pairs by reporting the outer-most mapped positions and the strand
on the either side of each molecule;
- report unmapped/multimapped (ambiguous alignments)/chimeric alignments as
chromosome "!", position 0, strand "-";
- identify and rescue chrimeric alignments produced by singly-ligated Hi-C
molecules with a sequenced ligation junction on one of the sides;
- annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
- perform upper-triangular flipping of the sides of Hi-C molecules
such that the first side has a lower sorting index than the second side;
- form hybrid pairsam output, where each line contains all available data
for one Hi-C molecule (outer-most mapped positions on the either side,
read ID, pair type, and .sam entries for each alignment);
- report .sam tags or mutations of the alignments;
- print the .sam header as #-comment lines at the start of the file.

- `parse2`: read .sam/.bam files with long paired-and or single-end reads and form Hi-C pairs from complex walks
- identify and rescue chrimeric alignments produced by singly-ligated Hi-C
molecules with a sequenced ligation junction on one of the sides;
- annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
- perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;
- report index of the pair in the complex walk;
- make combinatorial expansion of pairs produced from the same walk;

- `sort`: sort pairs files (the lexicographic order for chromosomes,
the numeric order for the positions, the lexicographic order for pair types).

Expand All @@ -125,7 +139,10 @@ Hi-C data analysis workflow, based on `pairtools` and [nextflow](https://www.nex
- `dedup`: remove PCR duplicates from a sorted triu-flipped .pairs file
- remove PCR duplicates by finding pairs of entries with both sides mapped
to similar genomic locations (+/- N bp);
- optionally output the PCR duplicate entries into a separate file.
- optionally output the PCR duplicate entries into a separate file;
- detect optical duplicates from the original Illumina read ids;
- apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication;
- output yaml or convenient tsv deduplication stats into text file.
- NOTE: in order to remove all PCR duplicates, the input must contain \*all\*
mapped read pairs from a single experimental replicate;

Expand All @@ -136,10 +153,20 @@ Hi-C data analysis workflow, based on `pairtools` and [nextflow](https://www.nex

- `split`: split a .pairsam file into .pairs and .sam.

- `flip`: flip pairs to get an upper-triangular matrix

- `header`: manipulate the .pairs/.pairsam header
- generate new header for headerless .pairs file
- transfer header from one .pairs file to another
- set column names for the .pairs file
- validate that the header corresponds to the information stored in .pairs file

- `stats`: calculate various statistics of .pairs files

- `restrict`: identify the span of the restriction fragment forming a Hi-C junction

- `phase`: phase pairs mapped to a diploid genome

## Contributing

[Pull requests](https://akrabat.com/the-beginners-guide-to-contributing-to-a-github-project/) are welcome.
Expand Down
34 changes: 34 additions & 0 deletions doc/formats.rst
Original file line number Diff line number Diff line change
Expand Up @@ -148,3 +148,37 @@ Finally, sam1 and sam2 can store multiple .sam alignments, separated by a string
.. |check| unicode:: U+2714 .. check
.. |cross| unicode:: U+274C .. cross

Extra columns
----------------

`pairtools` can operate on `.pairs/.pairsam` with extra columns.
Extra columns are specified in the order defined by the order their addition by various tools.
Column names can be checked in the header of `.pairs/.pairsam` file.
We provide `pairtools header` utilities for manipulating and verifying compatibility of headers and their columns.

The list of additional columns used throughout `pairtools` modules:

=================================== =================== ====================== ================================================== =================
extra column generating module format how to add it description
=================================== =================== ====================== ================================================== =================
mapq1, mapq2 `parse/parse2` number from 0 to 255 `pairtools parse --add-columns mapq` `Mapping quality <https://sequencing.qcfail.com/articles/mapq-values-are-really-useful-but-their-implementation-is-a-mess>`_, as reported in .sam/.bam, $-10 log_{10}(P_{error})$
pos51, pos52 `parse/parse2` genomic coordinate `pairtools parse --add-columns pos5` 5' position of alignment (closer to read start)
pos31, pos32 `parse/parse2` genomic coordinate `pairtools parse --add-columns pos3` 3' position of alignment (further from read start)
cigar1, cigar2 `parse/parse2` string `pairtools parse --add-columns cigar` `CIGAR, or Compact Idiosyncratic Gapped Alignment Report <https://en.wikipedia.org/wiki/Sequence_alignment#CIGAR_Format>`_ of alignment, as reported in .sam/.bam
read_len1, read_len2 `parse/parse2` number `pairtools parse --add-columns read_len` read length
matched_bp1, matched_bp2 `parse/parse2` number `pairtools parse --add-columns matched_bp` number of matched alignment basepairs to the reference
algn_ref_span1, algn_ref_span2 `parse/parse2` number `pairtools parse --add-columns algn_ref_span` basepairs of reference covered by alignment
algn_read_span1, algn_read_span2 `parse/parse2` number `pairtools parse --add-columns algn_read_span` basepairs of read covered by alignment
dist_to_51, dist_to_52 `parse/parse2` number `pairtools parse --add-columns dist_to_5` distance to 5'-end of read
dist_to_31, dist_to_32 `parse/parse2` number `pairtools parse --add-columns dist_to_3` distance to 3'-end of read
seq1, seq2 `parse/parse2` string `pairtools parse --add-columns seq` sequence of alignment
mismatches1, mismatches2 `parse/parse2` string `pairtools parse --add-columns mismatches` comma-separated list of mismatches relative to the reference, "{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}"
XB1/2,AS1/2,XS1/2 or any sam tag `parse/parse2` `pairtools parse --add-columns XA,XB,NM` format depends on `tag specification <https://samtools.github.io/hts-specs/SAMv1.pdf>`_
walk_pair_type `parse/parse2` string `pairtools parse2 --add-pair-index` Type of the pair relative to R1 and R2 reads of paired-end sequencing, see `pasring docs <https://pairtools.readthedocs.io/en/latest/parsing.html#rescuing-complex-walks>`_
walk_pair_index `parse/parse2` number `pairtools parse2 --add-pair-index` Order of the pair in the complex walk, starting from 5'-end of left read, see `pasring docs <https://pairtools.readthedocs.io/en/latest/parsing.html#rescuing-complex-walks>`_
phase `phase` 0, 1 or "." `pairtools phase` Phase of alignment (haplotype 1, 2, on unphased), see `phasing walkthrough <https://pairtools.readthedocs.io/en/latest/examples/pairtools_phase_walkthrough.html>`_
rfrag1, rfrag2 `restrict` number `pairtools restrict` Unique index of the restriction fragment after annotating pairs positions, see `restriction walkthrough <https://pairtools.readthedocs.io/en/latest/examples/pairtools_restrict_walkthrough.html>`_
rfrag_start1, rfrag_start2 `restrict` number `pairtools restrict` Coordinate of the start of restriction fragment
rfrag_end1, rfrag_end2 `restrict` number `pairtools restrict` Coordinate of the end of restriction fragment
=================================== =================== ====================== ================================================== =================

2 changes: 1 addition & 1 deletion doc/parsing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -265,7 +265,7 @@ position:
Sometimes it is important to restore the sequence of ligation events (e.g., for MC-3C data). For that, you can add
special columns ``walk_pair_index`` and ``walk_pair_type`` by setting ``--add-pair-index`` option of ``parse2``, that will keep the order and type of pair in the whole walk in the output .pairs file.

- ``walk_pair_index`` contains information on the order of the pair in the recovered walk, starting from 5'-end of left read
- ``walk_pair_index`` contains information on the order of the pair in the complex walk, starting from 5'-end of left read
- ``walk_pair_type`` describes the type of the pair relative to R1 and R2 reads of paired-end sequencing:

- "R1-2" - unconfirmed pair, right and left alignments in the pair originate from different reads (left or right). This might be indirect ligation (mediated by other DNA fragments).
Expand Down
19 changes: 3 additions & 16 deletions pairtools/cli/parse.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import click
import sys

Expand All @@ -13,20 +14,6 @@

UTIL_NAME = "pairtools_parse"

EXTRA_COLUMNS = [
"mapq",
"pos5",
"pos3",
"cigar",
"read_len",
"matched_bp",
"algn_ref_span",
"algn_read_span",
"dist_to_5",
"dist_to_3",
"seq",
]


@cli.command()
@click.argument("sam_path", type=str, required=False)
Expand Down Expand Up @@ -96,7 +83,7 @@
help="Report extra columns describing alignments "
"Possible values (can take multiple values as a comma-separated "
"list): a SAM tag (any pair of uppercase letters) or {}.".format(
", ".join(EXTRA_COLUMNS)
", ".join(pairsam_format.EXTRA_COLUMNS)
),
)
@click.option(
Expand Down Expand Up @@ -228,7 +215,7 @@ def parse_py(
add_columns = kwargs.get("add_columns", [])
add_columns = [col for col in add_columns.split(",") if col]
for col in add_columns:
if not ((col in EXTRA_COLUMNS) or (len(col) == 2 and col.isupper())):
if not ((col in pairsam_format.EXTRA_COLUMNS) or (len(col) == 2 and col.isupper())):
raise Exception("{} is not a valid extra column".format(col))

columns = pairsam_format.COLUMNS + (
Expand Down
27 changes: 7 additions & 20 deletions pairtools/cli/parse2.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,21 +13,6 @@

UTIL_NAME = "pairtools_parse2"

EXTRA_COLUMNS = [
"mapq",
"pos5",
"pos3",
"cigar",
"read_len",
"matched_bp",
"algn_ref_span",
"algn_read_span",
"dist_to_5",
"dist_to_3",
"seq",
]


@cli.command()
@click.argument("sam_path", type=str, required=False)
# Parsing options:
Expand Down Expand Up @@ -145,9 +130,11 @@
@click.option(
"--flip/--no-flip",
is_flag=True,
default=True,
help="If specified, do not flip pairs in genomic order and instead preserve "
"the order in which they were sequenced.",
default=False,
help="If specified, flip pairs in genomic order and instead preserve "
"the order in which they were sequenced. Note that no flip is recommended for analysis of walks because it will "
"override the order of alignments in pairs. Flip is required for appropriate deduplication of sorted pairs. "
"Flip is not required for cooler cload, which runs flipping internally. ",
)
@click.option(
"--drop-readid/--keep-readid",
Expand Down Expand Up @@ -192,7 +179,7 @@
help="Report extra columns describing alignments "
"Possible values (can take multiple values as a comma-separated "
"list): a SAM tag (any pair of uppercase letters) or {}.".format(
", ".join(EXTRA_COLUMNS)
", ".join(pairsam_format.EXTRA_COLUMNS)
),
)
@click.option(
Expand Down Expand Up @@ -281,7 +268,7 @@ def parse2_py(
add_columns = kwargs.get("add_columns", [])
add_columns = [col for col in add_columns.split(",") if col]
for col in add_columns:
if not ((col in EXTRA_COLUMNS) or (len(col) == 2 and col.isupper())):
if not ((col in pairsam_format.EXTRA_COLUMNS) or (len(col) == 2 and col.isupper())):
raise Exception("{} is not a valid extra column".format(col))

columns = pairsam_format.COLUMNS + (
Expand Down
15 changes: 15 additions & 0 deletions pairtools/lib/pairsam_format.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,3 +62,18 @@
UNMAPPED_STRAND = "-"

UNANNOTATED_RFRAG = -1

EXTRA_COLUMNS = [
"mapq",
"pos5",
"pos3",
"cigar",
"read_len",
"matched_bp",
"algn_ref_span",
"algn_read_span",
"dist_to_5",
"dist_to_3",
"seq",
"mismatches" # Format: "{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}"
]
Loading
Loading