This tool is designed to aid in the analysis of ribosomal footprint data (i.e., riboseq) for non-model archaea and bacteria and maximize input flexibility. As a starting point, this data handles raw, sorted bam files, fasta inputs for reference genome(s) that can be one or many contigs, and gene call positions in gff3 (prodigal) output.
Note that this tool requires perfect match between the contig names in the bam files and the gff / prodigal file with gene annotations. Different tools have different sensitivities to certain characters like whitespace or colons which may result in altered contig / reference names, so best case use letters, numbers, and the underscore for contig names but if other characters must be used, ensure they are appropriate for the upstream tools.
The simplest way is to through conda, like
conda create -y -n translational_profiler python pysam=0.22.1 biopython pandas matplotlib numpy
Older versions of pysam should work but have not been tested.
Currently, all functions are in the translational_profiler.py file which can be made available to any python script as
from translational_profiler.py import *
Examples and the code used to make the figures from the manuscript are found in figures.ipynb with more full description of the usage to follow.
Much appreciation for the developers of other ribosomal footprinting tools mRibo and HRIBO whose work influenced this project.
Arguments in square brackets optional. Arguments in double-square brackets optional but likely relevant (e.g. footprint-specific options)
bamfile="PATH"- path to a bam fileannfile="PATH"- path to a annotation file (GFF3 or prodigal output), or similar pandas dataframewindow=INT- number of bases around region / genecodon_resolved=BOOL- Assign coverage to a single reference base, not all bases spanned by the mapped readoffset=INT- Default 0, offset from 5' end for assigning coverage whencodon_resolved=Truefiveprime=BOOL- Default True, count from 5' end of read (False would be 3' end, automatically flipsoffset)rpm=BOOL- apply reads per million normalization, usually True/Falsefasta_path=PATH- path to a fasta file. Deflines should match bam contig names exactly
-
metagene(bamfile, annfile, window, [[codon_resolved=False, footprint_size=24, offset=0, fiveprime=True,]] [num_missing=0, meannorm=False, maxnorm=False, rpm=False, verbose=False])- Returns two 1d np arrays: first is start, second is stop.
-
plot_metagene(covs_start, covs_stop, bg_start=None, bg_stop=None, window=None, codonlines=True, axis_units=None, colors=['black','#dddddddd'], bg_fill=True)bg_fillis bool for whether background is a trace (good for foots / single-nt) or filled (default)
-
plot_genes(query, bamfile, annfile, footprint_size, txfile=None, window=50, offset=0, rpm=False, codon_resolved=True, cols=['red','#919100','green','blue','black'], axis_units=False, colors=['black','#dddddddd'], bg_fill=True, norenorm=False, y1lab='Footprints', y2lab='Coverage', genelabs=False)colorscontrol foreground/background colorscolscontrol gene colorsgenelabsdefaults false to not label genes, but can be a list like['mcrA', 'mcrD']etc from left to right.
-
codonhalves(bamfile, annfile, fasta_path, defline_trim=True, codon='M', query='', minrpm=0, rpm=True, fiveprime=True, codon_resolved=False, offset=0, footprint_size=range(18,30), normalgenes=True)- Returns two arrays, first being coverages before codon and second being coverages after codon
codonis the codon searched for. i.e.Mwould find coverage in windows around allMcodonsnormalgenessubsets to genes with anMstop and a*stop (UAG is not treated as a*)
-
codonslice(x, y, window=250)- Normalize output of
codonhalves()such that the cumulative sum of eachxandypair add up to 1 windowis the range around, works best at relatively short windows
- Normalize output of
-
plotPrePostAA(residue_dict, window=100, colors={'K':'green','W':'brown','P':'purple','O':'red', 'L':'gray', 'S':'orange', 'F':'pink','R':'#7a1a00','*':'#303030'}, n_best=10000, traces=False)- Takes a dict in the format
{'M': codonslice(), 'K': ...}and creates a plot n_bestis the number of 'best' codon windows to use if there are more thann_best. 'Best' windows are those that have a high read density, i.e. reads mapping aligning to different positions within the window, i.e. cumulative coverage is not flat.
- Takes a dict in the format
readDist(bams, [verbose=False])- histogram of read lengthsread_prodigal(annfile)- pandas dataframe from GFF or prodigal file, corrects to 0-indexfeatureCounts(bamfile, annfile, [paired=False, minlength=0, maxlength=np.inf])- mimic featureCounts for rpfcheckGeneSpacing(annfile, window, [inverse=False])- subset genes based on distance (window) to neighborgeneFromBam(bamfile, contig, start, stop, d, window, [[offset=0, footprint_size=24, fiveprime=True, codon_resolved=False]])- slice gene coverage from bam by coordinatestransformCoverages(cov, d, window, [rpm=False, meannorm=False, maxnorm=False])- transform coverage reported bygeneFromBamread_fasta(fasta, trim=False)- read fasta into a dict,trim=Truecuts contig deflines at whitespace to emulate bbmap bamsadd_stop_start_to_gff(annfile, fasta_path, trim=False)- adds columns 'stop_cdn', etc for each genewhereAreX(annfile, fasta_path, codon='O', trim=False)- adds list of CODON position(s) wherecodonis found