Eun Ji Kim edited this page Jul 20, 2018 · 5 revisions

  1. Installation
    a. Installing PORT
    b. Installing sam2cov
    c. Installing samtools
  2. Input requirements
    a. Input files
    b. Input directory structure
    c. Configuration file

1. Installation

a. Installing PORT

b. Installing sam2cov (optional)

You can use sam2cov to create coverage files and upload them to a Genome Browser.

Currently, sam2cov supports only reads aligned with RUM, STAR or GSNAP. It supports stranded data, but it assumes that the reverse read is in the same orientation as the transcripts/genes (sense).

You can download sam2cov from https://github.com/khayer/sam2cov

Please make sure you have the latest version of sam2cov.

c. Installing samtools

You can download samtools from http://samtools.sourceforge.net/


2. Input requirements

a. Input files

i. Raw sequence reads
  • Raw sequence reads need to be provided as input in FASTA or FASTQ format.
  • PORT expects one file per sample for single-end data and two files per sample for paired-end data.
  • FASTA/FASTQ files can be gzipped.
ii. Alignment files
  • Alignment files need to be provided as input in SAM or BAM format.
  • Required tags: IH (or NH) and HI.
  • PORT expects one alignment file per sample.
  • Paired-end data: mated alignments need to be in adjacent lines.
    • Aligner options to use for PORT compatibility:
      • STAR v2.5.1a or newer: use the "--outSAMunmapped Within KeepPairs" option. For BAM output, use "--outSAMtype BAM Unsorted".
      • GSNAP 2015-12-31.v6 or newer: use the "-A sam", "--ordered" and "--add-paired-nomappers" options.
  • Paired-end data: both reads of a pair need to have the same read ID.
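
The alignment requirements above can be spot-checked before starting PORT. The following is a minimal sketch, not part of PORT: the function name check_sam_records is hypothetical, it assumes uncompressed SAM text, it only expects the tags on mapped records (an assumption), and it treats every two consecutive records as one mate pair, which may be too strict for files containing supplementary alignments.

```python
def check_sam_records(lines, paired=False):
    """Return a list of problems found in an iterable of SAM text lines.

    Hypothetical helper, not part of PORT. Checks that mapped records carry
    IH (or NH) and HI, and, for paired-end data, that consecutive records
    share the same read ID.
    """
    problems = []
    pending = None  # (line number, read ID) of a record waiting for its mate
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        read_id, flag = fields[0], int(fields[1])
        tags = {field[:2] for field in fields[11:]}  # optional fields start at column 12
        mapped = not flag & 0x4  # assumption: only mapped records need the tags
        if mapped and not ({"NH", "IH"} & tags):
            problems.append(f"line {lineno}: {read_id} has neither IH nor NH")
        if mapped and "HI" not in tags:
            problems.append(f"line {lineno}: {read_id} has no HI tag")
        if paired:
            if pending is None:
                pending = (lineno, read_id)
            else:
                if pending[1] != read_id:
                    problems.append(
                        f"lines {pending[0]}-{lineno}: adjacent records "
                        f"{pending[1]} and {read_id} are not mates"
                    )
                pending = None
    if pending is not None:
        problems.append(f"line {pending[0]}: {pending[1]} has no mate on the next line")
    return problems
```

For a SAM file, the function can be given an open file handle, for example check_sam_records(open("Aligned.sam"), paired=True). For BAM input, the text would first have to be produced with a tool such as samtools view -h.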
iii. Gene information file
  • A gene information file with the required suffixes needs to be provided.

    • The required suffixes are: name, chrom, strand, txStart, txEnd, exonStarts, exonEnds, name2, geneSymbol, ensemblToGeneName.value.
  • Gene-level normalization requires an ENSEMBL gene information file.

  • Ensembl gene info files for mm9, mm10, hg19, hg38, dm3 and danRer7 are available in the Normalization/norm_scripts directory:

     mm9: /path/to/Normalization/norm_scripts/mm9_ensGenes.txt  
     mm10: /path/to/Normalization/norm_scripts/Mus_musculus.GRCm38.84.PORT_geneinfo.txt  
     hg19: /path/to/Normalization/norm_scripts/hg19_ensGenes.txt  
     hg38: /path/to/Normalization/norm_scripts/Homo_sapiens.GRCh38.84.PORT_geneinfo.txt  
     dm3: /path/to/Normalization/norm_scripts/dm3_ensGenes.txt  
     danRer7: /path/to/Normalization/norm_scripts/danRer7_ensGenes.txt  
    
  • A script to convert an ENSEMBL gtf file to a gene information file is available in the Normalization/norm_scripts directory.

    • /path/to/Normalization/norm_scripts/convert_gtf_to_PORT_geneinfo.transcripts.pl
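
A custom gene information file can be checked against the required suffixes listed above before it is passed to PORT. This is a minimal sketch under stated assumptions: the file is assumed to be tab-delimited with a single header line whose column names end with the suffixes, and the function name missing_suffixes is hypothetical rather than something PORT provides.

```python
# Suffixes taken from the list of required suffixes in this section.
REQUIRED_SUFFIXES = [
    "name", "chrom", "strand", "txStart", "txEnd",
    "exonStarts", "exonEnds", "name2", "geneSymbol",
    "ensemblToGeneName.value",
]

def missing_suffixes(header_line):
    """Return the required suffixes that no header column ends with.

    Hypothetical helper: assumes a tab-delimited header line, optionally
    starting with '#'.
    """
    columns = [c.strip() for c in header_line.lstrip("#").rstrip("\n").split("\t")]
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not any(column.endswith(suffix) for column in columns)
    ]
```

Calling missing_suffixes on the first line of the file returns an empty list when every required suffix is present, and otherwise lists the suffixes that still need a matching column.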
iv. Genome fa/fai
  • The description line (the header line that begins with ">") MUST begin with a chromosome name that matches the chromosome names in the gene information file.

Please check and modify the file appropriately before starting PORT.

  • You can generate the index file (*.fai) with samtools:
    • samtools faidx <ref.fa>
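
The chromosome-name requirement above can be checked with a few lines of code. The sketch below is illustrative only: both function names are hypothetical, the chromosome name is assumed to be the first whitespace-delimited token after ">", and the chromosome names from the gene information file are assumed to have been collected separately (for example from its chrom column).

```python
def fasta_chrom_names(fasta_lines):
    """Collect the first token of every '>' description line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }

def chroms_missing_from_genome(fasta_lines, gene_chroms):
    """Return gene-file chromosome names that have no matching FASTA header."""
    return sorted(set(gene_chroms) - fasta_chrom_names(fasta_lines))
```

An empty result means every chromosome named in the gene information file has a matching description line in the genome file. Any names that are returned point to headers that need to be modified before starting PORT.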

b. Input directory structure

The input files need to be organized into a specific directory structure for PORT to run properly.

  • Give the STUDY directory a unique name.
  • Sample directories (Sample_1, Sample_2, etc) can have any name.
  • Make sure the raw sequence reads and alignment files (SAM/BAM files) are in each sample directory inside the READS folder.
  • All alignment files MUST have the same name across samples.

Example:

STUDY
└── READS
    ├── Sample_1
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    ├── Sample_2
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    ├── Sample_3
    │   ├── Unaligned reads
    │   └── Aligned.sam/bam
    └── Sample_4
        ├── Unaligned reads
        └── Aligned.sam/bam
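
The layout shown in the example can be verified with a short script before running PORT. This is a minimal sketch, assuming the structure described above (STUDY/READS/<sample>/<alignment file>). The function name check_study_layout and the file name passed to it are illustrative and not part of PORT.

```python
from pathlib import Path

def check_study_layout(study_dir, alignment_name):
    """Return problems found under <study_dir>/READS.

    Hypothetical helper: every sample directory inside READS is expected to
    contain an alignment file with the same name, given as alignment_name.
    """
    reads = Path(study_dir) / "READS"
    if not reads.is_dir():
        return [f"{reads} is missing"]
    samples = sorted(p for p in reads.iterdir() if p.is_dir())
    problems = [] if samples else [f"{reads} contains no sample directories"]
    for sample in samples:
        if not (sample / alignment_name).is_file():
            problems.append(f"{sample.name}: {alignment_name} not found")
    return problems
```

For example, check_study_layout("STUDY", "Aligned.sam") returns an empty list when every sample directory contains a file named Aligned.sam, and otherwise names the sample directories where it is missing.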

c. Configuration file

  • Get the template_version.cfg file from /path/to/Normalization/ and follow the instructions in the config file.
    • NORMALIZATION TYPE, DATA TYPE (stranded), CLUSTER INFO, GENE INFO, rRNA, FA and FAI, DATA VISUALIZATION and CLEANUP options need to be specified.
  • PORT is designed to be run on a compute cluster. It has been tested on SGE and LSF.
  • See about_cfg.md for details.

next: How to run PORT