- python 2.7 (Anaconda's python distribution comes with the required numpy and scipy libraries)
- pysam
- bowtie2
- samtools
- Ensure that both
samtoolsandbowtie2are added to path and can be called directly from bash
- Raw ChIP-seq FASTQ files and processed BAM files can be downloaded from SRA.
- Download sequencing reads in FASTQ format from SRA
- Download human prebuilt bowtie2 indices
- Human hg38
- move to the corresponding folder named
hg38_bowtie2/
- Download the human (hg38) genome assembly in FASTA format
- hg38.fa
- extract from gz files
- move to the corresponding folder named
hg38_bowtie2/
- Generate FASTA file indices
samtools faidx hg38_bowtie2/hg38.fa
- Modify the first lines of bash script
process_reads.sh- Fill the list variable
filelistwith paths to each sample name for processing. Note that the actual file paths include the sample name followed by "_1.fastq" or "_2.fastq", denoting read1 or read2 of paired-end reads, respectively. - Fill in the path to the indexed genome denoted by variable
genomepath.
- Fill the list variable
- Run the following code snippet, where
-pdenotes the number of samples to process in parallel; modify accordingly. This performs genome alignment, filtering, sorting, removal of PCR duplicates, indexing, and sample statistics output. This step takes less than one day to complete on our Intel i7-8700K, 32GB RAM desktop, though speed appears to be bottlenecked by read/writes to disk.bash process_reads.sh -p 6 - (optional) For fair comparison between different time points in a timeseries, we subset reads from all relevant samples to the sample with the fewest reads of the set. Run the following code snippet, where
-sinputs the number of mapped reads to subset for each sample; modify accordingly. This step takes less than 10 minutes.bash subset_reads.sh -p 6 -s 24400000- for 53BP1 DNA-PKcs inhibitor experiments, subsetted to 9,045,179 for both replicates.
- for γH2AX DNA-PKcs inhibitor experiments, subsetted to 13,474,505 for replicate 1 and 7,626,728 for replicate 2.
- for MRE11 DNA-PKcs inhibitor experiments, subsetted to 5,507,748 for replicate 1 and 8,057,929 for replicate 2.
- for 53BP1 timeseries experiments, subsetted to 8,481,406 for replicate 1 and 6,504,200 for replicate 2.
- for MRE11 timeseries experiments, subsetted to 2,686,969 for replicate 1 and 6,207,727 for replicate 2.
In addition to raw paired-end reads in FASTQ format, we have also uploaded pre-processed sequencing reads in BAM format to SRA. These are the output of the previous section. It is highly recommended to start from these BAM files.
- Download pre-processed paired-end reads in BAM format from SRA.
- Move the downloaded data for MRE11 and γH2AX ChIP-seq to the desired folder.
- If not already done, index the downloaded BAM files:
samtools index /path/to/output.bam
- Set
basevariable to be the path to the directory that holds the BAM files. - Create a new folder to hold the output of 53BP1 and γH2AX scripts, set its path to
bs_a. - Create a new folder to hold the output of MRE11 scripts, set its path to
bs_s. - Ensure that all file names are correct (if the BAM files were directly downloaded from SRA, they should be), then run script.