This tutorial focuses on the manual analysis of XR-seq and Damage-seq data from raw reads to genome-wide tracks and quality control plots. Use the sections below for step-by-step commands.
Prerequisites:
- A GitHub account (for obtaining materials and sharing results). Create one at https://github.com/join.
- Linux environment is recommended. If you are not on Linux, please follow our WSL setup guide: wsl_vscode_github.pdf.
- Visual Studio Code is highly recommended (not mandatory) for the exercises and file navigation.
- Conda environments should be installed and ready-to-use (see ENVIRONMENTS.md).
Outline:
- Concepts: XR vs DS signals, strand conventions, lesion/products.
- QC: FastQC/MultiQC.
- Adapter handling: cutadapt (XR trim-and-keep; DS discard-trim).
- Alignment and BAM ops: bowtie2, samtools, duplicate handling, MAPQ filtering.
- BED conversion; chromosome/strand filtering.
- Damage-seq motif filtering at lesion positions.
- Coverage and normalization: bedGraph/bigWig per strand; quick IGV review.
- QC and analyses: length distributions, nucleotide enrichment, replicate correlations, region-level profiles (deepTools).
- Advanced: simulation-based controls with boquila.
Requirements:
- Example FASTQs in samples/; outputs under results/.
- Reference genome in ref_genome/ (this tutorial uses C. elegans WBcel235/ce11).
- Example dataset (tutorial files are 1,000,000-read subsamples):
- Damage‑seq example (WT L1 0 h (6‑4)PP Rep1): GSM8595180
- XR‑seq example (WT L1 5 min (6‑4)PP Rep1): GSM8595202
Excision Repair sequencing (XR-seq) captures the short oligomers excised by nucleotide excision repair (NER). Key features:
- Captures excision products produced by NER; signals reflect excision activity across the genome.
- Fragment lengths are protocol- and organism-dependent (e.g., often ~24–32 nt after trimming in many eukaryotes, shorter in some prokaryotes); signal is strand-aware.
- Interpreting XR-seq focuses on excision footprints and strand assignments rather than any single repair subpathway.
- Analysis goals: adapter trimming (retain trimmed reads), alignment with short-read-aware parameters, deduplication/quality filtering, convert to BED with correct strandedness, generate normalized coverage (bigWig), and compute QA metrics (length distribution, nucleotide enrichment, correlations).
Damage-seq maps DNA lesions (e.g., CPD, (6-4)PP) by capturing polymerase-arrested sites. Key features:
- Reads report lesion positions; for UV photoproducts (CPD, (6-4)PP), lesions occur at dipyrimidines; motif filtering improves specificity.
- Lesion position is typically two bases upstream of the read’s 5′ end; downstream steps center windows accordingly before motif checks.
- Adapter handling is diagnostic: presence of adapter indicates no polymerase arrest (no lesion), so discard-trim is used to keep only lesion-containing reads.
- Analysis goals: adapter detection/removal (discard-trim), alignment, high-quality unique mappings, lesion-site coordinate definition, motif filtering, normalized coverage, and QA metrics (length distributions, motif/nucleotide enrichment).
- Hu, J., Li, W., Adebali, O., et al. Genome-wide mapping of nucleotide excision repair with XR-seq. Nature Protocols 14, 248–282 (2019). A detailed, step-by-step XR-seq laboratory and analysis protocol with practical considerations and preliminary analysis examples.
  Link: https://www.nature.com/articles/s41596-018-0093-7
- Hu, J., Adebali, O., Adar, S., & Sancar, A. Dynamic maps of UV damage formation and repair for the human genome. Proceedings of the National Academy of Sciences 114(26), 6758–6763 (2017). The Damage-seq methodology paper.
  Link: https://www.pnas.org/doi/10.1073/pnas.1706522114
- Hu, J., Adar, S., Selby, C. P., Lieb, J. D. & Sancar, A. Genome-wide analysis of human global and transcription-coupled excision repair of UV damage at single-nucleotide resolution. Genes & Development 29, 948–960 (2015). Introduces single-nucleotide-resolution repair maps, separating global and transcription-coupled repair components in human cells.
  Link: https://genesdev.cshlp.org/content/29/9/948
- Adar, S., Hu, J., Lieb, J. D. & Sancar, A. Genome-wide kinetics of DNA excision repair in relation to chromatin state and mutagenesis. PNAS (2016). Quantifies excision repair kinetics genome-wide and relates repair dynamics to chromatin features and mutational patterns.
  Link: https://www.pnas.org/doi/10.1073/pnas.1614430113
- Li, W., Hu, J., et al. (2017) PNAS. Extends genome-wide mapping by integrating UV damage formation and repair dynamics to reveal determinants of repair heterogeneity.
  Link: https://www.pnas.org/doi/10.1073/pnas.1706522114
This section explains the file formats that you will encounter throughout the tutorial. For more details, see the UCSC File Format FAQ.
- FASTQ:
  - The FASTQ format is used to store both the sequencing reads and their corresponding quality scores.
  - It consists of four lines per sequence read:
    - Line 1: Begins with a '@' symbol followed by a unique identifier for the read.
    - Line 2: Contains the actual nucleotide sequence of the read.
    - Line 3: Starts with a '+' symbol and may optionally contain the same unique identifier as Line 1.
    - Line 4: Contains the quality scores corresponding to the nucleotides in Line 2.
  - FASTQ files are commonly generated by sequencing platforms and are widely used as input for various NGS data analysis tools.
- FASTA:
  - The FASTA format is a simple and widely used text-based format for representing nucleotide or protein sequences.
  - Each sequence in a FASTA file consists of two parts:
    - A single-line description or identifier that starts with a '>' symbol, followed by a sequence identifier or description.
    - The sequence data itself, which can span one or more lines.
  - The sequence lines contain the actual nucleotide or amino acid symbols representing the sequence.
  - FASTA files can store multiple sequences, each with its own identifier and sequence data.
  - FASTA format is commonly used for storing and exchanging sequences, such as reference genomes, gene sequences, or protein sequences.
  - Many bioinformatics tools and databases accept FASTA files as input for various sequence analysis tasks, including alignment, motif discovery, and similarity searches.
- SAM/BAM:
  - SAM (Sequence Alignment/Map) is a text-based format used to store the alignment information of reads to a reference genome.
  - BAM (Binary Alignment/Map) is the binary equivalent of SAM, which is more compact and allows for faster data processing.
  - Both SAM and BAM files contain the same information, including read sequences, alignment positions, mapping qualities, and optional tags for additional metadata.
  - SAM/BAM files are crucial for downstream analysis tasks such as variant calling, differential expression analysis, and visualization in genome browsers.
  - SAM/BAM files can be generated using aligners like Bowtie2 or BWA.
- BED:
  - BED (Browser Extensible Data) is a widely used format for representing genomic intervals or features.
  - It consists of a tab-separated plain-text file with columns representing different attributes of each genomic feature.
  - The first three columns (chromosome, start position, end position) are required; the start is 0-based and the end is exclusive. Optional columns include the feature name, score, and strand (columns 4 to 6), followed by further annotation columns.
  - BED files are commonly used for visualizing and analyzing genomic features such as gene coordinates, binding sites, peaks, or regulatory regions.
  - BED files can be generated from BAM files using BEDTools (bedtools bamtobed).
- BEDGRAPH:
  - BEDGRAPH is a file format used to represent continuous numerical data across the genome, such as signal intensities, coverage, or scores.
  - Similar to BED files, BEDGRAPH files are plain-text files with columns representing genomic intervals and corresponding values.
  - The standard BEDGRAPH format includes columns for chromosome, start position, end position, and a numerical value representing the data for that interval.
  - BEDGRAPH files are typically used to visualize and analyze genome-wide data, such as ChIP-seq signal, DNA methylation levels, or RNA-seq coverage.
  - The values in a BEDGRAPH file can be positive or negative, representing different aspects of the data.
  - BEDGRAPH files can be generated from BAM or BED files using BEDTools (bedtools genomecov with the -bg option).
- BigWig:
  - BigWig is a binary file format designed for efficient storage and retrieval of large-scale numerical data across the genome.
  - BigWig files are created from BEDGRAPH files and provide a compressed and indexed representation of the data, allowing for fast random access and visualization.
  - BigWig files store the data in a binary format and include additional indexing information for quick retrieval of data within specific genomic regions.
  - They are commonly used for visualizing and analyzing genome-wide data in genome browsers or for performing quantitative analyses across different genomic regions.
  - BigWig files can be generated from BEDGRAPH files using tools like bedGraphToBigWig or through conversion from other formats like BAM or WIG.
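To make two of the formats above concrete, here is a minimal Python sketch that reads FASTQ records and parses a six-column BED line. The function names (`read_fastq`, `phred33`, `parse_bed6`, `five_prime_base`) are illustrative and not part of the tutorial's scripts, the example records are made up, and Phred+33 quality encoding is assumed.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred33(quality):
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - 33 for ch in quality]


def parse_bed6(line):
    """Parse one tab-separated BED6 line into a dict with integer coordinates."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "name": name, "score": score, "strand": strand}


def five_prime_base(rec):
    """0-based coordinate of the read's 5'-most base, given its strand."""
    # BED start is inclusive and end is exclusive, so the last base is end - 1.
    return rec["start"] if rec["strand"] == "+" else rec["end"] - 1


# Made-up example data
records = list(read_fastq(io.StringIO("@read1\nTTAGC\n+\nIIII#\n")))
bed = parse_bed6("chrI\t100\t126\tread1\t42\t-")
print(records, phred33("I#"), bed["end"] - bed["start"], five_prime_base(bed))
```

The BED part matters later in the tutorial: because the end coordinate is exclusive, the interval length is simply end minus start, and the 5′ end of a minus-strand read sits at end - 1.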
If you have never used git commands on your local computer, go to your terminal and run the following.
git config --global user.email "EMAIL USED FOR GITHUB"
git config --global user.name "GITHUB USER NAME"
Clone the tutorial repository:
git clone https://github.com/CompGenomeLab/XR-DS-seq-Tutorial.git
Navigate to the cloned directory.
cd XR-DS-seq-Tutorial/
See ENVIRONMENTS.md for creating and switching between the step-specific environments used in this tutorial. After creating the environments, ensure working directories exist:
mkdir -p ref_genome/Bowtie2 results qc
This section provides commands to prepare the reference genome and generate related files. For alignment and BAM operations, activate the mapping environment:
conda activate mapping
Next, prepare the reference genome.
The compressed C. elegans (WBcel235/ce11) reference is already provided at ref_genome/GCF_000002985.6_WBcel235_genomic.fna.gz. Decompress it if needed:
gunzip -f ref_genome/GCF_000002985.6_WBcel235_genomic.fna.gz
After decompressing the genome FASTA file, generate the bowtie2 index files, which the aligner needs to search the genome efficiently.
bowtie2-build --threads 8 ref_genome/GCF_000002985.6_WBcel235_genomic.fna ref_genome/Bowtie2/genome_Celegans
Next, create a FASTA index with samtools so that sequences can be retrieved by genomic coordinates or sequence identifiers.
samtools faidx ref_genome/GCF_000002985.6_WBcel235_genomic.fna
Next, the index file created by samtools is converted to a RON file (a different format of the index). This file will be useful when we simulate our samples with boquila.
python3 scripts/idx2ron.py -i ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -o ref_genome/WBcel235.ron -l genome_idex2ron.log
Here, you'll learn how to perform quality control on the NGS data using FastQC, a tool for assessing the quality of sequencing reads. A helpful GitHub page covers the evaluation of FastQC results.
Activate the fastqc environment:
conda deactivate
conda activate fastqc
The qc/ output directory was created above; if it is missing, create it with mkdir -p qc/.
Then run FastQC on both the Damage-seq and XR-seq samples and summarize the reports with MultiQC:
fastqc -t 8 --outdir qc/ samples/celeg_ds_64.fastq.gz
fastqc -t 8 --outdir qc/ samples/celeg_xr_64.fastq.gz
multiqc qc/ -o qc/
This section demonstrates how to handle adapters in NGS data using cutadapt, a tool for trimming adapter sequences and other contaminants. Activate the trimming environment:
conda deactivate
conda activate trimming
Damage-seq and XR-seq need different adapter handling at this stage. In Damage-seq, discard reads containing the adapter (its presence indicates no lesion). In XR-seq, trim adapters and keep the trimmed reads.
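The difference between the two modes can be sketched in a few lines of Python. This is a toy illustration with exact string matching and shortened, made-up adapter prefixes; cutadapt itself also tolerates mismatches and partial adapter occurrences, which this sketch does not attempt to reproduce. The function names are hypothetical.

```python
def trim_three_prime(read, adapter):
    """XR-seq style: cut the 3' adapter and everything after it, keep the read."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


def keep_if_no_adapter(read, adapter):
    """Damage-seq style: return the read unchanged, or None if it has the adapter."""
    return None if adapter in read else read


# Shortened adapter prefixes, for illustration only
XR_ADAPTER = "TGGAATTC"
DS_ADAPTER = "GACTGGTT"

xr_reads = ["ACGTACGTTT" + XR_ADAPTER + "GG", "ACGTACGT"]
ds_reads = [DS_ADAPTER + "ACGTTTGCA", "TTGCAACGT"]

xr_kept = [trim_three_prime(r, XR_ADAPTER) for r in xr_reads]
ds_kept = [r for r in ds_reads if keep_if_no_adapter(r, DS_ADAPTER) is not None]
print(xr_kept)  # both XR reads survive, the first one trimmed
print(ds_kept)  # only the adapter-free Damage-seq read survives
```

The cutadapt commands that follow apply the same two policies to the real samples: `--discard-trimmed` for Damage-seq and plain trimming for XR-seq.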
cutadapt -j 8 -g GACTGGTTCCAATTGAAAGTGCTCTTCCGATCT --discard-trimmed -o results/ds.trim.fastq.gz samples/celeg_ds_64.fastq.gz
cutadapt -j 8 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTNNNNNNACGATCTCGTATGCCGTCTTCTGCTTG -o results/xr.trim.fastq.gz samples/celeg_xr_64.fastq.gz
This section covers mapping the preprocessed reads to the reference genome with Bowtie2, converting the alignments to BAM format, and extracting BED files.
Activate the mapping environment (includes bowtie2, samtools, picard):
conda deactivate
conda activate mapping
First, align the reads to the reference genome using the prepared index files, then convert the output SAM files to BAM.
bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_Celegans -U results/ds.trim.fastq.gz -S results/ds.sam
samtools view -Sbh -o results/ds.bam results/ds.sam
bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_Celegans -U results/xr.trim.fastq.gz -S results/xr.sam
samtools view -Sbh -o results/xr.bam results/xr.sam
In the next part, we will remove the duplicate reads with picard.
Picard MarkDuplicates expects coordinate-sorted input (the sort order is recorded in the BAM header), so sort the files first and then run picard.
samtools sort -o results/ds.sort.bam results/ds.bam -@ 8 -T results/
samtools sort -o results/xr.sort.bam results/xr.bam -@ 8 -T results/
picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/ds.sort.bam --TMP_DIR results/ --OUTPUT results/ds.dedup.bam --METRICS_FILE results/ds.dedup.metrics.txt
picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/xr.sort.bam --TMP_DIR results/ --OUTPUT results/xr.dedup.bam --METRICS_FILE results/xr.dedup.metrics.txt
Lastly, we remove low-quality alignments (MAPQ < 20) with samtools and use bedtools to convert the BAM files into BED format.
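For intuition about the threshold: the SAM specification defines MAPQ as -10·log10 of the probability that the mapping position is wrong, so MAPQ 20 corresponds to a nominal 1% error probability. The sketch below only works through that arithmetic; the function names are illustrative, and aligner-reported MAPQ values (including bowtie2's) are estimates rather than exact probabilities.

```python
import math


def mapq_to_error_prob(mapq):
    """Nominal probability that the alignment position is wrong."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(prob):
    """Inverse conversion, rounded to the nearest integer as MAPQ is stored."""
    return round(-10 * math.log10(prob))


def passes_filter(mapq, min_mapq=20):
    """Mirror `samtools view -q 20`: keep alignments with MAPQ >= threshold."""
    return mapq >= min_mapq


for q in (0, 10, 20, 30, 42):
    print(q, mapq_to_error_prob(q), passes_filter(q))
```

The samtools commands that follow apply this cutoff to the deduplicated BAM files.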
samtools index results/ds.dedup.bam
samtools view -q 20 -b results/ds.dedup.bam > results/ds.q20.bam
samtools index results/xr.dedup.bam
samtools view -q 20 -b results/xr.dedup.bam > results/xr.q20.bam
# Bedops env for BED conversion
conda deactivate
conda activate bedops
bedtools bamtobed -i results/ds.q20.bam > results/ds.bed
bedtools bamtobed -i results/xr.q20.bam > results/xr.bed
After obtaining the BED files, sort the reads by genomic coordinates.
sort -k1,1 -k2,2n -k3,3n results/ds.bed > results/ds.sorted.bed
sort -k1,1 -k2,2n -k3,3n results/xr.bed > results/xr.sorted.bed
Reads aligned to sequences other than chromosomes I-V and X are filtered out, and the rest are kept for further analysis.
grep "^chr" results/ds.sorted.bed | grep -v "chrDiscard" > results/ds.sorted.chr.bed
grep "^chr" results/xr.sorted.bed | grep -v "chrDiscard" > results/xr.sorted.chr.bed
The reads in the BED files are then separated by the strand they mapped to.
awk '{if($6=="+"){print}}' results/ds.sorted.chr.bed > results/ds.sorted.chr.plus.bed
awk '{if($6=="-"){print}}' results/ds.sorted.chr.bed > results/ds.sorted.chr.minus.bed
awk '{if($6=="+"){print}}' results/xr.sorted.chr.bed > results/xr.sorted.chr.plus.bed
awk '{if($6=="-"){print}}' results/xr.sorted.chr.bed > results/xr.sorted.chr.minus.bed
Due to the experimental protocol of Damage-seq, the damaged dipyrimidine is located at the two bases immediately upstream of each read's 5′ end. We use bedtools flank and bedtools slop to obtain 10-nucleotide windows with the damaged nucleotides at the 5th and 6th positions.
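The interval arithmetic behind the 10-nt window can be checked with a small Python sketch. It assumes BED coordinates (0-based start, exclusive end) and that flank and slop are run without strand awareness, as in this tutorial, so left and right refer to genome coordinates. The function names are illustrative, and clipping at chromosome ends is only approximated.

```python
def lesion_window(start, end, strand, chrom_size):
    """Return (start, end) of the 10-nt window around the lesion for one read."""
    if strand == "+":
        # flank -l 6 -r 0 gives [start-6, start); slop -r 4 extends to start+4
        win_start, win_end = start - 6, start + 4
    else:
        # flank -l 0 -r 6 gives [end, end+6); slop -l 4 extends back to end-4
        win_start, win_end = end - 4, end + 6
    # Clip to the chromosome, as bedtools does using the genome file
    return max(win_start, 0), min(win_end, chrom_size)


def lesion_positions(start, end, strand):
    """0-based genomic coordinates of window positions 5 and 6, read 5'->3'."""
    if strand == "+":
        return (start - 2, start - 1)
    # On the minus strand the window is read right to left
    return (end + 1, end)


# A made-up 26-nt read at chrI:100-126
print(lesion_window(100, 126, "+", 1000), lesion_positions(100, 126, "+"))
print(lesion_window(100, 126, "-", 1000), lesion_positions(100, 126, "-"))
```

In both cases positions 5 and 6 of the window are the two bases just upstream of the read's 5′ end on its own strand. The bedtools commands that follow perform the same arithmetic for every read.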
cut -f1,2 ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai > ref_genome/sizes.chrom
bedtools flank -i results/ds.sorted.chr.plus.bed -g ref_genome/sizes.chrom -l 6 -r 0 > results/ds.plus.flank.bed
bedtools flank -i results/ds.sorted.chr.minus.bed -g ref_genome/sizes.chrom -l 0 -r 6 > results/ds.minus.flank.bed
bedtools slop -i results/ds.plus.flank.bed -g ref_genome/sizes.chrom -l 0 -r 4 > results/ds.plus.10.bed
bedtools slop -i results/ds.minus.flank.bed -g ref_genome/sizes.chrom -l 4 -r 0 > results/ds.minus.10.bed
cat results/ds.plus.10.bed results/ds.minus.10.bed > results/ds.10.bed
To convert BED files to FASTA format:
bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.plus.10.bed -fo results/ds.plus.10.fa -s
bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.minus.10.bed -fo results/ds.minus.10.fa -s
Because UV lesions form at dipyrimidines, positions 5 and 6 of each 10-nt Damage-seq window are expected to be C or T. Use this as an additional filter to remove reads that do not meet this criterion.
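The motif test itself is a regular expression. The sketch below reuses the pattern passed with `-r` in the commands that follow and applies it to made-up 10-nt windows. Whether `scripts/fa2bedByChoosingReadMotifs.py` anchors the pattern to the whole sequence is an assumption here; the sketch uses `fullmatch`, so windows shorter than 10 nt are rejected.

```python
import re

# Same pattern as the -r argument: any 4 nt, two pyrimidines, any 4 nt
DIPYRIMIDINE_AT_5_6 = re.compile(r".{4}(c|t|C|T){2}.{4}")


def has_dipyrimidine(window_seq):
    """True if positions 5 and 6 of a 10-nt window are both C or T."""
    return DIPYRIMIDINE_AT_5_6.fullmatch(window_seq) is not None


# Made-up windows: the lesion site is the middle pair
windows = ["ACGTTTACGT", "ACGTCtACGT", "ACGTAGACGT", "ACGTTAACGT", "ACGTTT"]
kept = [w for w in windows if has_dipyrimidine(w)]
print(kept)
```

Lowercase letters are accepted because reference FASTA files often soft-mask repeats in lowercase.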
python3 scripts/fa2bedByChoosingReadMotifs.py -i results/ds.plus.10.fa -o results/ds.dipy.plus.bed -r '.{4}(c|t|C|T){2}.{4}'
python3 scripts/fa2bedByChoosingReadMotifs.py -i results/ds.minus.10.fa -o results/ds.dipy.minus.bed -r '.{4}(c|t|C|T){2}.{4}'
cat results/ds.dipy.plus.bed results/ds.dipy.minus.bed > results/ds.dipy.bed
To assess whether read lengths match expectations for XR-seq or Damage-seq, compute the read length distribution.
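As a cross-check on the shell pipeline, the same length table can be computed in Python. This is a minimal sketch with a hypothetical function name and made-up BED lines; it relies on the BED convention that length equals end minus start.

```python
from collections import Counter


def length_distribution(bed_lines):
    """Return sorted (length, count) pairs from an iterable of BED lines."""
    counts = Counter()
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[2]) - int(fields[1])] += 1
    return sorted(counts.items())


# Made-up reads: two of 24 nt and one of 26 nt
example = [
    "chrI\t100\t124\tr1\t42\t+",
    "chrI\t200\t226\tr2\t42\t-",
    "chrII\t50\t74\tr3\t42\t+",
]
for length, count in length_distribution(example):
    print(f"{length}\t{count}")
```

The output has the same two-column layout (length, then count) that the awk commands that follow write to results/*.len.txt.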
awk '{print $3-$2}' results/ds.sorted.chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/ds.len.txt
awk '{print $3-$2}' results/xr.sorted.chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/xr.len.txt
In addition, we count the reads in the BED files and use this count in the following steps.
grep -c "^" results/ds.dipy.bed > results/ds.dipy.count.txt
grep -c "^" results/xr.sorted.chr.bed > results/xr.count.txt
The BigWig format stores a coverage signal across the genome and carries no strand information. It is a compact, indexed format required by many tools in later analysis steps.
The first step in generating BigWig files from BED files is generating BedGraph files. Since BigWig and BedGraph files do not record read strand, we generate separate tracks for the plus and minus strands. Each track is scaled to reads per million (RPM) using the read counts computed above.
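The `-scale` argument in the genomecov commands that follow is a reads-per-million factor: 1,000,000 divided by the total read count. A minimal sketch of that arithmetic, with illustrative function names and made-up numbers:

```python
def rpm_scale_factor(total_reads):
    """Factor that converts raw coverage into reads per million (RPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1_000_000 / total_reads


def scale_coverage(depth, total_reads):
    """Scale one raw coverage value to RPM."""
    return depth * rpm_scale_factor(total_reads)


# With 500,000 reads, each read counts as 2 RPM
print(rpm_scale_factor(500_000))   # 2.0
print(scale_coverage(3, 500_000))  # 6.0
```

Scaling by library size is what makes tracks from samples with different read depths comparable.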
bedtools genomecov -i results/ds.dipy.plus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/ds.dipy.count.txt | awk '{print 1000000/$1}') > results/ds.dipy.plus.bdg
bedtools genomecov -i results/ds.dipy.minus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/ds.dipy.count.txt | awk '{print 1000000/$1}') > results/ds.dipy.minus.bdg
bedtools genomecov -i results/xr.sorted.chr.plus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.count.txt | awk '{print 1000000/$1}') > results/xr.plus.bdg
bedtools genomecov -i results/xr.sorted.chr.minus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.count.txt | awk '{print 1000000/$1}') > results/xr.minus.bdg
Then, from these BedGraph files, we generate BigWig files.
# Sort BedGraphs (still in bedops env)
sort -k1,1 -k2,2n results/ds.dipy.plus.bdg > results/ds.dipy.plus.sorted.bdg
sort -k1,1 -k2,2n results/ds.dipy.minus.bdg > results/ds.dipy.minus.sorted.bdg
sort -k1,1 -k2,2n results/xr.plus.bdg > results/xr.plus.sorted.bdg
sort -k1,1 -k2,2n results/xr.minus.bdg > results/xr.minus.sorted.bdg
# Convert to BigWig (ucsc env)
conda deactivate
conda activate ucsc
bedGraphToBigWig results/ds.dipy.plus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/ds.dipy.plus.bw
bedGraphToBigWig results/ds.dipy.minus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/ds.dipy.minus.bw
bedGraphToBigWig results/xr.plus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.plus.bw
bedGraphToBigWig results/xr.minus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.minus.bw
Because CPD and (6-4)PP lesions form at specific dinucleotides, genomic regions rich in adjacent CC, TC, CT, or TT may receive more UV damage than regions poor in them. Sequence content can therefore bias comparisons of damage formation or NER efficiency between two genomic regions. To remove this effect, we create synthetic sequencing data from the real Damage-seq and XR-seq data, which give the expected damage counts and NER efficiencies, respectively, based on the sequence content of the regions of interest. We use Boquila to generate the simulated data.
conda deactivate
conda activate bedops
bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.dipy.bed -fo results/ds.dipy.fa -s
awk '($3-$2==24){print}' results/xr.sorted.chr.bed > results/xr.sorted.24.bed
bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/xr.sorted.24.bed -fo results/xr.fa -s
conda deactivate
conda activate boquila
boquila --fasta results/ds.dipy.fa --bed results/ds.sim.bed --ref ref_genome/GCF_000002985.6_WBcel235_genomic.fna --regions ref_genome/WBcel235.ron --kmer 2 --seed 1 > results/ds.sim.fa
boquila --fasta results/xr.fa --bed results/xr.sim.bed --ref ref_genome/GCF_000002985.6_WBcel235_genomic.fna --regions ref_genome/WBcel235.ron --kmer 2 --seed 1 > results/xr.sim.fa
awk '{if($6=="+"){print}}' results/ds.sim.bed > results/ds.sim.plus.bed
awk '{if($6=="-"){print}}' results/ds.sim.bed > results/ds.sim.minus.bed
awk '{if($6=="+"){print}}' results/xr.sim.bed > results/xr.sim.plus.bed
awk '{if($6=="-"){print}}' results/xr.sim.bed > results/xr.sim.minus.bed
The read counts from the simulated Damage-seq and XR-seq data are then used to normalize the real Damage-seq and XR-seq data and remove the sequence content bias.
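Conceptually, the normalization is an observed-over-expected ratio per genomic bin: real signal divided by simulated signal, both in RPM. The sketch below illustrates this with made-up bin values. The function name and the pseudocount (added here only to avoid division by zero) are assumptions of this sketch, not settings taken from the tutorial.

```python
def normalized_ratio(real_rpm, sim_rpm, pseudocount=1.0):
    """Observed/expected signal for one bin; the pseudocount avoids 0/0."""
    return (real_rpm + pseudocount) / (sim_rpm + pseudocount)


# Made-up RPM values for four bins
real_bins = [3.0, 0.0, 9.0, 1.0]
sim_bins = [1.0, 0.0, 4.0, 3.0]

ratios = [normalized_ratio(r, s) for r, s in zip(real_bins, sim_bins)]
print(ratios)  # values above 1 mean more signal than sequence content predicts
```

The commands that follow build the simulated XR-seq tracks needed for this comparison.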
grep -c '^' results/xr.sim.bed > results/xr.sim.count.txt
conda deactivate
conda activate bedops
sort -k1,1 -k2,2n results/xr.sim.plus.bed | grep -v "^chrDiscard" | bedtools genomecov -i - -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.sim.count.txt | awk '{print 1000000/$1}') > results/xr.sim.plus.bdg
sort -k1,1 -k2,2n results/xr.sim.minus.bed | grep -v "^chrDiscard" | bedtools genomecov -i - -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.sim.count.txt | awk '{print 1000000/$1}') > results/xr.sim.minus.bdg
conda deactivate
conda activate ucsc
bedGraphToBigWig results/xr.sim.plus.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.sim.plus.bw
bedGraphToBigWig results/xr.sim.minus.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.sim.minus.bw
Plot a histogram of read lengths to confirm expected fragment sizes for XR/DS libraries.
conda deactivate
conda activate rplots
R
library(ggplot2)
read_len <- read.table("results/xr.len.txt")
colnames(read_len) <- c("length", "counts")
xrLenDistPlot <- ggplot(data = read_len, aes(x = length, y = counts)) +
geom_bar(stat='identity') +
xlab("Read length") +
ylab("Read count")
ggsave("results/xrLenDistPlot.png", plot = xrLenDistPlot)
q()
We check the reads for the enrichment of nucleotides and dinucleotides at specific positions.
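To show what a per-position abundance table contains, here is a minimal Python sketch that computes nucleotide fractions at each position across equal-length sequences. It is not the implementation of `scripts/fa2kmerAbundanceTable.py`, whose output layout is not described here; the function name and the example sequences are made up.

```python
from collections import Counter


def position_frequencies(sequences, alphabet="ACGT"):
    """Return one dict per position mapping nucleotide -> fraction of reads."""
    if not sequences:
        return []
    length = len(sequences[0])
    # Only sequences of the same length can be compared position by position
    usable = [seq.upper() for seq in sequences if len(seq) == length]
    table = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in usable)
        total = sum(counts[nt] for nt in alphabet)
        table.append({nt: (counts[nt] / total if total else 0.0)
                      for nt in alphabet})
    return table


# Made-up 4-nt sequences
seqs = ["ACTT", "GCTT", "ACTC", "actt"]
table = position_frequencies(seqs)
for pos, row in enumerate(table, start=1):
    print(pos, row)
```

In real Damage-seq windows, a table like this should show C and T dominating positions 5 and 6. The commands that follow compute such tables for single nucleotides (-k 1) and dinucleotides (-k 2).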
conda deactivate
conda activate bedops
python3 scripts/fa2kmerAbundanceTable.py -i results/ds.dipy.fa -k 1 -o results/ds.dipy.nt.txt
python3 scripts/fa2kmerAbundanceTable.py -i results/ds.dipy.fa -k 2 -o results/ds.dipy.di.txt
python3 scripts/fa2kmerAbundanceTable.py -i results/xr.fa -k 1 -o results/xr.nt.txt
python3 scripts/fa2kmerAbundanceTable.py -i results/xr.fa -k 2 -o results/xr.di.txt
Because Damage-seq and XR-seq have characteristic sequence features, plot per-position nucleotide enrichment as a quality check.
conda deactivate
conda activate rplots
Rscript scripts/plot_kmer_content.R -i results/xr.nt.txt -k 1 -o results/xr_nuc.png -s "XR-seq nucleotide content" -l results/xr_nuc.log
Rscript scripts/plot_kmer_content.R -i results/ds.dipy.nt.txt -k 1 -o results/ds_nuc.png -s "DS-seq nucleotide content" -l results/ds_nuc.log
Rscript scripts/plot_kmer_content.R -i results/ds.dipy.di.txt -k 2 -o results/ds_dinuc.png -s "DS-seq dinucleotide content" -l results/ds_dinuc.log -f "CC,CT,TC,TT"
Rscript scripts/plot_kmer_content.R -i results/xr.di.txt -k 2 -o results/xr_dinuc.png -s "XR-seq dinucleotide content" -l results/xr_dinuc.log -f "CC,CT,TC,TT"
Assess data quality by comparing replicates and conditions. Replicates should correlate well; different damage types/products should show distinct profiles.
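The comparison below uses Spearman correlation, which is the Pearson correlation of ranks and is therefore robust to a few very high-coverage bins. As a minimal, dependency-free sketch of the statistic (not deepTools' implementation), with made-up per-bin read counts and illustrative function names:

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up read counts per genomic bin for two samples
sample_a = [1, 2, 3, 4]
sample_b = [10, 20, 30, 400]
print(spearman(sample_a, sample_b))
```

The outlier bin (400) does not lower the coefficient because only the ordering of bins matters. The deepTools commands that follow compute the same kind of statistic over genome-wide bins.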
conda deactivate
conda activate deeptools
multiBamSummary bins --bamfiles results/ds.dedup.bam results/xr.dedup.bam --minMappingQuality 20 --outFileName results/bamSummary.npz
plotCorrelation -in results/bamSummary.npz --corMethod spearman --skipZeros --plotTitle "Spearman Correlation of Read Counts" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o results/bamCorr.png
deepTools summarizes coverage across regions (e.g., genes). Provide BigWigs and a BED of regions of interest:
bigwigCompare --bigwig1 results/xr.plus.bw --bigwig2 results/xr.sim.plus.bw --operation ratio --outFileFormat bigwig --outFileName results/xr.norm_sim.plus.bw
computeMatrix scale-regions -S results/xr.norm_sim.plus.bw -R ref_genome/celeg_genes.bed --outFileName results/xr.norm_sim.plus.on_genes.computeMatrix.out -b 1000 -a 1000 --smartLabels
plotHeatmap --matrixFile results/xr.norm_sim.plus.on_genes.computeMatrix.out --outFileName results/xr.norm_sim.plus.on_genes.heatmap_k2.png --xAxisLabel "Position with respect to genes" --yAxisLabel "Genes" --kmeans 2 --heatmapWidth 10 --startLabel "Start" --endLabel "End"
If you prefer an automated version of the steps in this tutorial, see the Snakemake workflow:
- Repository: CompGenomeLab/xr-ds-seq-snakemake
It orchestrates genome preparation, QC, adapter handling, alignment, BED conversions, Damage-seq motif filtering, coverage track generation, simulations, and downstream summaries. Use it when you need a fully reproducible, end-to-end run with minimal manual intervention.