GitHub repository tutorial for XR-Seq and Damage-Seq methodologies, providing code examples, datasets, and documentation for studying DNA damage and repair processes.

CompGenomeLab/XR-DS-seq-Tutorial

Analysis Tutorial for XR-seq and Damage-seq

This tutorial focuses on the manual analysis of XR-seq and Damage-seq data from raw reads to genome-wide tracks and quality control plots. Use the sections below for step-by-step commands.

Introduction

Prerequisites:

  • A GitHub account (for obtaining materials and sharing results). Create one at https://github.com/join.
  • A Linux environment is recommended. If you are not on Linux, please follow our WSL setup guide: wsl_vscode_github.pdf.
  • Visual Studio Code is highly recommended (not mandatory) for the exercises and file navigation.
  • Conda environments should be installed and ready-to-use (see ENVIRONMENTS.md).

Outline:

  • Concepts: XR vs DS signals, strand conventions, lesion/products.
  • QC: FastQC/MultiQC.
  • Adapter handling: cutadapt (XR trim-and-keep; DS discard-trim).
  • Alignment and BAM ops: bowtie2, samtools, duplicate handling, MAPQ filtering.
  • BED conversion; chromosome/strand filtering.
  • Damage-seq motif filtering at lesion positions.
  • Coverage and normalization: bedGraph/bigWig per strand; quick IGV review.
  • QC and analyses: length distributions, nucleotide enrichment, replicate correlations, region-level profiles (deepTools).
  • Advanced: simulation-based controls with boquila.

Requirements:

  • Example FASTQs in samples/; outputs under results/.
  • Reference genome in ref_genome/ (this tutorial uses C. elegans WBcel235/ce11).
  • Example dataset (tutorial files are 1,000,000‑read subsamples):
    • Damage‑seq example (WT L1 0 h (6‑4)PP Rep1): GSM8595180
    • XR‑seq example (WT L1 5 min (6‑4)PP Rep1): GSM8595202

Introduction to XR-seq

Excision Repair sequencing (XR-seq) captures the short oligomers excised by nucleotide excision repair (NER). Key features:

  • Captures excision products produced by NER; signals reflect excision activity across the genome.
  • Fragment lengths are protocol- and organism-dependent (e.g., often ~24–32 nt after trimming in many eukaryotes, shorter in some prokaryotes); signal is strand-aware.
  • Interpreting XR-seq focuses on excision footprints and strand assignments rather than any single repair subpathway.
  • Analysis goals: adapter trimming (retain trimmed reads), alignment with short-read-aware parameters, deduplication/quality filtering, convert to BED with correct strandedness, generate normalized coverage (bigWig), and compute QA metrics (length distribution, nucleotide enrichment, correlations).

Introduction to Damage-seq

Damage-seq maps DNA lesions (e.g., CPD, (6-4)PP) by capturing polymerase-arrested sites. Key features:

  • Reads report lesion positions; for UV photoproducts (CPD, (6-4)PP), lesions occur at dipyrimidines; motif filtering improves specificity.
  • Lesion position is typically two bases upstream of the read’s 5′ end; downstream steps center windows accordingly before motif checks.
  • Adapter handling is diagnostic: presence of adapter indicates no polymerase arrest (no lesion), so discard-trim is used to keep only lesion-containing reads.
  • Analysis goals: adapter detection/removal (discard-trim), alignment, high-quality unique mappings, lesion-site coordinate definition, motif filtering, normalized coverage, and QA metrics (length distributions, motif/nucleotide enrichment).

Further reading

  • Hu, J., Li, W., Adebali, O., et al. Genome-wide mapping of nucleotide excision repair with XR-seq. Nature Protocols 14, 248–282 (2019). A detailed, step-by-step XR-seq laboratory and analysis protocol with practical considerations and preliminary analysis examples.
    Link: https://www.nature.com/articles/s41596-018-0093-7

  • Hu, J., Adebali, O., Adar, S. & Sancar, A. Dynamic maps of UV damage formation and repair for the human genome. Proceedings of the National Academy of Sciences 114(26), 6758–6763 (2017). The Damage-seq methodology paper.
    Link: https://www.pnas.org/doi/10.1073/pnas.1706522114

  • Hu, J., Adar, S., Selby, C. P., Lieb, J. D. & Sancar, A. Genome-wide analysis of human global and transcription-coupled excision repair of UV damage at single-nucleotide resolution. Genes & Development 29, 948–960 (2015). Introduces single-nucleotide–resolution repair maps, separating global and transcription-coupled repair components in human cells.
    Link: https://genesdev.cshlp.org/content/29/9/948

  • Adar, S., Hu, J., Lieb, J. D. & Sancar, A. Genome-wide kinetics of DNA excision repair in relation to chromatin state and mutagenesis. PNAS (2016). Quantifies excision repair kinetics genome-wide and relates repair dynamics to chromatin features and mutational patterns.
    Link: https://www.pnas.org/doi/10.1073/pnas.1614430113

  • Li, W., Hu, J., et al. (2017) PNAS. Extends genome-wide mapping by integrating UV damage formation and repair dynamics to reveal determinants of repair heterogeneity.
    Link: https://www.pnas.org/doi/10.1073/pnas.1706522114

NGS File Format

This section explains the file formats that you will encounter throughout the tutorial. For more details, see the UCSC File Format FAQ.

  • FASTQ:

    • The FASTQ format is used to store both the sequencing reads and their corresponding quality scores.
    • It consists of four lines per sequence read:
      • Line 1: Begins with a '@' symbol followed by a unique identifier for the read.
      • Line 2: Contains the actual nucleotide sequence of the read.
      • Line 3: Starts with a '+' symbol and may optionally contain the same unique identifier as Line 1.
      • Line 4: Contains the quality scores corresponding to the nucleotides in Line 2.
    • FASTQ files are commonly generated by sequencing platforms and are widely used as input for various NGS data analysis tools.
  • FASTA:

    • The FASTA format is a simple and widely used text-based format for representing nucleotide or protein sequences.
    • Each sequence in a FASTA file consists of two parts:
      • A single-line description or identifier that starts with a '>' symbol, followed by a sequence identifier or description.
      • The sequence data itself, which can span one or more lines.
    • The sequence lines contain the actual nucleotide or amino acid symbols representing the sequence.
    • FASTA files can store multiple sequences, each with its own identifier and sequence data.
    • FASTA format is commonly used for storing and exchanging sequences, such as reference genomes, gene sequences, or protein sequences.
    • Many bioinformatics tools and databases accept FASTA files as input for various sequence analysis tasks, including alignment, motif discovery, and similarity searches.
  • SAM/BAM:

    • SAM (Sequence Alignment/Map) is a text-based format used to store the alignment information of reads to a reference genome.
    • BAM (Binary Alignment/Map) is the binary equivalent of SAM, which is more compact and allows for faster data processing.
    • Both SAM and BAM files contain the same information, including read sequences, alignment positions, mapping qualities, and optional tags for additional metadata.
    • SAM/BAM files are crucial for downstream analysis tasks such as variant calling, differential expression analysis, and visualization in genome browsers.
    • SAM/BAM files can be generated using aligners like Bowtie2 or BWA.
  • BED:

    • BED (Browser Extensible Data) is a widely used format for representing genomic intervals or features.
    • It consists of a tab-separated plain-text file with columns representing different attributes of each genomic feature.
    • The standard BED format includes columns for chromosome, start position, end position, and feature name, followed by optional columns such as score and strand. Start coordinates are 0-based and end coordinates are exclusive, so the length of an interval is end minus start.
    • BED files are commonly used for visualizing and analyzing genomic features such as gene coordinates, binding sites, peaks, or regulatory regions.
    • BED files can be generated from BAM files using tools like BEDTools (bedtools bamtobed).
  • BEDGRAPH:

    • BEDGRAPH is a file format used to represent continuous numerical data across the genome, such as signal intensities, coverage, or scores.
    • Similar to BED files, BEDGRAPH files are plain-text files with columns representing genomic intervals and corresponding values.
    • The standard BEDGRAPH format includes columns for chromosome, start position, end position, and a numerical value representing the data for that interval.
    • BEDGRAPH files are typically used to visualize and analyze genome-wide data, such as ChIP-seq signal, DNA methylation levels, or RNA-seq coverage.
    • The values in a BEDGRAPH file can be positive or negative, representing different aspects of the data.
    • BEDGRAPH files can be generated from BED or BAM files using tools like BEDTools (bedtools genomecov with -bg); bedGraphToBigWig then converts them to BigWig.
  • BigWig:

    • BigWig is a binary file format designed for efficient storage and retrieval of large-scale numerical data across the genome.
    • BigWig files are created from BEDGRAPH files and provide a compressed and indexed representation of the data, allowing for fast random access and visualization.
    • BigWig files store the data in a binary format and include additional indexing information for quick retrieval of data within specific genomic regions.
    • They are commonly used for visualizing and analyzing genome-wide data in genome browsers or for performing quantitative analyses across different genomic regions.
    • BigWig files can be generated from BEDGRAPH files using tools like bedGraphToBigWig or through conversion from other formats like BAM or WIG.
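
To make the four-line FASTQ layout described above concrete, here is a minimal Python sketch that parses records from a list of lines and decodes the quality string. It is an illustration, not part of the tutorial's scripts: the names `FastqRecord` and `parse_fastq` are hypothetical, and it assumes unwrapped records with Phred+33 quality encoding.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry (assumes unwrapped records)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        # Line 1 starts with '@', line 3 starts with '+'
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near line {i + 1}")
        # Line 4 carries exactly one quality character per base of line 2
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33)
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

For real files you would read the lines from a gzip-compressed FASTQ rather than from an in-memory list.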

Initializing Git

If you have never used git commands on your local computer, go to your terminal and run the following.

git config --global user.email "EMAIL USED FOR GITHUB"
git config --global user.name "GITHUB USER NAME"

Clone GitHub Repository

git clone https://github.com/CompGenomeLab/XR-DS-seq-Tutorial.git

Navigate to the cloned directory.

cd XR-DS-seq-Tutorial/

Create conda environments

See ENVIRONMENTS.md for creating and switching between the step-specific environments used in this tutorial. After creating the environments, ensure working directories exist:

mkdir -p ref_genome/Bowtie2 results qc

Downloading reference genome and generating related files

This section provides commands to prepare the reference genome and generate related files. For alignment and BAM operations, activate the mapping environment:

conda activate mapping

Next, prepare the reference genome. The compressed C. elegans (WBcel235/ce11) reference is already provided at ref_genome/GCF_000002985.6_WBcel235_genomic.fna.gz. Decompress it if needed:

gunzip -f ref_genome/GCF_000002985.6_WBcel235_genomic.fna.gz

With the genome FASTA file in place, build the Bowtie2 index. Bowtie2 needs this index to align reads, and building it once lets every alignment run reuse it.

bowtie2-build --threads 8 ref_genome/GCF_000002985.6_WBcel235_genomic.fna ref_genome/Bowtie2/genome_Celegans

You will create another index file via samtools to retrieve sequence information based on genomic coordinates or sequence identifiers.

samtools faidx ref_genome/GCF_000002985.6_WBcel235_genomic.fna
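
To illustrate how the .fai index supports coordinate-based retrieval, the sketch below parses the five .fai columns (name, length, offset, bases per line, bytes per line) and computes the byte offset of a given base. The function names and the small in-memory FASTA are hypothetical and only serve the demonstration; in practice samtools faidx does this for you.

```python
from typing import Dict, NamedTuple


class FaiEntry(NamedTuple):
    length: int     # sequence length in bases
    offset: int     # byte offset of the first base in the FASTA file
    linebases: int  # bases per FASTA line
    linewidth: int  # bytes per FASTA line, including the newline


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the tab-separated .fai content into a name -> entry mapping."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        entries[name] = FaiEntry(int(length), int(offset), int(linebases), int(linewidth))
    return entries


def byte_offset(entry: FaiEntry, position: int) -> int:
    """Byte offset in the FASTA file of a 0-based base position."""
    if not 0 <= position < entry.length:
        raise IndexError("position outside the sequence")
    full_lines, within_line = divmod(position, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line


# Hypothetical one-sequence FASTA: 100 bases wrapped at 60 per line.
sequence = "ACGT" * 25
fasta = ">chrI\n" + sequence[:60] + "\n" + sequence[60:] + "\n"
index = parse_fai("chrI\t100\t6\t60\t61\n")
position = 75
print(fasta[byte_offset(index["chrI"], position)] == sequence[position])
```

The same first two columns (name and length) are what the later steps reuse as a chromosome sizes file.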

Next, the index file created by samtools is converted to a ron file, a different format of the same index. This file will be needed when we simulate our samples with boquila.

python3 scripts/idx2ron.py -i ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -o ref_genome/WBcel235.ron -l genome_idex2ron.log

Quality control

Here, you'll learn how to perform quality control on the NGS data using FastQC, a tool for assessing the quality of sequencing reads. A nice github page for the evaluation of FastQC results.

Activate the fastqc environment:

conda deactivate
conda activate fastqc

The qc/ output directory was already created above with mkdir -p; if it is missing, create it with the mkdir qc/ command. Then run FastQC on both the Damage-seq and XR-seq samples, followed by MultiQC to aggregate the reports:

fastqc -t 8 --outdir qc/ samples/celeg_ds_64.fastq.gz
fastqc -t 8 --outdir qc/ samples/celeg_xr_64.fastq.gz

multiqc qc/ -o qc/

Adapter handling

This section demonstrates how to handle adapters in NGS data using cutadapt, a tool for trimming adapter sequences and other contaminants. Activate the trimming environment:

conda deactivate
conda activate trimming

At this stage, we will perform different tasks for Damage-seq and XR-seq. In Damage-seq, discard reads containing adapters (indicative of no lesion). In XR-seq, trim adapters and keep the trimmed reads.

cutadapt -j 8 -g GACTGGTTCCAATTGAAAGTGCTCTTCCGATCT --discard-trimmed -o results/ds.trim.fastq.gz samples/celeg_ds_64.fastq.gz

cutadapt -j 8 -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTNNNNNNACGATCTCGTATGCCGTCTTCTGCTTG -o results/xr.trim.fastq.gz samples/celeg_xr_64.fastq.gz
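
The difference between the two adapter policies can be sketched in a few lines of Python. This is a deliberately simplified illustration using exact string matching; cutadapt itself performs error-tolerant matching, so this is not a substitute for it. The function names and the short toy adapter are hypothetical.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """XR-seq style: cut the adapter and everything after it, keep the rest."""
    hit = read.find(adapter)
    if hit != -1:
        return read[:hit]
    # The read may end inside the adapter: look for the longest adapter
    # prefix (at least min_overlap bases) that is also a suffix of the read.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def keep_damage_seq_read(read: str, adapter: str) -> bool:
    """Damage-seq style: a read that contains the adapter is discarded."""
    return adapter not in read


TOY_ADAPTER = "TGGAATTCTCGG"  # hypothetical short adapter, for the demo only
print(trim_3prime_adapter("ACGTACGT" + TOY_ADAPTER + "AAAA", TOY_ADAPTER))
print(trim_3prime_adapter("ACGTACGTTGGAA", TOY_ADAPTER))
print(keep_damage_seq_read("TTTT" + TOY_ADAPTER + "ACGT", TOY_ADAPTER))
```

In the XR-seq case the trimmed read is kept, whereas in the Damage-seq case the adapter is only used as a signal for removing the read.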

Mapping, removing duplicates, quality filtering, and converting to bed

This section covers the steps involved in mapping the preprocessed reads to the reference genome using Bowtie2, converting the mapped reads to BAM format, and extracting BED files.

Activate the mapping environment (includes bowtie2, samtools, picard):

conda deactivate
conda activate mapping

First, we align the reads to the reference genome using the prepared index files. Then we convert the output SAM files to BAM.

bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_Celegans -U results/ds.trim.fastq.gz -S results/ds.sam

samtools view -Sbh -o results/ds.bam results/ds.sam

bowtie2 --threads 8 --seed 1 --reorder -x ref_genome/Bowtie2/genome_Celegans -U results/xr.trim.fastq.gz -S results/xr.sam

samtools view -Sbh -o results/xr.bam results/xr.sam

In the next part, we will remove the duplicate reads with picard.

Picard MarkDuplicates expects coordinate-sorted input, so sort the BAM files with samtools first and then run Picard.

samtools sort -o results/ds.sort.bam results/ds.bam -@ 8 -T results/

samtools sort -o results/xr.sort.bam results/xr.bam -@ 8 -T results/

picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/ds.sort.bam --TMP_DIR results/ --OUTPUT results/ds.dedup.bam --METRICS_FILE results/ds.dedup.metrics.txt

picard MarkDuplicates --REMOVE_DUPLICATES true --INPUT results/xr.sort.bam --TMP_DIR results/ --OUTPUT results/xr.dedup.bam --METRICS_FILE results/xr.dedup.metrics.txt
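
As a rough mental model of duplicate removal for single-end reads, the sketch below treats reads as duplicates when they share chromosome, strand, and 5′ end position, and keeps the first one seen. This is a simplification and an assumption on my part: Picard works on the BAM records themselves, accounts for clipping, and chooses which read to keep by base quality. The function name is hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, int, int, str]  # (chrom, start, end, strand), BED-style coordinates


def remove_duplicates(reads: List[Read]) -> List[Read]:
    """Keep the first read for each (chrom, 5' end, strand) combination."""
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        # The 5' end is the start on the plus strand and the end on the minus strand
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if key in seen:
            continue
        seen.add(key)
        kept.append((chrom, start, end, strand))
    return kept


reads = [
    ("chrI", 100, 126, "+"),
    ("chrI", 100, 127, "+"),  # same 5' end and strand as the first read
    ("chrI", 100, 126, "-"),
    ("chrI", 90, 126, "-"),   # same 5' end (126) and strand as the third read
]
print(remove_duplicates(reads))
```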

Lastly, we will remove low quality reads (MAPQ score < 20) via samtools and use bedtools to convert our bam files into bed format.

samtools index results/ds.dedup.bam
samtools view -q 20 -b results/ds.dedup.bam > results/ds.q20.bam

samtools index results/xr.dedup.bam
samtools view -q 20 -b results/xr.dedup.bam > results/xr.q20.bam

# Bedops env for BED conversion
conda deactivate
conda activate bedops
bedtools bamtobed -i results/ds.q20.bam > results/ds.bed
bedtools bamtobed -i results/xr.q20.bam > results/xr.bed
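
The two operations above, MAPQ filtering and conversion to BED, can be sketched on plain SAM text lines. This is an illustrative approximation with hypothetical function names, not a replacement for samtools or bedtools: the MAPQ is the fifth SAM column, the BED start is the 1-based SAM position minus one, and the BED end is derived from the CIGAR operations that consume the reference.

```python
import re
from typing import Optional, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def passes_mapq(sam_line: str, min_mapq: int = 20) -> bool:
    """True for header lines and for alignments with MAPQ >= min_mapq."""
    if sam_line.startswith("@"):
        return True
    return int(sam_line.rstrip("\n").split("\t")[4]) >= min_mapq


def sam_to_bed6(sam_line: str) -> Optional[Tuple[str, int, int, str, int, str]]:
    """Convert one SAM alignment line to (chrom, start, end, name, score, strand)."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), int(fields[4]), fields[5]
    if flag & 0x4:  # unmapped read: no interval to report
        return None
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM is 1-based, BED is 0-based
    strand = "-" if flag & 0x10 else "+"
    return chrom, start, start + ref_len, name, mapq, strand


line = "read1\t16\tchrI\t101\t42\t26M\t*\t0\t0\tACGT\tIIII"
if passes_mapq(line):
    print(sam_to_bed6(line))
```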

Sorting BED files

After obtaining the BED files, the reads in the BED files are sorted according to genomic coordinates.

sort -k1,1 -k2,2n -k3,3n results/ds.bed > results/ds.sorted.bed

sort -k1,1 -k2,2n -k3,3n results/xr.bed > results/xr.sorted.bed

Filtering for chromosomes

Only reads aligned to chromosomes I-V and X are kept for further analysis; reads aligned to any other sequence are removed. The patterns below assume that the sequence names in the reference FASTA start with chr.

grep "^chr" results/ds.sorted.bed | grep -v "chrDiscard" > results/ds.sorted.chr.bed

grep "^chr" results/xr.sorted.bed | grep -v "chrDiscard" > results/xr.sorted.chr.bed

Separating strands

The reads in the BED files are separated according to which strand they were mapped on.

    awk '{if($6=="+"){print}}' results/ds.sorted.chr.bed > results/ds.sorted.chr.plus.bed
    awk '{if($6=="-"){print}}' results/ds.sorted.chr.bed > results/ds.sorted.chr.minus.bed

    awk '{if($6=="+"){print}}' results/xr.sorted.chr.bed > results/xr.sorted.chr.plus.bed
    awk '{if($6=="-"){print}}' results/xr.sorted.chr.bed > results/xr.sorted.chr.minus.bed

Centering the damage site for Damage-seq reads

Due to the experimental protocol of Damage-seq, the damaged dipyrimidine is located at the two bases immediately upstream of each read's 5′ end. We use bedtools flank and bedtools slop to obtain 10-nucleotide windows with the damaged nucleotides at the 5th and 6th positions.

    cut -f1,2 ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai > ref_genome/sizes.chrom

    bedtools flank -i  results/ds.sorted.chr.plus.bed -g ref_genome/sizes.chrom -l 6 -r 0 > results/ds.plus.flank.bed
    bedtools flank -i results/ds.sorted.chr.minus.bed -g ref_genome/sizes.chrom -l 0 -r 6 > results/ds.minus.flank.bed
    bedtools slop -i results/ds.plus.flank.bed -g ref_genome/sizes.chrom -l 0 -r 4 > results/ds.plus.10.bed
    bedtools slop -i results/ds.minus.flank.bed -g ref_genome/sizes.chrom -l 4 -r 0 > results/ds.minus.10.bed
    cat results/ds.plus.10.bed results/ds.minus.10.bed > results/ds.10.bed
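
The coordinate arithmetic behind the flank and slop steps can be written out as a short Python sketch. It assumes BED-style 0-based, end-exclusive coordinates and approximates the clipping at chromosome boundaries; the function name `damage_window` is hypothetical.

```python
from typing import Tuple


def damage_window(start: int, end: int, strand: str, chrom_size: int,
                  upstream: int = 6, into_read: int = 4) -> Tuple[int, int]:
    """Return the 10-nt window: 6 nt upstream of the read's 5' end plus 4 nt of the read."""
    if strand == "+":
        # 5' end is the BED start; upstream lies to the left
        window_start, window_end = start - upstream, start + into_read
    elif strand == "-":
        # 5' end is the BED end; upstream lies to the right
        window_start, window_end = end - into_read, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip to the chromosome, as an approximation of what bedtools does at the edges
    return max(window_start, 0), min(window_end, chrom_size)


# A plus-strand read at 100-150: window 94-104, damaged bases at genomic 98 and 99
print(damage_window(100, 150, "+", 1000))
# A minus-strand read at 100-150: window 146-156, read in reverse complement
print(damage_window(100, 150, "-", 1000))
```

On the plus strand, window positions 5 and 6 correspond to the two bases just before the read start; on the minus strand the same holds once the window is reverse-complemented, which is what the -s option of bedtools getfasta takes care of in the next step.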

Obtaining FASTA files from BED files

To convert BED files to FASTA format:

    bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.plus.10.bed -fo results/ds.plus.10.fa -s
    bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.minus.10.bed -fo results/ds.minus.10.fa -s

Filtering Damage-seq data by motif

For UV photoproducts, the lesion should be a dipyrimidine, so positions 5 and 6 of each 10-nucleotide window are expected to be C or T. Use this as an additional filter to remove reads that do not meet this criterion.

    python3 scripts/fa2bedByChoosingReadMotifs.py -i results/ds.plus.10.fa -o results/ds.dipy.plus.bed -r '.{4}(c|t|C|T){2}.{4}'
    python3 scripts/fa2bedByChoosingReadMotifs.py -i results/ds.minus.10.fa -o results/ds.dipy.minus.bed -r '.{4}(c|t|C|T){2}.{4}'

    cat results/ds.dipy.plus.bed results/ds.dipy.minus.bed > results/ds.dipy.bed
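
The regular expression passed to the script can be tried out on its own. The sketch below applies the same pattern with Python's re module; using fullmatch (so that the window must be exactly 10 nt) is my assumption and may differ from how the repository script applies the pattern. The function name is hypothetical.

```python
import re

# Same pattern as in the commands above: any 4 bases, then two pyrimidines
# (C or T, either case), then any 4 bases.
DIPYRIMIDINE_AT_5_6 = re.compile(r".{4}(c|t|C|T){2}.{4}")


def has_central_dipyrimidine(window_seq: str) -> bool:
    """True if positions 5 and 6 of a 10-nt window are both C or T."""
    return DIPYRIMIDINE_AT_5_6.fullmatch(window_seq) is not None


for seq in ["AAAATTAAAA", "AAAAtcGGGG", "AAAAAGAAAA", "AAAACAGGGG"]:
    print(seq, has_central_dipyrimidine(seq))
```

Lower-case letters are accepted because reference FASTA files often mark repeats by soft-masking them in lower case.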

Obtaining read length distribution and read count

To assess whether read lengths match expectations for XR-seq or Damage-seq, compute the read length distribution.

    awk '{print $3-$2}' results/ds.sorted.chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/ds.len.txt

    awk '{print $3-$2}' results/xr.sorted.chr.bed | sort -k1,1n | uniq -c | sed 's/\s\s*/ /g' | awk '{print $2"\t"$1}' > results/xr.len.txt

In addition, we count the reads in the BED files and use this count in the following steps.

    grep -c "^" results/ds.dipy.bed > results/ds.dipy.count.txt
    grep -c "^" results/xr.sorted.chr.bed > results/xr.count.txt
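
For readers who prefer Python to the awk pipeline, the length distribution can be sketched as follows. It relies on BED intervals being 0-based and end-exclusive, so the read length is end minus start; the function name is hypothetical and the input is a small in-memory example rather than the tutorial files.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def length_distribution(bed_lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (length, count) pairs sorted by length."""
    counts: Counter = Counter()
    for line in bed_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[int(fields[2]) - int(fields[1])] += 1  # end - start
    return sorted(counts.items())


example_bed = [
    "chrI\t100\t126\tread1\t42\t+\n",
    "chrI\t200\t226\tread2\t42\t-\n",
    "chrII\t50\t77\tread3\t30\t+\n",
]
for length, count in length_distribution(example_bed):
    print(f"{length}\t{count}")
```

The total read count used for normalization is simply the sum of the counts.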

Generating BigWig files

BigWig is a compact, indexed binary format that stores a numerical signal, here the read coverage, along the genome. It does not record strand, and many tools in the later analysis steps require it.

The first step in generating BigWig files from BED files is to generate BedGraph files. Since strand is not taken into account in BigWig and BedGraph files, we generate separate files for the plus and minus strands, each scaled to reads per million using the counts computed above.

    bedtools genomecov -i results/ds.dipy.plus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/ds.dipy.count.txt | awk '{print 1000000/$1}') > results/ds.dipy.plus.bdg
    
    bedtools genomecov -i results/ds.dipy.minus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/ds.dipy.count.txt | awk '{print 1000000/$1}') > results/ds.dipy.minus.bdg

    bedtools genomecov -i results/xr.sorted.chr.plus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.count.txt | awk '{print 1000000/$1}') > results/xr.plus.bdg        
    
    bedtools genomecov -i results/xr.sorted.chr.minus.bed -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.count.txt | awk '{print 1000000/$1}') > results/xr.minus.bdg
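
The -scale argument in the commands above is a reads-per-million factor, 1,000,000 divided by the total read count. A minimal sketch of that arithmetic, with hypothetical function names and made-up numbers:

```python
def rpm_scale_factor(total_reads: int) -> float:
    """Factor that converts raw coverage to reads per million."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1_000_000 / total_reads


def scale_bedgraph_line(line: str, factor: float) -> str:
    """Multiply the value column (4th column) of a bedGraph line by factor."""
    chrom, start, end, value = line.rstrip("\n").split("\t")
    return f"{chrom}\t{start}\t{end}\t{float(value) * factor}"


factor = rpm_scale_factor(500_000)  # hypothetical library of 500,000 reads
print(factor)
print(scale_bedgraph_line("chrI\t100\t126\t3", factor))
```

Because both strands are scaled with the same total count, plus and minus tracks of one sample remain directly comparable.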

Then, from these BedGraph files, we generate BigWig files.

    # Sort BedGraphs (still in bedops env)
    sort -k1,1 -k2,2n results/ds.dipy.plus.bdg > results/ds.dipy.plus.sorted.bdg
    sort -k1,1 -k2,2n results/ds.dipy.minus.bdg > results/ds.dipy.minus.sorted.bdg
    
    sort -k1,1 -k2,2n results/xr.plus.bdg > results/xr.plus.sorted.bdg
    sort -k1,1 -k2,2n results/xr.minus.bdg > results/xr.minus.sorted.bdg
    # Convert to BigWig (ucsc env)
    conda deactivate
    conda activate ucsc
    bedGraphToBigWig results/ds.dipy.plus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/ds.dipy.plus.bw
    bedGraphToBigWig results/ds.dipy.minus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/ds.dipy.minus.bw

    bedGraphToBigWig results/xr.plus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.plus.bw
    bedGraphToBigWig results/xr.minus.sorted.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.minus.bw

Simulating the sample reads

Because CPD and (6-4)PP damage types require certain nucleotides in certain positions, the genomic locations rich in adjacent CC, TC, CT or TT dinucleotides may be prone to receiving more UV damage while other regions that are poor in these dinucleotides receive less damage. Therefore, the sequence contents may bias our analysis results while comparing the damage formation or NER efficiency of two genomic regions. In order to eliminate the effect of sequence content, we create synthetic sequencing data from the real Damage-seq and XR-seq data, which give us the expected damage counts and NER efficiencies, respectively, from the sequence content of the genomic areas of interest. We use Boquila to generate simulated data.

    conda deactivate
    conda activate bedops
    bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/ds.dipy.bed -fo results/ds.dipy.fa -s

    awk '($3-$2==24){print}' results/xr.sorted.chr.bed > results/xr.sorted.24.bed
    bedtools getfasta -fi ref_genome/GCF_000002985.6_WBcel235_genomic.fna -bed results/xr.sorted.24.bed -fo results/xr.fa -s

    conda deactivate
    conda activate boquila
    boquila --fasta results/ds.dipy.fa --bed results/ds.sim.bed --ref ref_genome/GCF_000002985.6_WBcel235_genomic.fna --regions ref_genome/WBcel235.ron --kmer 2 --seed 1 > results/ds.sim.fa

    boquila --fasta results/xr.fa --bed results/xr.sim.bed --ref ref_genome/GCF_000002985.6_WBcel235_genomic.fna --regions ref_genome/WBcel235.ron --kmer 2 --seed 1 > results/xr.sim.fa
    awk '{if($6=="+"){print}}' results/ds.sim.bed > results/ds.sim.plus.bed
    awk '{if($6=="-"){print}}' results/ds.sim.bed > results/ds.sim.minus.bed

    awk '{if($6=="+"){print}}' results/xr.sim.bed > results/xr.sim.plus.bed
    awk '{if($6=="-"){print}}' results/xr.sim.bed > results/xr.sim.minus.bed

The read counts from the simulated Damage-seq and XR-seq data are then used to normalize our real Damage-seq and XR-seq data to eliminate the sequence content bias.

    grep -c '^' results/xr.sim.bed > results/xr.sim.count.txt

    conda deactivate
    conda activate bedops
    sort -k1,1 -k2,2n results/xr.sim.plus.bed | grep -v "^chrDiscard" | bedtools genomecov -i - -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.sim.count.txt | awk '{print 1000000/$1}') > results/xr.sim.plus.bdg
    
    sort -k1,1 -k2,2n results/xr.sim.minus.bed | grep -v "^chrDiscard" | bedtools genomecov -i - -g ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai -bg -scale $(cat results/xr.sim.count.txt | awk '{print 1000000/$1}') > results/xr.sim.minus.bdg

    conda deactivate
    conda activate ucsc
    bedGraphToBigWig results/xr.sim.plus.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.sim.plus.bw
    bedGraphToBigWig results/xr.sim.minus.bdg ref_genome/GCF_000002985.6_WBcel235_genomic.fna.fai results/xr.sim.minus.bw

Plotting length distribution, nucleotide enrichment, and BAM correlations

Plotting read length distribution

Plot a histogram of read lengths to confirm expected fragment sizes for XR/DS libraries.

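    conda deactivate
    conda activate rplots

Then run the following in an R session (start one by typing R in the terminal):

    library(ggplot2)
    # Two columns: read length and the number of reads with that length
    read_len <- read.table("results/xr.len.txt")
    colnames(read_len) <- c("length", "counts")
    xrLenDistPlot <- ggplot(data = read_len, aes(x = length, y = counts)) +
        geom_bar(stat='identity') +
        xlab("Read length") +
        ylab("Read count")
    # Pass the plot explicitly instead of relying on the most recent plot
    ggsave("results/xrLenDistPlot.png", plot = xrLenDistPlot)
    q()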

Assessing and Plotting nucleotide enrichment of the reads

We check the reads for the enrichment of nucleotides and dinucleotides at specific positions.

    conda deactivate
    conda activate bedops
    python3 scripts/fa2kmerAbundanceTable.py -i results/ds.dipy.fa -k 1 -o results/ds.dipy.nt.txt
    python3 scripts/fa2kmerAbundanceTable.py -i results/ds.dipy.fa -k 2 -o results/ds.dipy.di.txt

    python3 scripts/fa2kmerAbundanceTable.py -i results/xr.fa -k 1 -o results/xr.nt.txt
    python3 scripts/fa2kmerAbundanceTable.py -i results/xr.fa -k 2 -o results/xr.di.txt
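
Conceptually, a per-position k-mer abundance table counts, for every offset in the reads, how often each k-mer starts there. The sketch below shows one way to compute such a table in Python; it is an independent illustration and not the implementation of the repository's fa2kmerAbundanceTable.py, whose output format may differ. The function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def positional_kmer_fractions(sequences: Iterable[str], k: int) -> Dict[int, Dict[str, float]]:
    """For each 0-based position, return the fraction of reads starting each k-mer there."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()  # treat soft-masked (lower-case) bases like upper-case ones
        for pos in range(len(seq) - k + 1):
            counts[pos][seq[pos:pos + k]] += 1
    return {
        pos: {kmer: n / sum(counter.values()) for kmer, n in counter.items()}
        for pos, counter in sorted(counts.items())
    }


reads = ["ACGT", "acGA"]
print(positional_kmer_fractions(reads, 1))
print(positional_kmer_fractions(reads, 2))
```

With k set to 1 this gives nucleotide content per position, and with k set to 2 the dinucleotide content used to check for dipyrimidine enrichment.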

Because Damage-seq and XR-seq have characteristic sequence features, plot per-position nucleotide enrichment as a quality check.

    conda deactivate
    conda activate rplots
    Rscript scripts/plot_kmer_content.R -i results/xr.nt.txt -k 1 -o results/xr_nuc.png -s "XR-seq nucleotide content" -l results/xr_nuc.log
    Rscript scripts/plot_kmer_content.R -i results/ds.dipy.nt.txt -k 1 -o results/ds_nuc.png -s "DS-seq nucleotide content" -l results/ds_nuc.log
    
    Rscript scripts/plot_kmer_content.R -i results/ds.dipy.di.txt -k 2 -o results/ds_dinuc.png -s "DS-seq dinucleotide content" -l results/ds_dinuc.log -f "CC,CT,TC,TT"
    Rscript scripts/plot_kmer_content.R -i results/xr.di.txt -k 2 -o results/xr_dinuc.png -s "XR-seq dinucleotide content" -l results/xr_dinuc.log -f "CC,CT,TC,TT"

Bam correlations

Assess data quality by comparing replicates and conditions. Replicates should correlate well; different damage types/products should show distinct profiles.

    conda deactivate
    conda activate deeptools

    multiBamSummary bins --bamfiles results/ds.dedup.bam results/xr.dedup.bam --minMappingQuality 20 --outFileName results/bamSummary.npz
    
    plotCorrelation -in results/bamSummary.npz --corMethod spearman --skipZeros --plotTitle "Spearman Correlation of Read Counts" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o results/bamCorr.png
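
To show what a Spearman correlation of binned read counts measures, here is a small pure-Python sketch: counts are converted to ranks (ties get the average rank) and a Pearson correlation is computed on the ranks. The function names and the example counts are hypothetical, and the skip_zeros flag only mimics the idea of dropping bins that are zero in all samples.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float], skip_zeros: bool = True) -> float:
    """Spearman correlation of two count vectors, optionally dropping all-zero bins."""
    pairs = [(a, b) for a, b in zip(x, y) if not (skip_zeros and a == 0 and b == 0)]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))


sample_a = [0, 0, 12, 30, 7]   # hypothetical read counts per bin
sample_b = [0, 0, 15, 41, 9]
print(spearman(sample_a, sample_b))
```

Because the statistic depends only on ranks, it is insensitive to differences in sequencing depth between the two samples.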

Plotting read distribution on genes

deepTools summarizes coverage across regions (e.g., genes). Provide BigWigs and a BED of regions of interest:

    bigwigCompare --bigwig1 results/xr.plus.bw --bigwig2 results/xr.sim.plus.bw --operation ratio --outFileFormat bigwig --outFileName results/xr.norm_sim.plus.bw

    computeMatrix scale-regions -S results/xr.norm_sim.plus.bw -R ref_genome/celeg_genes.bed --outFileName results/xr.norm_sim.plus.on_genes.computeMatrix.out -b 1000 -a 1000 --smartLabels

    plotHeatmap --matrixFile results/xr.norm_sim.plus.on_genes.computeMatrix.out --outFileName results/xr.norm_sim.plus.on_genes.heatmap_k2.png --xAxisLabel "Position with respect to genes" --yAxisLabel "Genes" --kmeans 2 --heatmapWidth 10 --startLabel "Start" --endLabel "End"
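
The ratio step normalizes the observed signal by the simulated (sequence-expected) signal bin by bin. A minimal sketch of that idea follows; the function name is hypothetical, and the pseudocount value is an illustrative assumption for avoiding division by zero (deepTools exposes a --pseudocount option for this purpose, so check its documentation for the value actually applied).

```python
from typing import List, Sequence


def ratio_track(observed: Sequence[float], simulated: Sequence[float],
                pseudocount: float = 1.0) -> List[float]:
    """Per-bin ratio of observed to simulated signal, with a pseudocount on both sides."""
    if len(observed) != len(simulated):
        raise ValueError("tracks must cover the same bins")
    return [(o + pseudocount) / (s + pseudocount) for o, s in zip(observed, simulated)]


observed = [3.0, 0.0, 9.0]   # hypothetical normalized XR-seq coverage per bin
simulated = [1.0, 0.0, 4.0]  # hypothetical simulated coverage for the same bins
print(ratio_track(observed, simulated))
```

Values above 1 indicate more signal than the sequence content alone would predict, and values below 1 indicate less.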

Automated workflow

If you prefer an automated version of the steps in this tutorial, see the Snakemake workflow:

It orchestrates genome preparation, QC, adapter handling, alignment, BED conversions, Damage-seq motif filtering, coverage track generation, simulations, and downstream summaries. Use it when you need a fully reproducible, end-to-end run with minimal manual intervention.
