Skip to content

helianfeixing/ISSAAC-seqV2

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ISSAAC-seq preprocessing pipeline v2

ISSAAC-seq is a single-cell multiomic method for simultaneous profiling of chromatin accessibility and gene expression from the same cell. This repository contains an updated version of the preprocessing pipeline compared to the original publication back in 2022.

Raw Sequencing Data

The 10x Genomics workflow generates data in the same format as the original publication, and you can find the raw data (fastq files) from ArrayExpress under the accession number E-MTAB-11264. For the plate-based workflow, we now put the well barcode to the i5 position so that it has the same library format as the 10x Genomics workflow, and the raw data generated with PBMCs from a healthy individual can be found under the accession number E-MTAB-15131. The Bio-Rad workflow has a different library format, and the raw data can be found under the same accession number E-MTAB-15131.

Analysing your own data

The pipeline chains a number of commonly used programmes together, which means we need to make sure all of them are properly set up. Luckily, we only need to do this once at the very beginning of our project.

Step 1: Software Installation

The following programmes and packages are needed to run the full pipeline. Make sure they are executable and in our $PATH so that they can be invoked directly without having to specify the absolute path.

To convert the bdg file generated by MACS2 to the bw format for read pileup visualisation on the UCSC genome browser, we also need the bdg2bw script. In this pipeline, a modified version of the script is used, and the script is provided in the misc_files folder. We need to move it to our $PATH and make sure it is executable.

The pipeline also uses python3 and the following python packages, all of which can be installed via miniconda or the like:

scipy v1.11.4
numpy v1.26.0
pandas v2.1.2
matplotlib v3.9.1

Step 2: Preparation of genome files

Once we have finished installing all the programmes and packages, we need to prepare some files related to the genome of our interest. First, we generate star index for the genome:

STAR \
    --runThreadN <num_threads> \
    --runMode genomeGenerate \
    --genomeDir <output_dir_that_will_host_the_output_files> \
    --genomeFastaFiles <path_to_the_genome_fasta_file> \
    --sjdbGTFfile <path_to_the_gene_annotation_gtf_file>

Then we generate the chromap index for the genome:

chromap \
    -i \
    -t <num_threads> \
    -r <path_to_the_genome_fasta_file> \
    -o <output_index_file>

Next, we need to provide a whitelist for the cell barcodes, which are the index sequences on the gel beads (droplet) or the i5 position of the primers (plate). Since we are using the 10x Single Cell ATAC kit, we need the 16-bp cell barcodes from the cellranger-atac software. I have prepared this file together with our plate barcode related files in the whitelist directory of this repository, like this:

whitelist/
├── 737K-cratac-v1_rc.txt                            # 10x single cell atac whitelist
├── ISSAACv2_plate_whitelist.tsv                     # whitelist for plate-based method (16 bp)
├── ISSAACv2_plate_whitelist_first8.tsv              # whitelist for plate-based method (first 8 bp, useful when only 8 cycles are sequenced)
├── ISSAACv2_plate_whitelist_with_wellid.tsv         # tab-delimited file indicating the well position of each barcode (16 bp), used to draw qc
└── ISSAACv2_plate_whitelist_first8_with_wellid.tsv  # tab-delimited file indicating the well position of each barcode (the first 8 bp), used to draw qc

Then, we need to prepare the blacklist regions for the genome, which is the problematic regions with abnormally high signals in many sequencing experiments. The details can be found in the ENCODE blacklist paper and the corresponding GitHub repository. I have provided the blacklist regions of human (hg38) and mouse (mm10) under the misc_files directory of this repository.

Another file that is needed in various steps within the pipeline is the chromosomal sizes file. It is a tab-delimited file with two columns. The first column is the name of the chromosome, and the second column is the size in bp of the chromosome. We can generate the file of the genome of our interest using the binary faSize and the genome fasta file like this:

faSize -detailed <genome_fasta> | \
    sort -k1,1 > chrom.sizes.sorted.tsv

I have provided the files for human (hg38) and mouse (mouse) under the misc_files directory of this repository.

Step 3: Preparation of your own data

For your own data, you need the fastq files. Both the plate and the droplet workflows of ISSAAC-seq generate the same library layout as the 10x Genomics Single Cell ATAC kit. Therefore, when we send our libraries for sequencing in our genomic core facility or a service provider company, we need to tell them the libraries are from 10x Genomics Single Cell ATAC kit. Then they should know what to do and return three fastq files to us.

If we are sequencing the libraries by ourselves, we need create a file called SampleSheet.csv. I have put two example sample sheets under the misc_files directory, one for NextSeq and the other for NovaSeq. Once the sequencing is finished, we could run bcl2fastq in the following way:

# change the Y150 to the read length of your choice
bcl2fastq --use-bases-mask=Y150,I8,Y16,Y150 \
    --create-fastq-for-index-reads \
    --no-lane-splitting \
    --ignore-missing-positions \
    --ignore-missing-controls \
    --ignore-missing-filter \
    --ignore-missing-bcls \
    --minimum-trimmed-read-length 0 \
    --mask-short-adapter-reads 0 \
    -r 4 -w 4 -p 4 \
    -o <output_dir>

which would yield four fastq files under <output_dir>:

<sample_name>_I1_001.fastq.gz
<sample_name>_R1_001.fastq.gz
<sample_name>_R2_001.fastq.gz
<sample_name>_R3_001.fastq.gz

The <sample_name>_I1_001.fastq.gz file can be safely ignored, and we should put the other three files into the fastq directory of this repository to start the pipeline. Here, I have provide example files of both ATAC and RNA with 1 million reads so that you can test the pipeline.

Step 4: Modify config.json

Now we need to modify a few things in the config.json file to specify the locations of certain files and the parameters of the run. We need to change the following content according to out computing environment:

{
    "rna_cdna_fq" : "<path to RNA_R1_001.fastq.gz>",
    "rna_cb_fq" : "<path to RNA_R2_001.fastq.gz>",
    "rna_umi_fq" : "<path to RNA_R3_001.fastq.gz>",
    "atac_r1_fq" : "<path to ATAC_R1_001.fastq.gz>",
    "atac_cb_fq" : "<path to ATAC_R1_001.fastq.gz>",
    "atac_r2_fq" : "<path to ATAC_R1_001.fastq.gz>",
    "whitelist" : "<path to 737K-cratac-v1_rc.txt or ISSAACv2_plate_whitelist.tsv>",
    "chromap_bin" : "<path to the chromap binary>",
    "chromap_idx" : "<path to genome.index generated by chromap>",
    "chromap_fa" : "<path to the reference genome fasta>",
    "blacklist" : "<path to the ENCODE blacklist>",
    "chrom_size" : "<path to chrom.sizes.sorted.tsv>",
    "star_bin" : "<path to the STAR binary>",
    "star_idx" : "<path to the directory that contains the STAR index files>",
    "mmNmax" : "20",
    "star_ft" : "GeneFull",
    "star_trim3" : "<number of nt that will be trimmed from the 3' end of the cDNA reads>",
    "star_configbc" : "CB_UMI_Simple",
    "star_cbumipos" : "--soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10",
    "star_strand" : "Forward",
    "star_samattr" : "NH HI CB UB GX GN",
    "gsize" : "<this is the macs2 -g flag>",
    "macs2_shift" : "--nomodel --shift -37 --extsize 75"
}

Step 5: Run the pipeline

Once we have done all the previous steps, our working directory should look like this:

Experiment
├── bsub.sh
├── cluster.json
├── config.json
├── fastq
│   ├── mCortex-ATAC_R1_001.fastq.gz
│   ├── mCortex-ATAC_R2_001.fastq.gz
│   ├── mCortex-ATAC_R3_001.fastq.gz
│   ├── mCortex-RNA_R1_001.fastq.gz
│   ├── mCortex-RNA_R2_001.fastq.gz
│   └── mCortex-RNA_R3_001.fastq.gz
├── scripts
│   ├── collect_atac_qc.py
│   └── generate_csc_mtx.py
├── Snakefile
└── submit_snake.sh

To run the pipeline using all available cores, simply run:

snakemake --cores

We can also submit the jobs by bsub if we use IBM LSF. This can be done by simply run bash bsub.sh which will inovke submit_snake.sh and use the configurations specified in the cluster.json file to request resource. If LSF is not used, those three files can be safely ignored.

Citations

If you use the preprocessing pipeline described here, please cite the following two papers together:

  1. Xu W, Yang W, Zhang Y, Chen Y, Hong N, Zhang Q, Wang X, Hu Y, Song K, Jin W, Chen X (2022) ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat Methods 19:1243–1249. https://doi.org/10.1038/s41592-022-01601-4
  2. Manuscript under review, to be updated soon.

Contact

Xi Chen
chenx9@sustech.edu.cn

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 93.9%
  • Shell 6.1%