This project contains a comprehensive RNA-seq analysis pipeline using TopHat for read alignment and Cufflinks for transcript assembly. The analysis compares gene expression between two developmental time points (Day 8 and Day 16) in Arabidopsis thaliana chromosome 4.
Transcriptomic/
├── data/                             # Reference genome and raw data
│   ├── athal_chr.fa                  # Chromosome 4 reference sequence
│   ├── athal_genes.gtf               # Gene annotations
│   ├── Day8.fastq & Day16.fastq      # RNA-seq reads
│   └── atha_chr.*.bt2                # Bowtie2 index files
├── commands/                         # Analysis scripts
│   ├── com.tophat                    # TopHat alignment commands
│   ├── com.cufflinks                 # Cufflinks assembly commands
│   ├── cuffdiff.sh                   # Differential expression pipeline
│   └── GTFs.txt                      # Input file list for cuffmerge
└── result/
    ├── tophat/                       # Alignment results
    │   └── tophat/
    │       ├── Day8/                 # Day 8 alignments (accepted_hits.bam)
    │       └── Day16/                # Day 16 alignments (accepted_hits.bam)
    ├── cufflinks/                    # Transcript assemblies
    │   ├── Day8/                     # Day 8 expression data (transcripts.gtf, *.fpkm_tracking)
    │   └── Day16/                    # Day 16 expression data (transcripts.gtf, *.fpkm_tracking)
    ├── Cuffmerge/                    # Merged transcript annotation
    │   └── merged.gtf                # Combined transcript models from both conditions
    └── Cuffdiff/                     # Differential expression analysis
        ├── gene_exp.diff             # Gene-level differential expression results
        ├── isoform_exp.diff          # Transcript-level differential expression
        └── *.tracking files          # Expression tracking across conditions
The pipeline processes two RNA-seq datasets, one per time point:
- Day8.fastq: Early developmental stage (63,573 reads)
- Day16.fastq: Later developmental stage (57,985 reads)
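The read counts above can be re-checked directly from the FASTQ files. The following is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record; the function name `count_fastq_records` is illustrative and not part of this repository.

```python
def count_fastq_records(lines):
    """Count FASTQ records in an iterable of lines.

    Assumes the conventional 4-line layout (header, sequence, '+', qualities).
    Blank lines (e.g. a trailing newline at end of file) are ignored.
    """
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4
```

For example, `with open("data/Day8.fastq") as handle: print(count_fastq_records(handle))` should agree with the count listed above if the file is unchanged.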
TopHat performs splice-aware alignment of RNA-seq reads to the reference genome, with options for both guided (with GTF) and unguided alignment.
Key Results:
- Day 8: 99.8-99.9% mapping efficiency (63,476-63,489 mapped reads)
- Day 16: 99.9% mapping efficiency (57,943-57,951 mapped reads)
- Spliced alignments: 217-220 (Day 8), 314-319 (Day 16)
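As a sanity check, the mapping percentages can be recomputed from the mapped-read ranges and the input read counts. This sketch assumes the denominator is the total read count listed for each FASTQ file; the helper name `mapping_rate` is illustrative.

```python
def mapping_rate(mapped_reads, total_reads):
    """Return the percentage of input reads that were mapped."""
    return 100.0 * mapped_reads / total_reads

# Input read counts and (lowest, highest) mapped-read counts reported above.
totals = {"Day 8": 63573, "Day 16": 57985}
mapped = {"Day 8": (63476, 63489), "Day 16": (57943, 57951)}

for sample, (low, high) in mapped.items():
    total = totals[sample]
    print(f"{sample}: {mapping_rate(low, total):.2f}% to {mapping_rate(high, total):.2f}%")
```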
Cufflinks reconstructs transcript models from aligned reads and estimates expression levels (FPKM values).
Discovery Summary:
| Sample | Genes | Transcripts | Single-exon | Multi-exon | Single-transcript genes |
|---|---|---|---|---|---|
| Day 8 | 186 | 193 | 114 | 64 | 166 |
| Day 16 | 81 | 93 | 21 | 72 | 63 |
The analysis reveals notable differences between developmental stages:
- Gene Discovery: Cufflinks assembled more genes for Day 8 (186) than for Day 16 (81)
- Splice Complexity: Day 16 has a higher share of multi-exon transcripts (77% vs 36% in Day 8)
- Mapping Quality: Both samples achieve excellent alignment rates (>99.8%)
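The multi-exon percentages can be reproduced from the single-exon and multi-exon counts in the Discovery Summary table. This is a small sketch, assuming the percentage is taken over the sum of those two columns (for Day 8 that sum is 178, not the 193 transcripts listed); the helper name is illustrative.

```python
def multi_exon_percent(single_exon, multi_exon):
    """Percentage of multi-exon transcripts among single- plus multi-exon transcripts."""
    return 100.0 * multi_exon / (single_exon + multi_exon)

# Counts taken from the Discovery Summary table.
print(f"Day 8:  {multi_exon_percent(114, 64):.0f}%")
print(f"Day 16: {multi_exon_percent(21, 72):.0f}%")
```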
The pipeline consists of two main stages:
- Alignment: Run TopHat to map reads to the reference genome
- Assembly: Use Cufflinks to reconstruct transcripts and estimate expression
All commands are documented in the commands/ directory for reproducible analysis.
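The two stages can be sketched as command lines built in Python. This is not a copy of `commands/com.tophat` or `commands/com.cufflinks`; it only uses the standard `-o` (output directory) and `-G` (annotation GTF) options, and every path in the usage note is an assumption based on the directory listing.

```python
def tophat_command(index_basename, reads, out_dir, gtf=None):
    """Build a TopHat call; passing gtf gives a guided alignment."""
    cmd = ["tophat", "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]
    # Positional arguments: Bowtie2 index basename, then the reads file.
    cmd += [index_basename, reads]
    return cmd

def cufflinks_command(bam, out_dir):
    """Build a Cufflinks call that assembles transcripts from aligned reads."""
    return ["cufflinks", "-o", out_dir, bam]
```

A command built this way, for example `tophat_command("data/atha_chr", "data/Day8.fastq", "result/tophat/Day8", gtf="data/athal_genes.gtf")`, could be run with `subprocess.run(cmd, check=True)` on a machine where TopHat is installed.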
TopHat output files:
- align_summary.txt: Mapping statistics and quality metrics
- accepted_hits.bam: Successfully aligned reads
- junctions.bed: Identified splice junctions
Cufflinks output files:
- genes.fpkm_tracking: Gene-level expression estimates
- isoforms.fpkm_tracking: Transcript-level expression data
- transcripts.gtf: Reconstructed transcript models
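To inspect the expression estimates, a `genes.fpkm_tracking` file can be read as a tab-separated table. The sketch below assumes the file has a header row containing `tracking_id` and `FPKM` columns, as Cufflinks tracking files normally do; the function name `top_expressed` is illustrative.

```python
import csv

def top_expressed(lines, n=5):
    """Return the n (tracking_id, FPKM) pairs with the highest FPKM.

    `lines` is any iterable of tab-separated lines whose first line is a header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    rows = [(row["tracking_id"], float(row["FPKM"])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:n]
```

For example, `with open("result/cufflinks/Day8/genes.fpkm_tracking") as handle: print(top_expressed(handle))`, where the path is assumed from the directory listing.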
This pipeline provides a robust framework for comparative transcriptome analysis and differential gene expression studies.
Following transcript assembly, Cuffmerge combines transcript models from both conditions, and Cuffdiff performs statistical comparison of gene expression levels.
Pipeline Steps:
- Cuffmerge: Merged Day 8 and Day 16 transcript assemblies into unified annotation
- Cuffdiff: Compared expression levels between developmental stages
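The two steps above can be sketched as command lines in the same way. This is an illustration rather than the content of `commands/cuffdiff.sh`: it uses cuffmerge's `-g` (reference GTF), `-s` (reference sequence) and `-o` options and cuffdiff's `-o` and `-L` (sample labels) options, and all paths in the usage note are assumptions based on the directory listing.

```python
def cuffmerge_command(assembly_list, out_dir, ref_gtf=None, ref_fasta=None):
    """Build a cuffmerge call from a text file listing the assembly GTFs."""
    cmd = ["cuffmerge", "-o", out_dir]
    if ref_gtf is not None:
        cmd += ["-g", ref_gtf]
    if ref_fasta is not None:
        cmd += ["-s", ref_fasta]
    cmd.append(assembly_list)
    return cmd

def cuffdiff_command(merged_gtf, bams, labels, out_dir):
    """Build a cuffdiff call comparing one BAM file per condition."""
    if len(bams) != len(labels):
        raise ValueError("need exactly one label per BAM file")
    return ["cuffdiff", "-o", out_dir, "-L", ",".join(labels), merged_gtf, *bams]
```

For instance, `cuffdiff_command("result/Cuffmerge/merged.gtf", ["Day8/accepted_hits.bam", "Day16/accepted_hits.bam"], ["Day8", "Day16"], "result/Cuffdiff")` lists Day 8 first, so Day 8 becomes sample 1 in the output tables.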
Sample Comparison (Day 8 vs Day 16):
- Loci processed: 119
- Expression range: 0 to 152,925 FPKM
- Highly expressed genes: Several genes show >10,000 FPKM expression levels
Notable Expression Changes:
- AT4G19200: 152,925 → 11,994 FPKM (about 12.8x decrease, log2FC = -3.67)
- AT4G19210: 39,997 → 20,236 FPKM (about 2x decrease, log2FC = -0.98)
- AT4G19230: 445 → 2,295 FPKM (about 5.2x increase, log2FC = +2.37)
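The log2 fold changes listed above follow from the FPKM pairs. A minimal sketch, assuming the fold change is Day 16 FPKM divided by Day 8 FPKM (Day 8 as sample 1); the helper name is illustrative.

```python
import math

def log2_fold_change(fpkm_day8, fpkm_day16):
    """log2 of the Day 16 / Day 8 FPKM ratio; negative means lower at Day 16."""
    return math.log2(fpkm_day16 / fpkm_day8)

# FPKM pairs (Day 8, Day 16) taken from the list above.
changes = {
    "AT4G19200": (152925, 11994),
    "AT4G19210": (39997, 20236),
    "AT4G19230": (445, 2295),
}
for gene, (day8, day16) in changes.items():
    print(f"{gene}: log2FC = {log2_fold_change(day8, day16):+.2f}")
```

Note that the ratio is undefined when the Day 8 FPKM is zero, so such genes need separate handling.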
Statistical Significance:
- No genes reached statistical significance (q-value < 0.05)
- Limited statistical power due to single biological replicate per condition
- Results suggest biological trends but require additional replicates for validation
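The significance check can be repeated on `gene_exp.diff`. The sketch below assumes the file is tab-separated with a header row containing `test_id`, `status` and `q_value` columns, as Cuffdiff output normally does; the function name `significant_tests` is illustrative.

```python
import csv

def significant_tests(lines, alpha=0.05):
    """Return test_ids that were actually tested (status OK) with q_value < alpha."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row["test_id"]
        for row in reader
        if row["status"] == "OK" and float(row["q_value"]) < alpha
    ]
```

For the run described here, `with open("result/Cuffdiff/gene_exp.diff") as handle: significant_tests(handle)` would be expected to return an empty list, since no gene passed the q-value threshold.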
Key Observations:
- Expression magnitude: Day 8 shows higher overall expression levels
- Gene activity: Many genes with minimal expression in both conditions (NOTEST status)
- Developmental patterns: Clear expression changes between time points, though not statistically validated
Here, I perform RNA-seq analysis to identify genes that are differentially expressed between stages in the development of the Arabidopsis thaliana shoot apical meristem. Samples were collected at day 8 and day 16; cellular mRNA was extracted and sequenced, and the resulting reads ("Day8.fastq" and "Day16.fastq") are the input to the bioinformatics analysis.
The reference genome is in "athal_chr.fa" and the reference gene annotation is in "athal_genes.gtf".