This repository contains a modular workflow for performing joint variant discovery following GATK4 Best Practices.
The pipeline is organized into three sequential steps, each encapsulated in its own script, plus a wrapper to run the full process end-to-end.
Note: The scripts are under active development. Logic, style, and usage will converge as the modules mature.
- Step 01 — FASTQ Alignment (in development)
- Input: raw FASTQ reads
- Process: align reads to a reference genome (using
bwa mem), produce raw BAM/CRAM files, generate alignment statistics. - Output: coordinate-sorted, indexed BAM files.
- Step 02 — Mapped BAM QC (this script)
- Input: mapped BAM from Step 01
- Process: perform read group assignment, duplicate removal, mapping quality filtering, base quality recalibration, and coverage estimation.
- Output: quality-controlled, analysis-ready BAM files.
- Step 03 — Joint Variant Calling (in development)
- Input: QC’d BAMs from Step 02
- Process: call SNPs/indels, generate GVCFs, perform joint genotyping, and filter variants.
- Output: multi-sample VCF/BCF ready for downstream analysis.
- Wrapper Script (in development)
- Runs steps 01 → 03 in tandem, managing intermediate outputs and sanity checks.
This module prepares mapped BAMs for downstream variant calling, ensuring high-quality alignments and base quality scores.
gatk_mapped_bam_qc.sh [INPUT_BAM] [OUTPUT_PATH]INPUT_BAM: BAM file aligned to reference genome (output of Step 01).OUTPUT_PATH: destination folder for QC outputs (default: current working directory).
The script produces:
${BAM}.rg.bam— BAM with read groups assigned${BAM}.rmdup.bam— duplicate-removed BAM${BAM}.rmdup.mqfilt.bam— mapping-quality filtered BAM${BAM}.rmdup.mqfilt.bqsr.bam— recalibrated BAM ready for variant calling- Coverage and quality metrics (
.txt,.pdf,.summary.txt)
-
Assign Read Groups
GATKAddOrReplaceReadGroupsfor downstream compatibility. -
Remove Duplicates
GATKMarkDuplicatesSparkto discard PCR duplicates. -
Mapping Quality Filtering
samtools viewwith mapping quality threshold (-q 30). -
Base Quality Recalibration (BQSR)
GATKBaseRecalibratorandApplyBQSRusing known variant sites. -
Coverage Calculation
mosdepthsummarizing per-sample coverage. -
Optional Metrics
GATKCollectAlignmentSummaryMetricsandCollectInsertSizeMetrics.
The pipeline expects the following tools available in $PATH (modules will be loaded automatically if run on the FENIX remote HPC system):
gatk(v4.x)samtools(>=1.15)mosdepth(>=0.3.x)fastqc(>=0.12.1)R(for optional plots)
Reference data required (configured via config/config.yaml):
- Reference genome FASTA (
ref_gnm) with index - Known variant sites VCF (
ref_vars)
# Step 01 - Align FASTQs (placeholder)
bash align_fastq_reads.sh sample_R1.fastq.gz sample_R2.fastq.gz ./results/align/
# Step 02 - QC mapped BAM
bash gatk_mapped_bam_qc.sh ./results/align/sample.bam ./results/qc/
# Step 03 - Joint variant calling (placeholder)
bash joint_variant_calling.sh ./results/qc/ ./results/variants/
# Wrapper - Full pipeline run
bash run_joint_calling_pipeline.sh sample_config.yaml- Consistency: Each script follows the same input-output convention and sanity checks.
- Extensibility: Steps are modular; users can run only the portions relevant to their analysis.
- Portability: Scripts detect whether they are running on a local machine or a remote HPC environment and adapt accordingly.
- Pavel Salazar-Fernandez (current maintainer): epsalazarf@gmial.com
- Dr. Federico Sanchez-Quinto (project leader)