Skip to content

leelabsg/META_SAIGE

Repository files navigation

Meta-SAIGE for Rare Variant Meta-Analysis

Description

Meta-SAIGE is a meta-analysis tool for rare variant association studies. It is designed to combine the results of multiple cohorts and perform a meta-analysis. Meta-SAIGE is built on top of SAIGE/SAIGE-GENE+ and can be used to perform meta-analysis on the summary statistics of SAIGE and LD matrix from SAIGE-GENE+.

Meta-SAIGE was applied to UK Biobank and All of Us data for the meta-analysis. The code for the meta-analysis is available in the GitHub repository under UKB_AllofUS_Meta-analysis.md. The analysis results can be accessed on a PheWEB-like web server at

Workflow Overview

screenshot

Dependencies (R>=4.4.1)

Installation (typically 2~3 minutes)

Important: The SKAT package must be installed from GitHub, not CRAN, as the CRAN version is outdated.

# Install remotes if not already installed
install.packages('remotes')
library(remotes)

# Install SKAT from GitHub (required - do not use CRAN version)
install_github('leelabsg/SKAT')

# Install MetaSAIGE
install_github('git@github.com:leelabsg/SAIGE_META.git')
library(MetaSAIGE)

Summary statistics generation from each cohort

  • input files description

  • GWAS summary : can be obtained from SAIGE

# Example command for SAIGE
# Step 1: Fitting the Null Model

Rscript step1_fitNULLGLMM.R \
    --sparseGRMFile=${sparse_grm_file} \                          # File containing sparse GRM (genomic relatedness matrix)
    --sparseGRMSampleIDFile=${sparse_grm_sample_id_file} \        # File containing sample IDs for the sparse GRM
    --useSparseGRMtoFitNULL=TRUE \                                # Use sparse GRM to fit the null model
    --plinkFile=${plink_prefix} \                                 # Prefix for PLINK files
    --phenoFile=${pheno_file} \                                   # File containing phenotype data
    --phenoCol=${pheno_col} \                                     # Column name of the phenotype in the phenotype file
    --covarColList=${covar_list} \                                # List of covariates to include in the model (comma-separated)
    --sampleIDColinphenoFile=IND_ID \                             # Column name of the sample IDs in the phenotype file
    --traitType=binary \                                          # Type of the trait (binary in this case)
    --nThreads=1 \                                                # Number of threads to use (1 in this case)
    --outputPrefix=${output_dir} \                                # Prefix for output files
    --IsOverwriteVarianceRatioFile=TRUE \                         # Whether to overwrite existing variance ratio files
    --LOCO=TRUE                                                   # Perform Leave-One-Chromosome-Out analysis

# Step 2: Single variant association testing

Rscript step2_SPAtests.R \
    --bedFile=${plink_prefix}.bed \                               # PLINK .bed file 
    --bimFile=${plink_prefix}.bim \                               # PLINK .bim file 
    --famFile=${plink_prefix}.fam \                               # PLINK .fam file 
    --AlleleOrder=alt-first \                                     # Allele order for testing (or ref-first)
    --SAIGEOutputFile=${output_dir}.txt \                         # Output file for SAIGE step2 results
    --chrom=2 \                                                   # Chromosome to analyze (chromosome 2 in this case)
    --minMAF=0 \                                                  # Minimum minor allele frequency (MAF) threshold
    --minMAC=0.5 \                                                # Minimum minor allele count (MAC) threshold
    --is_overwrite_output=TRUE \                                  # Whether to overwrite existing output files
    --GMMATmodelFile=${step1_GMMATmodelFile}.rda \                # File containing the fitted null model from Step 1
    --varianceRatioFile=${step1_varianceRatioFile}.txt \          # File containing the variance ratio from Step 1
    --maxMAF=0.01 \                                               # Maximum minor allele frequency (MAF) threshold
    --is_Firth_beta=TRUE \                                        # Whether to use Firth's correction for rare variants
    --pCutoffforFirth=0.05 \                                      # p-value cutoff for applying Firth's correction
    --is_output_moreDetails=TRUE \                                # Output additional details in the result file (crucial for GC-based SPA tests)
    --max_MAC_for_ER=10 \                                         # Maximum MAC for Efficient Resampling
    --LOCO=TRUE                                                   # Perform Leave-One-Chromosome-Out analysis

LD matrix could be generated by running the following command in SAIGE-GENE+:
step3_LDmat.R \
    --bedFile=${plink_file_prefix}.bed \ #For example, if the plink file is called "example", then the bed file is "example.bed"\
    --bimFile=${plink_file_prefix}.bim \ #For example, if the plink file is called "example", then the bim file is "example.bim"\
    --famFile=${plink_file_prefix}.fam \ #For example, if the plink file is called "example", then the fam file is "example.fam"\
    --AlleleOrder=alt_first \ # or ref_first\
    --SAIGEOutputFile=${SAIGE_output_file} \ #The output directory\
    --chrom=${chromosome_number} \  #The chromosome number \
    --groupFile=${group_file} \ #Group file for gene-based analysis. example provided in extdata/test_input/groupfiles\
    --annotation_in_groupTest=${annotation_file} \ #Functional annotation for the variants of interest. ex. 'missense;lof', 'synonymous;lof'\
    --maxMAF_in_groupTest=0.01 \ #Maximum MAF for group-based analysis ex. 0.01 0.001 0.0001

Step3 generates a sparse LD matrix for each gene using the specified gene_file_prefix. This prefix corresponds to the gene-based LD matrix files and matches the prefix of the marker_info.txt file. Additionally, Step3 produces the marker_info.txt file, which contains variant information within the LD matrix and serves as an input for Meta-SAIGE. Examples are provided in the extdata/test_input directory.

Rscript Usage

Meta-SAIGE can be run using Rscript. The following function is available for running Meta-SAIGE in Rscript (example provided in extdata/test_run.R).

Running Meta-Analysis

Run_MetaSAIGE(n.cohorts, chr, gwas_path, info_path, gene_file_prefix, col_co, output_path, ancestry, trait_type, groupfile, annotation, mafcutoff, pval_cutoff = 0.01, GC_cutoff = 0.05, verbose = TRUE, selected_genes = NULL)
  • n.cohorts : number of cohorts
  • chr : chromosome number
  • gwas_path : path to the GWAS summary. Need to specify GWAS summary file from each and every cohort delimited by white-space (' ')
  • info_path : path to the marker_info.txt file generated from SAIGE 'step3_LDmat.R'. Need to specify marker_info.txt file from each and every cohort delimited by white-space (' ')
  • gene_file_prefix : prefix to the LD matrix separated by genes (also generated from SAIGE step3_LDmat.R) usually same as marker_info.txt file's prefix
  • col_co : ultra-rare variant collapsing cut-off. (default is 10)
  • output_path : path for the meta-analysis resutls
  • verbose(optional): verbose mode. TRUE or FALSE
  • ancestry(optional) : Ancestry indicator (ex. 1 1 1 2). Need to specify ancestry indicator from each and every cohort delimited by white-space (' '). Optional input for multi-ancestry meta-analysis.
  • trait_type : trait type (binary or continuous)
  • groupfile(optional) : path to the group file for gene-based analysis (Same format as SAIGE-GENE+)
  • annotation(optional) : functional annotation for the variants of interest. ex. c('lof', 'missense_lof')
  • mafcutoff(optional) : Maximum MAF for group-based analysis ex. c(0.01 0.001 0.0001)
  • pval_cutoff(optional) : P-value cutoff for SKAT-O. Default is 0.01
  • GC_cutoff(optional) : GC adjustment cutoff. Default is 0.05
  • selected_genes(optional) : Vector of gene names to analyze instead of all genes. This allows you to focus the analysis on specific genes of interest, which can significantly reduce computation time. Example: c("GCK", "APOE", "PCSK9")

CLI Usage

Meta-SAIGE can also be run using the command line interface. The following arguments are available for running Meta-SAIGE in the command line (example provided in extdata/test_run_GC.sh).

Running Meta-Analysis

  • --num_cohorts : number of cohorts
  • --chr : chrmosome number
  • --gwas_path : path to the GWAS summary. Need to specify GWAS summary file from each and every cohort delimited by white-space (' ')
  • --info_file_path : path to the marker_info.txt file generated from SAIGE 'step3_LDmat.R'. Need to specify marker_info.txt file from each and every cohort delimited by white-space (' ')
  • --gene_file_prefix : prefix to the LD matrix separated by genes (also generated from SAIGE step3_LDmat.R) usually same as marker_info.txt file's prefix
  • --col_co : ultra-rare variant collapsing cut-off. (default is 10)
  • output_prefix: directory for output
  • verbose: verbose mode. TRUE or FALSE
  • --ancestry(optional) : Ancestry indicator (ex. 1 1 1 2). Need to specify ancestry indicator from each and every cohort delimited by white-space (' '). Optional input for multi-ancestry meta-analysis.
  • trait_type : trait type (binary or continuous)
  • --groupfile(optional) : Path to the group file. This file should include gene annotations, grouping variants by genes or other relevant units (e.g., UKBexomeOQFE_chr7.gene.anno.hg38_PlinkMatch_v2.txt).
  • --annotation(optional) : Annotation types, typically variant effect categories such as lof (loss of function) or missense_lof (missense and loss of function). Multiple annotations can be specified.
  • --mafcutoff(optional) : Minor allele frequency cutoff values. You can specify multiple thresholds (e.g., 0.01 0.001).
  • --pval_cutoff(optional) : P-value cutoff for SKAT-O. Default is 0.01
  • --GC_cutoff(optional) : GC adjustment cutoff. Default is 0.05
  • --selected_genes(optional) : Comma-separated list of gene names to analyze instead of all genes (e.g., "GCK,APOE,PCSK9"). This allows you to focus the analysis on specific genes of interest, which can significantly reduce computation time.

example commands for GC-based method:
#!/bin/bash
cd META_SAIGE

Rscript inst/scripts/RV_meta_GC.R \
    --num_cohorts 2 \
    --trait_type binary \
    --chr 7 \
    --col_co 10 \
    --info_file_path extdata/test_input/cohort1/LD_mat/cohort1_chr_7.marker_info.txt \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7.marker_info.txt \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7.marker_info.txt \
    \
    --gene_file_prefix extdata/test_input/cohort1/LD_mat/cohort1_chr_7_ \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7_ \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7_ \
    \
    --gwas_path extdata/test_input/cohort1/GWAS_summary/t2d_cohort1_step2_res_7.txt \
    extdata/test_input/cohort2/GWAS_summary/t2d_cohort2_step2_res_7.txt \
    extdata/test_input/cohort2/GWAS_summary/t2d_cohort2_step2_res_7.txt \
    \
    --output_prefix extdata/test_output/GC_t2d_chr7_0.01_missense_lof_res_CLI.txt \
    --verbose TRUE \
    --groupfile extdata/test_input/groupfiles/UKBexomeOQFE_chr7.gene.anno.hg38_PlinkMatch_v2.txt \
    --annotation lof missense_lof \
    --mafcutoff 0.01 0.001

Example: Running Meta-SAIGE for specific genes only

#!/bin/bash
cd META_SAIGE

# Analyze only the GCK gene
Rscript inst/scripts/RV_meta_GC.R \
    --num_cohorts 2 \
    --trait_type binary \
    --chr 7 \
    --col_co 10 \
    --info_file_path extdata/test_input/cohort1/LD_mat/cohort1_chr_7.marker_info.txt \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7.marker_info.txt \
    \
    --gene_file_prefix extdata/test_input/cohort1/LD_mat/cohort1_chr_7_ \
    extdata/test_input/cohort2/LD_mat/cohort2_chr_7_ \
    \
    --gwas_path extdata/test_input/cohort1/GWAS_summary/t2d_cohort1_step2_res_7.txt \
    extdata/test_input/cohort2/GWAS_summary/t2d_cohort2_step2_res_7.txt \
    \
    --output_prefix extdata/test_output/GC_t2d_chr7_GCK_only.txt \
    --verbose TRUE \
    --groupfile extdata/test_input/groupfiles/UKBexomeOQFE_chr7.gene.anno.hg38_PlinkMatch_v2.txt \
    --annotation lof missense_lof \
    --mafcutoff 0.01 0.001 \
    --selected_genes GCK

Output

About

Github for SAIGE_META

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •