A study demonstrated that Linear Mixed Models (LMMs) have improved power in detecting effects when compared to linear regression models(LMs). However, this study does not extend its comparison to the inflation of Type I errors between these methodologies. Given the trade-off between Type I and Type II errors in statistical analyses, the absence of such a comparison leaves a gap in our understanding. Therefore, our project proposes an investigation into the statistical errors by LMMs and LMs. By systematically comparing the Type I and Type II error rates of both models, this project aims to illuminate their statistical efficacy and robustness.
Since the genotype data is not available, we will simulate the genotype data using plink software. The genotype data will be simulated for 100, 1k, 10k samples and linkage equlibrium 10k SNPs. The genotype data will be simulated using the following command:
bash simulate_bed_files.sh $BASEDIR
Genetic PCA was performed to generate the covariate file for the linear regression model. The genetic PCA was performed using the following command:
bash generate_pcs.sh $BASEDIR
Two different sets of phenotypes were generated for each sample size. The first set of phenotypes was generated for type 1 error evaluation and the second set of phenotypes was generated for type 2 error evaluation. All phenotypes were generated using the following command:
bash phenotype_generation.sh
100 null phenotypes for type 1 error evaluation were generated using phenotype_generation_1.R script.
-argparser
-data.table
-fam : plink fam file
-n_phenotypes : number of phenotypes to generate
-out : output file prefix
100 causal phenotypes for type 2 error evaluation were generated using phenotype_generation_2.R script.
-argparser
-data.table
-seqminer
-bfile : plink binary bed file
-n_phenotypes : number of phenotypes to generate
-n_negative_causal_variants : number of causal variants to generate
-n_positive_causal_variants : number of causal variants to generate
-negative_beta_min : range of beta values
-negative_beta_max : range of beta values
-positive_beta_min : range of beta values
-positive_beta_max : range of beta values
-out : output file prefix
The simple linear regression model is implemented in the attached LR.R file. The file is implemented using the following dependencies
-argparser
-data.table
-dplyr
-seqminer
-bfile : plink file prefix'
-pheno : phenotype file
-covar : covariate file
-out : output file prefix
Rscript /home/eup009/cse284/LinearRegression/LR.R \
--bfile /home/eup009/cse284/CSE_284_Sim/simulate.100 \
--covar /home/eup009/cse284/CSE_284_Sim/simulate.100.eigenvec \
--pheno /home/eup009/cse284/CSE284_phenotype/100_samples/100_samples_set_1.pheno \
--out /home/eup009/cse284/output/100/LM_res_seed$1
To test the Linear Mixed Model, we used the EMMAX software. EMMAX is a mixed model association test that accounts for population structure and relatedness. EMMAX uses a linear mixed model to account for the relatedness between individuals. The EMMAX software was used to compute the association study across all simulated phenotypes.
Pre-compiled executable source: https://csg.sph.umich.edu//kang/emmax/download/index.html
bash make_IBS_matrices.sh $BASEDIR $EMMAX_KIN_PATH $OUTBASE
bash run_association.sh $BASEDIR $PHENOBASE $OUTBASE $EMMAX_KIN_PATH $EMMAX_PATH
The type I error evaluation was performed by the False Positive Rate (FPR) of the linear regression model and the linear mixed model. Furthermore, the p-value distribution was plotted in a qqplot to visualize the inflation of the p-values. qqplot was generated using the plot_res.R script.
bash type1_script.sh $BASEDIR $PHENOBASE $OUTBASE ./
Rscript plot_res_LR.R $OUTBASE/output $OUTBASE/plots
Rscript plot_res_emmax.R $OUTBASE/output $OUTBASE/plots
bash type2_script.sh $BASEDIR $PHENOBASE $OUTBASE ./
Rscript plot_power.R $OUTBASE/output $OUTBASE/plots