From 1f433232edde97242b3b9867bc700e0fdc343307 Mon Sep 17 00:00:00 2001
From: RuiDong
Date: Tue, 6 Jan 2026 23:13:59 -0500
Subject: [PATCH] Update README with clearer themes and explanations

Refine themes and explanations in the README to clarify genetic association concepts and modeling approaches.
---
 README.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f12cb2c..033d6c8 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,13 @@ As an example, consider allele-specific expression (ASE) QTL analysis. Total exp
 
 ## Overview of Topics
 
-These notes organize into five themes. The first three represent fundamental ways of thinking about genetic data that recur across many applications. The last two address how we adapt our models to specific data types or practical computational constraints. Throughout, the same building blocks, mostly introduced in our ["statgen-primer" notes](https://statfungen.github.io/statgen-primer), appear in different combinations depending on the scientific question.
+These notes are organized into five themes.
+
+The first three themes represent our understanding of how genetic association relates to biological questions. When studying genetic associations, we fundamentally care about one of two things: mapping specific effects to variants or contexts (Theme 1) or predicting aggregate outcomes from polygenic architecture (Theme 2). Theme 3 takes both perspectives further by adding causal inference: we can ask mapping questions (which gene's expression causes disease?) or prediction questions (can we predict disease risk through genetically predicted expression?). Causality is necessarily narrower than association because it requires stronger assumptions, but it builds on the same conceptual foundations.
+
+The last two themes address how we adapt our models to specific data types or practical computational constraints. Theme 4 examines how biological data-generating processes inform model design. Unlike simple GWAS phenotypes, molecular data (RNA-seq counts, methylation proportions, splicing ratios) arise from well-understood biological mechanisms that suggest specific distributional choices and model structures. Theme 5 addresses strategic simplification of rigorous generative models for computational tractability.
+
+Throughout, the same building blocks, mostly introduced in our ["statgen-primer" notes](https://statfungen.github.io/statgen-primer), appear in different combinations depending on the scientific question.
 
 ### Theme 1: Mapping of shared vs. specific effects
 
@@ -55,9 +61,11 @@ It is important to clarify that "causal" here refers to the statistical modeling
 |---------------------|---------------|---------------|
 | Does gene expression causally affect disease? | TWAS as MR, instrumental variables, predicted expression | PrediXcan, FUSION, MultiXcan, mr.mash, CoMM, cTWAS |
 | Does exposure X cause outcome Y? | MR assumptions, horizontal pleiotropy, instrument selection | TwoSampleMR, MR-Egger, MR-PRESSO, MRAID |
+| Which among multiple correlated exposures is the driver? | Multivariate MR (MVMR), direct vs. indirect effects, pleiotropy correction | MVMR-Egger, MVMR-cML, GRAPPLE, Cis-MRBEE |
 | Can we distinguish causality from pleiotropy? | Horizontal pleiotropy testing, robust MR | PMR-Egger, CAUSE |
 | How do we integrate QTL and GWAS evidence for causality? | multi-omics MR | SMR, ... |
 
+
 ### Theme 4: Generative models for molecular phenotypes
 
 This theme examine specific generative modeling to different molecular data types. Unlike GWAS where the phenotype is relatively simple (a quantitative trait or case-control status), molecular phenotypes have complex data generating processes that we understand from biology. Building generative models that respect these structures can improve power and interpretation.
@@ -74,7 +82,9 @@ This is the Lego analogy in action: we know the biology of RNA-seq count data or
 
 ### Theme 5: Scalability and computational approximations
 
-As genetic datasets grow to biobank scale (hundreds of thousands to millions of individuals), statistical methods often become computationally intractable. This theme addresses practical approximations that particularly enable analysis at scale while preserving the core conceptual framework.
+
+
+Themes 1-4 emphasize building full generative models that capture biological reality. These models are conceptually complete but often become computationally intractable as genetic datasets grow to biobank scale (hundreds of thousands to millions of individuals). Theme 5 addresses how we strategically simplify these generative models through approximations that preserve the core conceptual framework while enabling computation. The key is understanding what we're dropping and why it still works. We emphasize these tradeoffs to help practitioners choose between rigorous generative models and scalable approximations.