CompBioIPM/scGATE

scGATE: single-cell gene regulatory gate inference

scGATE (single-cell gene regulatory gate) is a logic-based model for deciphering tissue- and cell-type-specific gene regulatory networks from single-cell RNA sequencing (scRNA-seq) data. While previous efforts have focused on reconstructing directed transcription factor (TF) to target gene networks, logic-based models enable the exploration of more complex combinatorial relationships between regulators. In particular, Boolean logic models can capture higher-order TF interactions, represented through AND, OR, and XOR logic gates. scGATE infers TF-gene networks from scRNA-seq data while simultaneously elucidating the underlying Boolean logic that combines TF activities. The methodology integrates external genomic data, such as single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) and motif analysis, to narrow down candidate TFs for each target gene. TF-gene links are then refined using scRNA-seq data, and logic rules are derived. To enhance statistical power, scGATE focuses on reconstructing context-specific regulatory networks, including tissue- or cell-type-specific networks. The result is an integrative framework for deducing cell-type-specific gene regulation that moves beyond TF-gene pairs to capture the logic operations underlying combinatorial control.
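
To make the gate terminology concrete, the following base R sketch tabulates how AND, OR, and XOR gates combine the on/off states of two regulators. The names tf1 and tf2 are hypothetical placeholders, and the table is a plain truth table rather than scGATE output.

```r
# Toy Boolean activity states for two hypothetical TFs (not scGATE output).
states <- expand.grid(tf1 = c(FALSE, TRUE), tf2 = c(FALSE, TRUE))

# Each gate maps the pair of TF states to a state of the target gene.
states$AND <- states$tf1 & states$tf2     # target on only if both TFs are on
states$OR  <- states$tf1 | states$tf2     # target on if at least one TF is on
states$XOR <- xor(states$tf1, states$tf2) # target on if exactly one TF is on

print(states)
```

A pairwise TF-gene edge list cannot distinguish these three cases, which is the motivation for inferring the gate together with the network.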


Malekpour, S.A., Haghverdi, L., Sadeghi, M., Single-cell multi-omics analysis identifies context specific gene regulatory gates and mechanisms. Briefings in Bioinformatics 25 (3), 2024, bbae180.
If you find our study useful and relevant to your research, please kindly cite us. Your citation means a lot to us and helps acknowledge our contributions.


Fig1

Step 1. scGATE installation

scGATE is written in R (version 4.1.3) and has been tested on both Windows and Linux.

Installation

  1. Download the compiled package file scGATE_0.1.0.tar.gz from this GitHub page.

  2. Install the scGATE package by running the following command in R:

    install.packages("path/to/scGATE_0.1.0.tar.gz", repos = NULL, type = "source")

Dependencies

Please ensure that you have the following packages installed:

install.packages("VGAM")  
install.packages("truncnorm")
install.packages("arrow")
install.packages("doParallel")
install.packages("foreach")
install.packages("doSNOW")

In order to run scGATE with parallel computing, the packages doParallel, foreach, and doSNOW need to be installed.
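
Before installing, it can help to check which of the listed dependencies are already present. The sketch below uses only base R; missing_packages() is a hypothetical helper written for this example and is not part of scGATE.

```r
# Packages listed in the Dependencies section above.
required <- c("VGAM", "truncnorm", "arrow", "doParallel", "foreach", "doSNOW")

# Hypothetical helper (not part of scGATE): report which packages are absent.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them from CRAN
} else {
  message("All listed dependencies are installed.")
}
```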

To load the packages, use the following commands:

library(scGATE)  
library(VGAM)  
library(truncnorm)  
library(arrow) 

Step 2. Prepare input files

Preprocessing base GRN generated from external hints

The base GRN is generated beforehand from external hints, such as scATAC-seq and TF binding motif analyses, and stored in ".parquet" format. To summarize the information in this file, use the read_base_GRN() function from the scGATE package.

# Read and summarize base GRN file
candidate_tf_target <- as.data.frame(read_parquet("Buenrostro2018_base_GRN_dataframe.parquet"))
candidate_tf_target <- read_base_GRN(candidate_tf_target)

Preprocessing scRNA-seq count data

To preprocess raw scRNA-seq data, including steps such as normalization and rescaling, you can use the scRNA_seq_preprocessing() function from the scGATE package.

# Preprocess scRNA-seq count data
normalized_counts <- scRNA_seq_preprocessing(data = data_scRNA_seq, library_size_normalization = "True", tf_list = NA)

Parameter Descriptions

# data                       The scRNA-seq raw data matrix with cells in rows and genes in columns.

# library_size_normalization A flag indicating whether library size normalization should be performed.
#                            The default value is "True".
#                            Set it to "False" if you don't want to perform library size normalization.
  
# tf_list                    A list of transcription factors (TFs) to consider.
#                            The default value is NA, which means all columns in the data matrix will be considered as TFs.  
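
The two preprocessing ideas named above, library size normalization and rescaling into the (0,1) interval, can be sketched in base R as follows. This is an illustration under stated assumptions and not the internal code of scRNA_seq_preprocessing(): the toy counts are made up, scaling to the median library size is one common convention, and the rank-based rescaling is a simple stand-in for the quantile normalization the package applies.

```r
# Toy raw count matrix: 4 cells (rows) x 3 genes (columns); values are made up.
counts <- matrix(c(0, 2, 5,
                   1, 0, 9,
                   3, 4, 2,
                   6, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:4), c("g1", "g2", "g3")))

# Library size normalization: scale every cell to the median library size.
lib_size <- rowSums(counts)
norm_counts <- counts / lib_size * median(lib_size)

# Rank-based rescaling of each gene into the open interval (0, 1).
# Assumption: a stand-in for the package's quantile normalization step.
to_unit_interval <- function(x) rank(x, ties.method = "average") / (length(x) + 1)
scaled <- apply(norm_counts, 2, to_unit_interval)

print(round(scaled, 2))
```

Dividing the ranks by n + 1 rather than n keeps every value strictly below 1, which matches the open (0,1) interval that the inference functions expect.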

Step 3. Run scGATE

scGATE provides two functions for TF-target network inference: scGATE_logic() and scGATE_edge(). They infer the TF-target network with and without predicted Boolean logic gates in the output, respectively. scGATE_logic() is better suited to small networks or to cases where a base gene regulatory network (GRN) is available from external sources such as scATAC-seq and TF motif data.

TF-Target Network Inference (gate mode)

To infer the TF-target network with logic gates in the output, you can use the scGATE_logic() function.

# Infer TF-target network with logic gates in the output
gates <- scGATE_logic(data = data, base_GRN = NA, h_set = NA, number_of_em_iterations = NA, max_num_regulators = NA, abs_cor = NA, top_gates = NA, run_mode = NA)
print(head(gates))

Parameter Descriptions

# data                    A gene expression matrix with normalized counts within the (0,1) interval,
#                         where samples are represented as rows and genes as columns.
#                         The gene expression matrix should have been preprocessed using the scRNA_seq_preprocessing() function.

# base_GRN                Base TF-gene interaction network derived from external hints
#                         (e.g., scATAC-seq data and TF binding site motifs on DNA).
#                         Leave it empty if no base GRN is available.

# h_set                   The range of possible values for the "h" parameter in the Hill function.
 
# number_of_em_iterations The number of iterations in the expectation-maximization (EM) algorithm.
#                         The default value is 3.

# max_num_regulators      Maximum number of TFs in a logic gate that can regulate the target gene profile.
#                         The default value is 3.

# abs_cor                 This parameter varies in the (0, 1) interval and further removes edges with low absolute Pearson correlations between TFs and their targets.
#                         A (default) value of 0 indicates no filtration based on correlations.
  
# top_gates               The number of top Boolean logic gates to be reported for each target gene, based on Bayes Factor.
#                         The default value is 1.
  
# run_mode                Use "simple" for a faster algorithm run and "complex" for more precise results that take more time.
#                         The argument is relevant to the possible complexities in the hill function parameter space for regulatory TFs and target genes.
#                         The default value is "simple".

# weight_threshold        The output from scGATE presents the logic combination or partition that yields a certain percentage of the target gene,
#                         specifically when that percentage is above weight_threshold.
#                         The default value is 0.05.

# num_cores               Specify the number of parallel workers (adjust according to your system).
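
To give a feel for what the "h" parameter controls, the sketch below evaluates the standard activating Hill function x^h / (k^h + x^h) at several values of h. The functional form used inside scGATE is not shown in this README, so both the formula and the half-saturation constant k = 0.5 are assumptions made for illustration.

```r
# Standard activating Hill function. k (the half-saturation point) is an
# assumed illustrative parameter, not a documented scGATE argument.
hill <- function(x, h, k = 0.5) x^h / (k^h + x^h)

x <- seq(0.1, 0.9, by = 0.2)   # expression rescaled to the (0, 1) interval
h_values <- c(1.25, 2.25, 7)   # values of h that appear in this README

# One column of responses per value of h.
responses <- sapply(h_values, function(h) hill(x, h))
colnames(responses) <- paste0("h=", h_values)
rownames(responses) <- paste0("x=", x)

print(round(responses, 3))
```

Larger h gives a sharper, more switch-like response around k, which is why a wider h_set enlarges the parameter space that has to be searched.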

TF-Target Network Inference (edge mode)

To infer the TF-target network without logic gates in the output, you can use the scGATE_edge() function.

# Infer TF-target network without logic gates in the output
edges <- scGATE_edge(data = data, base_GRN = candidate_tf_target, h_act = NA, number_of_em_iterations = NA, max_num_regulators = NA, abs_cor = NA)
print(head(edges))

Parameter Descriptions

# data                    A gene expression matrix with normalized counts within the (0,1) interval,
#                         where samples are represented as rows and genes as columns.
#                         The gene expression matrix should have been preprocessed using the scRNA_seq_preprocessing() function.
  
# base_GRN                Base TF-gene interactions derived from external hints
#                         (e.g., scATAC-seq data and TF binding site motifs on DNA).
#                         Leave it empty if no base GRN is available.

# h_act                   Parameter of the Hill function.
#                         It is the Hill coefficient that represents the cooperativity or sigmoidicity of the TF regulatory response.
#                         The default value is 7.
  
# number_of_em_iterations The number of iterations in the expectation-maximization (EM) algorithm.
#                         The default value is 3.
 
# max_num_regulators      Maximum number of TFs in a logic gate that can regulate the target gene profile.
#                         The default value is 3.

# abs_cor                 This parameter varies in the (0, 1) interval and further removes edges with low absolute Pearson correlations between TFs and their targets.
#                         A (default) value of 0 indicates no filtration based on correlations.

# num_cores               Specify the number of parallel workers (adjust according to your system) 
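
The effect of the abs_cor filter can be sketched in base R: candidate TF-target edges whose absolute Pearson correlation falls below the threshold are dropped. The toy expression values, the gene names, and the use of ">=" at the threshold are assumptions for illustration; the exact comparison used inside scGATE is not documented here.

```r
# Deterministic toy expression profiles in (0, 1); names are hypothetical.
n <- 100
tf_a   <- (1:n) / (n + 1)          # increases steadily across cells
tf_b   <- rep(c(0.2, 0.8), n / 2)  # alternates, unrelated to the target
target <- tf_a^2                   # driven by tf_a only
expr <- cbind(tf_a, tf_b, target)

# Candidate edges, as they might come from a base GRN.
candidates <- data.frame(from = c("tf_a", "tf_b"), to = "target")

# Absolute Pearson correlation between each TF and its candidate target.
candidates$abs_r <- abs(mapply(function(f, t) cor(expr[, f], expr[, t]),
                               candidates$from, candidates$to))

abs_cor <- 0.3  # illustrative threshold
kept <- candidates[candidates$abs_r >= abs_cor, ]
print(kept)
```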

Example usage of scGATE

I. Context-specific network and logic gate inference in synthetic toggle switch

# 1. Please refer to the Jupyter notebook for instructions on how to perform Louvain clustering on the cells in the BoolODE simulated data.
# 2. Retrieve the data from Cluster I of cells, which was obtained in the previous step.
# Load scGATE package and data in example_data folder
 
rm(list = ls())
library(scGATE)
data <- as.matrix(read.csv("/example_data/ClusterI.csv")[ ,2:15])
print(head(data))
             gA       gB         gC        gC1        gC2         gD        gD1        gD2       gE        gE1      gE2         gF        gF1        gF2
[1,] 0.02764677 2.028944 0.01688577 0.01946526 0.02380772 0.01852824 0.02069895 0.02093184 1.932168 0.06889533 1.824497 0.04963150 0.05794413 0.04217521
[2,] 0.02643986 2.027956 0.01730882 0.01963009 0.02459190 0.01834800 0.01909300 0.02050692 1.965628 0.06075294 1.829349 0.04624483 0.04405836 0.02271849
[3,] 0.02593749 2.033592 0.01729984 0.01796269 0.02422372 0.01760521 0.01906345 0.02125864 1.976888 0.04514580 1.817391 0.03175314 0.03738465 0.02141319
[4,] 0.02595885 2.019971 0.01756862 0.01787157 0.02412755 0.01791644 0.02112435 0.02106185 1.980759 0.03720293 1.836962 0.02033092 0.03677651 0.01974638
[5,] 0.02629885 2.015461 0.01753645 0.01921252 0.02491328 0.01909465 0.02101008 0.02132906 1.986872 0.03738554 1.837226 0.01999704 0.03683252 0.01955996
[6,] 0.02640293 2.009388 0.01748322 0.02028304 0.02449585 0.01990073 0.02096272 0.02065111 1.982286 0.03776144 1.834406 0.02226978 0.03658167 0.01937109
# 3. data preprocessing 
# For the simulated data, library size normalization is not performed.
# The data are only rescaled with quantile normalization to fit within the (0,1) interval.
data <- scRNA_seq_preprocessing(data = data, library_size_normalization = "False")
# 4. Remove genes with low variability (scGATE operates on highly variable genes per context).
# This step is optional
data$n_counts <- data$n_counts[ , which(sqrt(apply(data$n_counts,2,var))> 0.20)]
# 5. Run scGATE_logic() function
# Please note that the likelihood values can be affected by the Louvain clustering results.
gates <- scGATE_logic(data = data, top_gates = 1, run_mode = "simple")
print(head(gates))
  gene_name -log10 L0 -log10 L1 log10 BF logic_gate
1        gE     173.9   -268.57   442.47        ~gF
2       gE1     51.85   -234.65   286.50    gE.~gE2
3       gE2     38.43   -235.48   273.91    gE.~gE1
4        gF    170.38   -278.57   448.95        ~gE
5       gF1     80.36   -215.32   295.68    gF.~gF2
6       gF2      67.6   -217.88   285.48    gF.~gF1
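
In the table printed above, the "log10 BF" column appears to equal the "-log10 L0" column minus the "-log10 L1" column, that is, log10 BF = log10 L1 - log10 L0. This reading is inferred from the printed numbers rather than from package documentation. The sketch below re-enters the values by hand (with simplified column names) and checks the arithmetic.

```r
# Values copied from the printed output above; column names are simplified.
gates <- data.frame(
  gene_name    = c("gE", "gE1", "gE2", "gF", "gF1", "gF2"),
  neg_log10_L0 = c(173.90, 51.85, 38.43, 170.38, 80.36, 67.60),
  neg_log10_L1 = c(-268.57, -234.65, -235.48, -278.57, -215.32, -217.88),
  log10_BF     = c(442.47, 286.50, 273.91, 448.95, 295.68, 285.48)
)

# Assumed relation: log10 BF = (-log10 L0) - (-log10 L1).
recomputed <- gates$neg_log10_L0 - gates$neg_log10_L1
consistent <- all(abs(recomputed - gates$log10_BF) < 0.005)
print(consistent)

# Rank target genes by the strength of evidence for their reported gate.
ranked <- gates[order(-gates$log10_BF), c("gene_name", "log10_BF")]
print(ranked)
```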

II. Context-specific network and logic gate inference in the mouse haematopoiesis scRNA-seq data

# 1. Please refer to the Jupyter notebook for instructions on how to perform Louvain clustering on the cells in the mouse haematopoiesis scRNA-seq dataset.
# 2. Retrieve the data from Megakaryocyte cells (Cluster 11).
# Load scGATE package and data in example_data folder

rm(list = ls())
library(scGATE)
data <- as.data.frame(read.csv("/example_data/subset_counts_cluster_11.csv" , header = TRUE))

# select genes involved in the MegE differentiation
gene_list <- c("Gata1", "Fli1", "Klf1", "Spi1", "Zfpm1", "Tal1", "Gata2")
data <- data[  , gene_list]
data <- na.omit(data)
print(head(data))
      Gata1      Fli1     Klf1      Spi1     Zfpm1      Tal1     Gata2
1 0.6931472 1.0986123 0.000000 0.6931472 0.0000000 0.6931472 0.0000000
2 0.0000000 1.3862944 0.000000 0.0000000 0.0000000 0.6931472 1.0986123
3 0.6931472 1.6094380 0.000000 0.0000000 0.0000000 0.0000000 0.6931472
4 0.0000000 0.0000000 1.098612 0.0000000 0.6931472 0.0000000 1.6094380
5 0.0000000 0.0000000 0.000000 0.0000000 0.6931472 0.6931472 1.3862944
6 0.0000000 0.6931472 0.000000 0.0000000 0.6931472 1.0986123 0.0000000

# Load base GRN
base_GRN <- read.csv("/example_data/base_grn_mouse_blood_cell_differentiation_toggle_switch.csv")
# 3. data preprocessing 
# The dataset underwent library size normalization in Jupyter Notebook. To fit the scRNA-seq data within the (0,1) interval, we applied quantile normalization as a technique to rescale the data.
data <- scRNA_seq_preprocessing(data = data, library_size_normalization = "False")
# 4. Run scGATE_logic() function
gates <- scGATE_logic(data = data, base_GRN = base_GRN, number_of_em_iterations = 10, top_gates = 1, run_mode = "complex")
print(head(gates))


# To effectively derive Boolean rules from extensive scRNA-seq datasets containing over 10 TFs,
# we recommend employing scGATE with the following configuration.
gates <- scGATE_logic(data = data, base_GRN = base_GRN, number_of_em_iterations = 10, top_gates = 50, run_mode = "simple")
print(gates)

# or you may use,
h_set <- c(1.25, 2.25)
gates <- scGATE_logic(data = data, base_GRN = base_GRN, h_set = h_set, number_of_em_iterations = 10, max_num_regulators = 2, top_gates = 50, run_mode = "complex")
print(gates)

III. Context-specific network inference in mouse tissue scRNA-seq datasets

# 1. Please refer to the Jupyter notebook for instructions on how to perform scATAC-seq analysis to derive the candidate TF lists (base GRNs) in *.parquet file format.
# 2. Load scGATE package and data (base GRN and scRNA-seq data and TF list) in example_data folder 

rm(list=ls())
library(scGATE)
library(arrow)

# Load base GRN derived from external hints
candidate_tf_target <- as.data.frame(read_parquet("/example_data/Cusanovich2018_Spleen_peak_base_GRN_dataframe.parquet"))
candidate_tf_target <- read_base_GRN(candidate_tf_target)

# Load scRNA-seq data
data <- as.data.frame(read.csv("/example_data/Tabula_Muris2018_Spleen-10X_P4_7_ExpressionData.csv" , header = TRUE))
gene_names <- data[ ,1]
data <- t(data[ ,2:ncol(data)])
colnames(data) <- gene_names

head(data[ , 1:10])
                   Batf Stat5b Ctcf H2-Eb1 AW112010 Ly6d Rplp0 Id2 Dok2 Gimap3
AAACCTGAGAAGGACA.1    0      0    0     18        0    0    10   0    0      0
AAACCTGAGCTAAGAT.1    0      0    1      0       19    0     5   1    1      1
AAACCTGCAACAACCT.1    0      0    0     22        0    5    12   0    0      2
AAACCTGCAGCCAATT.1    0      0    0     14        1    5    21   0    0      1
AAACCTGCAGCTCCGA.1    0      0    1     30        1    2    64   0    0      0
AAACCTGTCAGGTAAA.1    0      0    0     23        3    8    24   0    0      0


# Load TF list
# This step is optional
tf_names <- unlist(read.table("/example_data/Tabula_Muris2018_Spleen-10X_P4_7_tf_lists.txt"))
print(head(tf_names))
      V1       V2       V3 
  "Batf" "Stat5b"   "Ctcf"
# 3. scRNA-seq data preprocessing (library size normalization, quantile normalization technique to fit the scRNA-seq data within the (0,1) interval) 
data <- scRNA_seq_preprocessing(data = data, library_size_normalization = "True", tf_list = tf_names)
# 4. Run scGATE_edge() function
ranked_edge_list <- scGATE_edge(data = data, base_GRN = candidate_tf_target, h_act = 7)
print(head(ranked_edge_list))
    from    to BF_score
1   Ctcf Rps19 2002.419
2   Batf Rps19 2001.388
3 Stat5b Rplp0 1840.046
4   Ctcf Rplp0 1839.610
5   Ctcf Rpl36 1639.910
6   Ctcf Eif5a 1550.267
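
A ranked edge list like the one above is often post-processed before plotting or comparison. The base R sketch below rebuilds the six printed rows by hand, keeps the best-scoring regulator for each target, and groups regulators by target. It is one possible downstream step, not part of scGATE.

```r
# The six rows printed above, re-entered by hand for illustration.
edges <- data.frame(
  from     = c("Ctcf", "Batf", "Stat5b", "Ctcf", "Ctcf", "Ctcf"),
  to       = c("Rps19", "Rps19", "Rplp0", "Rplp0", "Rpl36", "Eif5a"),
  BF_score = c(2002.419, 2001.388, 1840.046, 1839.610, 1639.910, 1550.267)
)

# Sort by score, then keep the first (best) row seen for each target gene.
edges <- edges[order(-edges$BF_score), ]
best_per_target <- edges[!duplicated(edges$to), ]
print(best_per_target)

# Group all candidate regulators by target gene, in score order.
regulators <- split(edges$from, edges$to)
print(regulators)
```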

IV. Context-specific network inference in human haematopoiesis scRNA-seq dataset

# 1. Please refer to the Jupyter notebook for instructions on how to perform scATAC-seq analysis to derive the candidate TF lists (base GRNs) in *.parquet file format.
# 2. Load scGATE package and data (base GRN and scRNA-seq data and TF list) in example_data folder 

rm(list=ls())
library(scGATE)
library(arrow)  # provides read_parquet(), used below
# Load base GRN derived from external hints
candidate_tf_target <- as.data.frame(read_parquet("/example_data/Buenrostro2018_base_GRN_dataframe.parquet"))
candidate_tf_target <- read_base_GRN(candidate_tf_target)

# Load scRNA-seq data
data <- as.data.frame(read.csv("/example_data/Buenrostro2018_ExpressionData.csv" , header = TRUE))
gene_names <- data[ ,1]
data <- t(data[ ,2:ncol(data)])
colnames(data) <- gene_names

head(data[ , 1:10])
      IRF8 FOS MAFF SPI1 JUNB SPIB IRF7 TFDP1 GATA1 RAD21
hsc_1    0   2    0    0    2    0    0     0     0     1
hsc_2    0   6    7    0    3    0    0     0     0     1
hsc_3    0   2    0    0    5    0    0     0     0     2
hsc_4    0   6    0    0    1    0    0     1     0     1
hsc_5    0   1    5    2    1    0    0     0     0     0
hsc_6    0   3    0    0    1    0    0     0     0     0

# Load TF list
# This step is optional
tf_names <- unlist(read.table("/example_data/Buenrostro2018_tf_lists.txt"))
print(head(tf_names))
    V1     V2     V3     V4     V5     V6 
"IRF8"  "FOS" "MAFF" "SPI1" "JUNB" "SPIB" 
# 3. scRNA-seq data preprocessing (library size normalization, quantile normalization technique to fit the scRNA-seq data within the (0,1) interval)
data <- scRNA_seq_preprocessing(data = data, library_size_normalization = "True", tf_list = tf_names)
# 4. Run scGATE_edge() function
ranked_edge_list <- scGATE_edge(data = data, base_GRN = candidate_tf_target, h_act = 7)
print(head(ranked_edge_list))
     from     to BF_score
1    E2F1 MALAT1 13094.06
2 BHLHE40 MALAT1 13094.05
3   TFDP1 MALAT1 13093.97
4    NFE2 MALAT1 13092.80
5    IRF8 MALAT1 13091.92
6    E2F1   PTMA 11204.92

Datasets

Raw scRNA-seq data for cell-type specific logic gate inference in the mouse haematopoiesis dataset (Fig.3) is also available at Zenodo https://doi.org/10.5281/zenodo.8353409.
The base GRNs reconstructed with scATAC-seq and TF binding site motif, in mouse tissue and human haematopoiesis datasets, together with other intermediate and processed files are available at Zenodo https://doi.org/10.5281/zenodo.8353409.

