EssSubgraph: An inductive representation learning method that integrates graph-structured network data with omics features
EssSubgraph is a predictive framework designed to identify essential genes in mammals by integrating gene expression data with large-scale biological networks. The core idea is to extract subgraphs related to gene essentiality from multi-layer interaction networks and apply graph neural networks to learn informative representations for prediction.
The following depicts a broad overview of the EssSubgraph method.
The code is written in Python 3 and was mainly tested on Python 3.8 on Linux, but it should run on any OS that supports Python and pip. Training is faster on a GPU.
EssSubgraph has the following dependencies:
- numpy
- pandas
- torch
- networkx
- scipy
- seaborn
- scikit-learn
- torch-geometric
Create and activate a conda environment:
conda create --name py38 -c conda-forge python=3.8
conda activate py38
Dependencies can be installed using the following command:
pip install -r requirements.txt
EssSubgraph was tested with 7 different protein-protein interaction (PPI) networks, namely:
The network was constructed using the tutorial from Network Evaluation Tools.
The gene expression data (TCGA RNA-Seq normalized RSEM data) was obtained from Albino Bacolla.
python generate_pca.py
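
The feature file name used later (`cancer_full_expression_pc50.csv`) suggests that this step reduces the expression matrix to 50 principal components per gene. The following is a minimal sketch of that kind of reduction, not the contents of `generate_pca.py`: the function name `pca_reduce`, the genes-as-rows layout, and the synthetic input are all assumptions made for illustration. It uses plain NumPy; `sklearn.decomposition.PCA` from the listed scikit-learn dependency would serve the same purpose.

```python
import numpy as np


def pca_reduce(expression, n_components=50):
    """Project rows (assumed: genes) onto the top principal components.

    Hypothetical helper for illustration; not part of EssSubgraph.
    """
    X = np.asarray(expression, dtype=float)
    # Center each column (assumed: one column per sample).
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Economy-size SVD: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    k = min(n_components, S.shape[0])
    # Principal component scores for every row.
    scores = U[:, :k] * S[:k]
    # Fraction of total variance captured by each kept component.
    explained = (S[:k] ** 2) / np.sum(S ** 2)
    return scores, explained


# Synthetic stand-in for an expression matrix: 200 genes x 120 samples.
rng = np.random.default_rng(0)
toy_expression = rng.normal(size=(200, 120))

features, explained_ratio = pca_reduce(toy_expression, n_components=50)
print(features.shape)  # (200, 50)
```

Each row of `features` would then act as the node feature vector for one gene.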
python build_dataset_container.py \
--network ./data/string_net.txt \
--essential ./data/Essential_genes \
--nonessential ./data/Non_essential_genes \
--features ./data/cancer_full_expression_pc50.csv \
--output esssubgraph_human_pc50_string.pkl
The arguments are described below:
| Parameter name | Description |
|---|---|
| --network | Path to the network file (e.g., /path/to/string_net.txt). Specifies the gene interaction network to process. |
| --essential | Path to the essential genes file (e.g., ../data/Essential_genes). Lists genes critical for cell survival. |
| --nonessential | Path to the non-essential genes file (e.g., ../data/Non_essential_genes). Lists non-critical genes. |
| --features | Path to the gene feature CSV file. Contains node feature data (e.g., gene expression PC50 features). |
| --output | Output pickle file name for the PyTorch Geometric dataset. The network name is appended (e.g., esssubgraph_human_pc50_string.pkl). |
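
To make the roles of `--network`, `--essential`, and `--nonessential` more concrete, here is a rough sketch of how such inputs could be combined into an indexed edge list with binary node labels. It is not the code in `build_dataset_container.py`. The file formats (a whitespace-separated edge list and one gene per line), the helper names, and the placeholder gene names `A` to `D` are assumptions for illustration only.

```python
import io


def read_gene_list(handle):
    """Read one gene identifier per line, skipping blank lines (assumed format)."""
    return {line.strip() for line in handle if line.strip()}


def read_edges(handle):
    """Read a whitespace-separated edge list, using the first two columns (assumed format)."""
    edges = []
    for line in handle:
        parts = line.split()
        if len(parts) >= 2:
            edges.append((parts[0], parts[1]))
    return edges


def label_network_genes(edges, essential, nonessential):
    """Index the network nodes and label those with an unambiguous class."""
    nodes = sorted({gene for edge in edges for gene in edge})
    index = {gene: i for i, gene in enumerate(nodes)}
    labels = {}
    for gene in nodes:
        if gene in essential and gene not in nonessential:
            labels[gene] = 1  # essential
        elif gene in nonessential and gene not in essential:
            labels[gene] = 0  # non-essential
        # Genes in neither list (or in both) stay unlabeled.
    edge_index = [(index[a], index[b]) for a, b in edges]
    return index, edge_index, labels


# Toy in-memory inputs standing in for the three files.
network_file = io.StringIO("A B\nB C\nC D\n")
essential_file = io.StringIO("A\nC\n")
nonessential_file = io.StringIO("D\n")

edges = read_edges(network_file)
index, edge_index, labels = label_network_genes(
    edges, read_gene_list(essential_file), read_gene_list(nonessential_file)
)
print(edge_index)  # [(0, 1), (1, 2), (2, 3)]
print(labels)      # {'A': 1, 'C': 1, 'D': 0}
```

In this toy example gene `B` stays in the graph as an unlabeled node, so it can still contribute network context even though it has no class label.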
python EssSubgraph.py --epochs 200 --device 0 --dataset ./data/esssubgraph_human_pc50_string.pkl
The arguments are described below:
| Parameter name | Description of parameter |
|---|---|
| --dataset | The path of the input pkl file |
| --epochs | Number of epochs to train the model (defaults to 200) |
| --device | GPU device ID (defaults to 0) |
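
For readers who want to see how the documented flags and defaults fit together, the sketch below mirrors them with the standard-library `argparse` module. It is an illustrative parser only; the actual argument handling in `EssSubgraph.py` may differ, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented flags (not the project's own code)."""
    parser = argparse.ArgumentParser(description="Train EssSubgraph (illustrative sketch)")
    parser.add_argument("--dataset", required=True,
                        help="Path of the input pkl file")
    parser.add_argument("--epochs", type=int, default=200,
                        help="Number of epochs to train the model")
    parser.add_argument("--device", type=int, default=0,
                        help="GPU device ID")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--dataset", "./data/esssubgraph_human_pc50_string.pkl"]
)
print(args.epochs, args.device)  # 200 0
```

Omitting `--epochs` and `--device` falls back to the documented defaults of 200 and 0.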
To ensure reproducibility, build and run the project with Docker:
# Build the Docker image
docker build -t esssubgraph .
# Run the Docker container, mounting the current directory at /app
docker run -it -v "$(pwd)":/app esssubgraph
To reproduce performance comparisons with other models, scripts under /baseline can be used.
GNU General Public License v3.0 (see LICENSE).
If you have any questions, feel free to contact me by email (dal462929@utdallas.edu) or through GitHub issues. Pull requests are highly welcome!
