Deep functional residue identification.
This repository contains release v1.1, which includes the original deepFRI LSTM-GCN architecture retrained on AlphaFold models with UniProt Gene Ontology annotations. Note that the CNN model has not been retrained.
The upgrade provides much greater coverage of GO-terms than deepFRI v1.0 but requires protein structures as input. If you need similar performance on protein sequence data, consider using Metagenomic-DeepFRI with the deepFRI v1.1 pre-trained models for prediction.
In contrast to the original deepFRI (v1.0), release v1.1 has not been trained to predict Enzyme Commission (EC) numbers.
Number of predicted GO-terms in a given information content range for deepFRI v1.0, v1.1, and v1.1 without GO-term propagation.
@article {Gligorijevic2019,
author = {Gligorijevic, Vladimir and Renfrew, P. Douglas and Kosciolek, Tomasz and Leman,
Julia Koehler and Cho, Kyunghyun and Vatanen, Tommi and Berenberg, Daniel
and Taylor, Bryn and Fisk, Ian M. and Xavier, Ramnik J. and Knight, Rob and Bonneau, Richard},
title = {Structure-Based Function Prediction using Graph Convolutional Networks},
year = {2019},
doi = {10.1101/786236},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2019/10/04/786236},
journal = {bioRxiv}
}
@article {Szczerbiak2024.08.14.607935,
author = {Szczerbiak, Pawe{\l} and Szydlowski, Lukasz and Wydma{\'n}ski, Witold and Douglas Renfrew, P. and Leman, Julia Koehler and Kosciolek, Tomasz},
title = {Large protein databases reveal structural complementarity and functional locality},
elocation-id = {2024.08.14.607935},
year = {2024},
doi = {10.1101/2024.08.14.607935},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/08/17/2024.08.14.607935},
eprint = {https://www.biorxiv.org/content/early/2024/08/17/2024.08.14.607935.full.pdf},
journal = {bioRxiv}
}
The easiest way to install deepFRI is to use Conda:
conda create -n deepfri_env --file requirements.txt
Please check out the requirements.txt and comment out unnecessary packages (see details in the file).
For quick predictions (at the cost of a drop in accuracy of up to a few percent), you can run deepFRI v1.1 on CPU using predict_fast.py with the following options:
1. --input or -i, str, path to PDB file
2. --input_file or -f, str, path to folder with PDB files
3. --seq or -s, str, input sequence
4. --fasta_fn, str, path to fasta file with input sequences
5. --output_folder or -o, str, path to output folder (each prediction is saved to a separate output file as PDBNAME_MODELNAME.csv)
6. --verbose or -v, whether to print prediction on the screen
7. --models or -m, str, models to use (mf - Molecular Function, bp - Biological Process, cc - Cellular Component)
8. --propagate or -p, whether to propagate output predictions using GO graph
NOTE 1: options 1., 2., 3., 4. are mutually exclusive.
NOTE 2: if a sequence is provided as input, ESMFold API is used to predict structure.
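The constraint in NOTE 1 can be checked before predict_fast.py is launched. The following is a minimal sketch, not part of deepFRI: `build_predict_fast_command` is a hypothetical helper, the `results` output folder is a placeholder, and passing several models as space-separated values after `--models` is an assumption about how the script parses that option. Only flags listed above are used.

```python
import sys

# The four input options from the list above; NOTE 1 says only one may be used.
INPUT_FLAGS = ("--input", "--input_file", "--seq", "--fasta_fn")


def build_predict_fast_command(input_flag, input_value, output_folder,
                               models=("mf",), propagate=False, verbose=False):
    """Assemble an argument list for predict_fast.py (hypothetical helper)."""
    if input_flag not in INPUT_FLAGS:
        raise ValueError(f"input_flag must be one of {INPUT_FLAGS}")
    cmd = [sys.executable, "predict_fast.py", input_flag, str(input_value),
           "--output_folder", str(output_folder)]
    # Assumption: multiple models are given as separate values after --models.
    cmd += ["--models", *models]
    if propagate:
        cmd.append("--propagate")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_predict_fast_command(
    "--input", "examples/pdb_files/1S3P-A.pdb", "results",
    models=("mf", "bp"), propagate=True,
)
print(" ".join(cmd[1:]))
# To run the prediction: subprocess.run(cmd, check=True)
```

Because the helper accepts a single input flag, two input options cannot be combined by accident. The command is returned as a list so it can be handed to `subprocess.run` without shell quoting.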
deepFRI v1.1 has been trained on a GPU, and the resulting models require a GPU for inference. If you don't have a GPU, please go to the CPU section.
To predict protein functions, use the predict.py script with the following options:
seq (str): Protein sequence as a string
cmap (str): Name of a file storing a protein contact map and sequence in *.npz file format (with the following numpy array variables: C_alpha, seqres. See examples/pdb_cmaps/)
pdb (str): Name of a PDB file (cleaned)
pdb_dir (str): Directory with cleaned PDB files (see examples/pdb_files/)
cmap_csv (str): Filename of the catalogue (in *.csv file format) containing the mapping between protein names and the directory with *.npz files (see examples/catalogue_pdb_chains.csv)
fasta_fn (str): Fasta filename (see examples/pdb_chains.fasta)
model_config (str): JSON file with model filenames (see trained_models/)
ont (str): Ontology (mf - Molecular Function, bp - Biological Process, cc - Cellular Component)
output_fn_prefix (str): Output filename (the same prefix will be used for predictions and saliency files)
verbose (bool): Whether or not to print function prediction results
saliency (bool): Whether or not to compute class activation maps (outputs a *.json file)
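To illustrate what a contact map encodes, the sketch below derives one from C-alpha coordinates in plain Python. It rests on assumptions that the text above does not confirm: that C_alpha holds pairwise C-alpha distances in ångströms and that residues closer than 10 Å count as being in contact. The coordinates are a hypothetical toy chain, not real data, and writing the result to an *.npz file (for example with numpy.savez) is not shown.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha coordinates."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def contact_map(distances, threshold=10.0):
    """Binary map: 1 where two residues are closer than `threshold`, else 0."""
    return [[1 if d < threshold else 0 for d in row] for row in distances]


# Hypothetical toy chain: four C-alpha atoms on a straight line (not real data).
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (15.2, 0.0, 0.0)]
seqres = "SMTD"  # first four residues of the 1S3P sequence used in the examples

distances = distance_matrix(ca_coords)   # assumed content of C_alpha
contacts = contact_map(distances)        # assumed 10 Å cutoff
for residue, row in zip(seqres, contacts):
    print(residue, row)
```

The distance matrix and the sequence must have the same length, one entry per residue, which is why the two are stored together in a single file.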
Generated files (see examples/outputs/):
output_fn_prefix_MF_predictions.csv: Predictions in the *.csv file format with columns: Protein, GO-term/EC-number, Score, GO-term/EC-number name
output_fn_prefix_MF_pred_scores.json: Predictions in the *.json file with keys: pdb_chains, Y_hat, goterms, gonames
output_fn_prefix_MF_saliency_maps.json: JSON file storing a dictionary of saliency maps for each predicted function of every protein
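The pred_scores JSON file can be post-processed with the standard library. The sketch below assumes that Y_hat is a list of rows, one per entry in pdb_chains, with one score per entry in goterms; that layout is an assumption, as only the key names are documented. `confident_terms` and the 0.5 cutoff are hypothetical, and the in-memory stand-in reuses two scores from the examples in this README, with 0.0 as a placeholder for scores that are not shown.

```python
import json


def confident_terms(scores, threshold=0.5):
    """Return (protein, GO-term, name, score) tuples with score >= threshold."""
    rows = []
    for protein, y_row in zip(scores["pdb_chains"], scores["Y_hat"]):
        for term, name, score in zip(scores["goterms"], scores["gonames"], y_row):
            if score >= threshold:
                rows.append((protein, term, name, score))
    # Group by protein, highest score first within each protein.
    return sorted(rows, key=lambda r: (r[0], -r[3]))


# Stand-in for the file content; 0.0 entries are placeholders, not real scores.
raw = json.dumps({
    "pdb_chains": ["1S3P-A", "2PE5-B"],
    "goterms": ["GO:0005509", "GO:0003677"],
    "gonames": ["calcium ion binding", "DNA binding"],
    "Y_hat": [[0.99769, 0.0], [0.0, 0.53502]],
})
scores = json.loads(raw)
# With a real file: scores = json.load(open("output_fn_prefix_MF_pred_scores.json"))

for protein, term, name, score in confident_terms(scores):
    print(protein, term, f"{score:.5f}", name)
```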
deepFRI offers 6 possible options for predicting functions. See examples below.
Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence and contact map (PDB: 1S3P):
>> python predict.py --cmap ./examples/pdb_cmaps/1S3P-A.npz -ont mf --verbose
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding
Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence (PDB: 1S3P):
>> python predict.py --seq 'SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKDGFIDEDELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES' -ont mf --verbose
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99769 calcium ion binding
>> python predict.py --fasta_fn examples/pdb_chains.fasta -ont mf -v
Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99769 calcium ion binding
2J9H-A GO:0004364 0.46937 glutathione transferase activity
2J9H-A GO:0016765 0.19910 transferase activity, transferring alkyl or aryl (other than methyl) groups
2J9H-A GO:0097367 0.10537 carbohydrate derivative binding
2PE5-B GO:0003677 0.53502 DNA binding
2W83-E GO:0032550 0.99260 purine ribonucleoside binding
2W83-E GO:0001883 0.99242 purine nucleoside binding
2W83-E GO:0005525 0.99231 GTP binding
2W83-E GO:0019001 0.99222 guanyl nucleotide binding
2W83-E GO:0032561 0.99194 guanyl ribonucleotide binding
2W83-E GO:0032549 0.99149 ribonucleoside binding
2W83-E GO:0001882 0.99135 nucleoside binding
2W83-E GO:0017076 0.98687 purine nucleotide binding
2W83-E GO:0032555 0.98641 purine ribonucleotide binding
2W83-E GO:0035639 0.98611 purine ribonucleoside triphosphate binding
2W83-E GO:0032553 0.98573 ribonucleotide binding
2W83-E GO:0097367 0.98168 carbohydrate derivative binding
2W83-E GO:0003924 0.52355 GTPase activity
2W83-E GO:0016817 0.36863 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016818 0.36683 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0017111 0.35465 nucleoside-triphosphatase activity
2W83-E GO:0016462 0.35303 pyrophosphatase activity
>> python predict.py --cmap_csv examples/catalogue_pdb_chains.csv -ont mf -v
Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99824 calcium ion binding
2J9H-A GO:0004364 0.84826 glutathione transferase activity
2J9H-A GO:0016765 0.82014 transferase activity, transferring alkyl or aryl (other than methyl) groups
2PE5-B GO:0003677 0.89086 DNA binding
2PE5-B GO:0017111 0.12892 nucleoside-triphosphatase activity
2PE5-B GO:0004386 0.12847 helicase activity
2PE5-B GO:0032553 0.12091 ribonucleotide binding
2PE5-B GO:0097367 0.11961 carbohydrate derivative binding
2PE5-B GO:0016887 0.11331 ATPase activity
2W83-E GO:0097367 0.97069 carbohydrate derivative binding
2W83-E GO:0019001 0.96842 guanyl nucleotide binding
2W83-E GO:0017076 0.96737 purine nucleotide binding
2W83-E GO:0001882 0.96473 nucleoside binding
2W83-E GO:0035639 0.96439 purine ribonucleoside triphosphate binding
2W83-E GO:0032555 0.96294 purine ribonucleotide binding
2W83-E GO:0016818 0.96181 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0032550 0.96142 purine ribonucleoside binding
2W83-E GO:0016817 0.96082 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016462 0.95998 pyrophosphatase activity
2W83-E GO:0032553 0.95935 ribonucleotide binding
2W83-E GO:0032561 0.95930 guanyl ribonucleotide binding
2W83-E GO:0032549 0.95877 ribonucleoside binding
2W83-E GO:0003924 0.95453 GTPase activity
2W83-E GO:0001883 0.95271 purine nucleoside binding
2W83-E GO:0005525 0.94635 GTP binding
2W83-E GO:0017111 0.93942 nucleoside-triphosphatase activity
2W83-E GO:0044877 0.64519 protein-containing complex binding
2W83-E GO:0001664 0.31413 G protein-coupled receptor binding
2W83-E GO:0005102 0.20078 signaling receptor binding
>> python predict.py -pdb ./examples/pdb_files/1S3P-A.pdb -ont mf -v
Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding
>> python predict.py --pdb_dir ./examples/pdb_files -ont mf --saliency --use_backprop
See files in: examples/outputs/
The training and validation data used for the deepFRI v1.1 model are provided as TensorFlow-specific TFRecord files and require a few TB of disk space. If you wish to get access to these files, please write to: p.szczerbiak@sanoscience.org.
Pretrained CPU models can be downloaded from figshare. Use these models if you run deepFRI v1.1 on CPU with predict_fast.py.
Uncompress the contents of deepFRI_v11_trained_models_CPU.zip into the deepFRI directory with model weights, i.e. /path/to/DeepFRI/trained_models.
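The extraction step can also be scripted. This is a minimal sketch using Python's standard zipfile module. It assumes the archive has been downloaded into the DeepFRI directory and that the script is run from there; `extract_models` is a hypothetical helper name. Whether the archive contains its own top-level folder is not stated here, so check the printed file list after extraction.

```python
import zipfile
from pathlib import Path


def extract_models(archive, target_dir):
    """Extract every file from `archive` into `target_dir`; return the names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create trained_models if missing
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()


# Assumption: the downloaded archive sits in the current (DeepFRI) directory.
archive = Path("deepFRI_v11_trained_models_CPU.zip")
if archive.exists():
    for name in extract_models(archive, "trained_models"):
        print(name)
else:
    print(f"{archive} not found; download it first")
```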
