(estimated time: X, required disk space: X, required memory:) Follow the instructions in the StructMAn repository.
(estimated time: X, required disk space: 1 TB, required memory: 800 GB) Follow the instructions in the StructGuy repository.
git clone https://github.com/kalininalab/StructGuy_evaluation.git
This will create the StructGuy_evaluation directory in the current directory.
First, clone our fork of the MaveTools package:
git clone https://github.com/AlexanderGress/MaveTools.git
This will create a directory called MaveTools in the current directory. With the StructMAn/StructGuy conda environment active (conda activate [environment name], e.g. structman), install the package with pip:
pip install MaveTools/
Go into the StructGuy_evaluation/training_dataset_generation/ directory and call:
python make_gold_standard.py
(estimated time: X, required disk space: X)
This script downloads the MaveDB and ProteinGym substitutions datasets and stores them in StructGuy_evaluation/datasets/. It then uses those two resources to generate the scaled and filtered gold-standard dataset for training supervised machine learning methods.
Note
To be further utilized by StructGuy, the dataset needs to be featurized.
Whether a dataset is used for training or for prediction, a feature table has to be calculated for it. The first step is the calculation of structural features with the StructMAn annotation pipeline. For this, the dataset needs to be prepared so that StructMAn can process it, which is explained in this tutorial.
Tip
We provide a toy dataset in StructGuy_evaluation/datasets/Toy_example/toy_example.fasta. It is in a format ready to be processed by StructMAn. It contains the MAVE data for five proteins, the minimum number of proteins required for the five-fold cross-validation in the hyperparameter optimization.
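The five-protein minimum suggests that the cross-validation folds are built from whole proteins, so that each fold holds out at least one protein. The following is a minimal sketch of such a protein-wise split, not StructGuy's actual splitting code. The function name `assign_protein_folds`, the round-robin assignment, and the protein identifiers are assumptions made for illustration.

```python
def assign_protein_folds(protein_ids, n_folds=5):
    """Distribute whole proteins over n_folds folds in round-robin order.

    Hypothetical helper: it only illustrates why at least n_folds
    proteins are needed when every fold must hold out one protein.
    """
    proteins = sorted(set(protein_ids))
    if len(proteins) < n_folds:
        raise ValueError(
            f"need at least {n_folds} proteins, got {len(proteins)}"
        )
    folds = [[] for _ in range(n_folds)]
    for index, protein in enumerate(proteins):
        folds[index % n_folds].append(protein)
    return folds


# Placeholder identifiers standing in for the five toy proteins.
folds = assign_protein_folds(["P1", "P2", "P3", "P4", "P5"])
print(folds)
```

With exactly five proteins, every fold contains a single protein; with fewer, the function raises an error instead of producing an empty fold.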
structman -i [path to dataset] -n [number of threads]
-i  Path to a StructMAn-readable dataset file.
-n  Provides the maximal number of threads that should be used.
Tip
For processing the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:
structman -i toy_example.fasta
StructMAn generates a config file named [name of dataset].structguy_project.conf in the corresponding output directory.
This file is required for all subsequent StructGuy calls.
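In the examples on this page, the config file always ends up at Output/[name of dataset]/[name of dataset].structguy_project.conf, relative to the directory that contains the dataset file. A small sketch that derives this path is shown here. The helper name `project_conf_path` is hypothetical, and the layout is inferred from the examples rather than taken from the StructMAn documentation, so check the actual output directory if your setup differs.

```python
from pathlib import Path


def project_conf_path(dataset_file, output_dir_name="Output"):
    """Guess the location of the structguy_project.conf for a dataset.

    Assumption: StructMAn writes to <dataset dir>/Output/<name>/, as in
    the toy example and the ProteinGym examples on this page.
    """
    dataset = Path(dataset_file)
    name = dataset.stem  # "toy_example.fasta" -> "toy_example"
    return dataset.parent / output_dir_name / name / f"{name}.structguy_project.conf"


print(project_conf_path("toy_example.fasta"))
```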
structguy generate_features -i [path to structguy_project.conf] -n [number of threads]
-i  Path to the structguy_project.conf file that was produced by StructMAn.
-n  Provides the maximal number of threads that should be used.
Tip
For processing the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:
structguy generate_features -i Output/toy_example/toy_example.structguy_project.conf
With and without hyperparameter optimization
Tip
The easiest way to use StructGuy is to download the model we trained in (add_link_to_publication_later) from Hugging Face.
structguy build_model -i [path to name_of_dataset.structguy_project.conf] --nocv --nohpo --hp [path to a hyperparameter list] -n [number of threads]
-i  Path to the structguy_project.conf file that was produced by StructMAn.
--nocv  Skips all cross-validation setups and trains directly on the full dataset.
--nohpo  Skips the hyperparameter optimization.
--hp  Path to a file with a list of hyperparameters. Optional; if not given, default parameters are used. An example can be found here: StructGuy_evaluation/configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf
-n  Provides the maximal number of threads that should be used.
Tip
For training a model on the toy example with the original set of hyperparameters, go to StructGuy_evaluation/datasets/Toy_example/ and call:
structguy build_model -i Output/toy_example/toy_example.structguy_project.conf --nocv --nohpo --hp ../../configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf
Warning
Training with hyperparameter optimization consumes large amounts of computing resources and time.
structguy build_model -i [path to name_of_dataset.structguy_project.conf] --hp [path to a hyperparameter list] -n [number of threads]
-i  Path to the structguy_project.conf file that was produced by StructMAn.
--hp  Path to a file with a list of hyperparameters that are used as the starting point for the hyperparameter optimization. Optional; if not given, default parameters are used. An example can be found here: StructGuy_evaluation/configs_and_parameters/structguy_hyperparameters_for_goldstandard.conf
-n  Provides the maximal number of threads that should be used.
Tip
For optimizing the hyperparameters and training a model on the toy example, go to StructGuy_evaluation/datasets/Toy_example/ and call:
structguy build_model -i Output/toy_example/toy_example.structguy_project.conf
structguy predict -i [path to name_of_dataset.structguy_project.conf] -m [path to model.dump file] -n [number of threads]
-i  Path to the structguy_project.conf file that was produced by StructMAn.
-m  Path to an already trained model, either generated by structguy build_model or downloaded from Hugging Face.
-n  Provides the maximal number of threads that should be used.
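The four commands described so far (structman, structguy generate_features, structguy build_model, structguy predict) can be chained from a small driver script. The sketch below only uses the options documented on this page. The function names are hypothetical, and the model path is kept as a placeholder because this page does not state where build_model writes the model. By default it only prints the commands (dry run), and for brevity it uses one dataset for every step, whereas in practice prediction usually targets a different dataset than training.

```python
import shlex
import subprocess


def with_threads(command, threads):
    """Append the -n option only when a thread count is given."""
    if threads is None:
        return command
    return command + ["-n", str(threads)]


def pipeline_commands(dataset, conf, model, hyperparameters=None, threads=None):
    """Build the command lines for annotation, features, training, prediction."""
    build = ["structguy", "build_model", "-i", conf, "--nocv", "--nohpo"]
    if hyperparameters is not None:
        build += ["--hp", hyperparameters]
    steps = [
        ["structman", "-i", dataset],
        ["structguy", "generate_features", "-i", conf],
        build,
        ["structguy", "predict", "-i", conf, "-m", model],
    ]
    return [with_threads(step, threads) for step in steps]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failure


commands = pipeline_commands(
    dataset="toy_example.fasta",
    conf="Output/toy_example/toy_example.structguy_project.conf",
    model="[path to model.dump file]",  # placeholder, replace with a real path
    threads=4,
)
run_pipeline(commands)  # dry run: nothing is executed
```

Calling `run_pipeline(commands, dry_run=False)` from StructGuy_evaluation/datasets/Toy_example/ would execute the steps in order, provided the placeholder model path has been replaced.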
Note
This section describes the preparations necessary for the "Comparison to unsupervised model from the ProteinGym benchmark" evaluation from the paper.
This step is computationally expensive and can be omitted; we provide the predictions in StructGuy_evaluation/benchmarks/proteingym_substitutions_predicted_by_structguy.tsv.gz.
Go to the StructGuy_evaluation/evaluations/ directory and call:
python prepare_proteingym_for_structguy.py
This downloads the ProteinGym substitutions dataset and writes the StructMAn-readable input file to StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/proteingym_substitutions.fasta.
Go to the StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/ directory and call:
structman -i proteingym_substitutions.fasta
This will generate an Output directory containing the StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf file.
Call the StructGuy feature generation pipeline:
structguy generate_features -i Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf
Now the dataset is fully featurized and ready for the prediction process of StructGuy.
Call the StructGuy prediction pipeline:
structguy predict -i Output/proteingym_substitutions/proteingym_substitutions.structguy_project.conf -m [path to model.dump file]
Note
This section describes the preparations necessary for the "Application to ProteinGym clinical substitutions" evaluation from the paper.
This step is computationally expensive and can be omitted; we provide the predictions in StructGuy_evaluation/benchmarks/proteingym_clinical_substitutions_predicted_by_structguy.tsv.gz.
Go to the StructGuy_evaluation/evaluations/ directory and call:
python prepare_clinvar_for_structguy.py
This downloads the ProteinGym clinical substitutions dataset and writes the StructMAn-readable input file to StructGuy_evaluation/datasets/ProteinGym_ClinVar/pg_clinvar.fasta.
Go to the StructGuy_evaluation/datasets/ProteinGym_ClinVar/ directory and call:
structman -i pg_clinvar.fasta
This will generate an Output directory containing the StructGuy_evaluation/datasets/ProteinGym_ClinVar/Output/pg_clinvar/pg_clinvar.structguy_project.conf file.
Call the StructGuy feature generation pipeline:
structguy generate_features -i Output/pg_clinvar/pg_clinvar.structguy_project.conf
Now the dataset is fully featurized and ready for the prediction process of StructGuy.
Call the StructGuy prediction pipeline:
structguy predict -i Output/pg_clinvar/pg_clinvar.structguy_project.conf -m [path to model.dump file]
Note
This section corresponds to the "Comparison to unsupervised model from the ProteinGym benchmark" evaluation from the paper.
Perform the steps explained in the Prediction section
Note
If successfully applied, this step generates the StructGuy_evaluation/datasets/proteingym_substitutions_for_structguy/Output/proteingym_substitutions/predictions.tsv file, which is used as the basis for the evaluation.
If this file is not present, StructGuy_evaluation/benchmarks/proteingym_substitutions_predicted_by_structguy.tsv.gz will be used automatically.
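The fallback described in this note amounts to preferring your own predictions file and using the provided archive otherwise. The following is a minimal sketch of that selection logic, not the code of the evaluation script itself. The helper names `choose_predictions_file` and `read_tsv` are hypothetical, and no assumption is made about the column layout of the tables.

```python
import csv
import gzip
from pathlib import Path


def choose_predictions_file(own_predictions, provided_predictions):
    """Return the self-generated file if it exists, else the provided one."""
    own = Path(own_predictions)
    return own if own.is_file() else Path(provided_predictions)


def read_tsv(path):
    """Read a tab-separated file into a list of rows, gzipped or plain."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Paths relative to the StructGuy_evaluation directory, as given above.
chosen = choose_predictions_file(
    "datasets/proteingym_substitutions_for_structguy/Output/"
    "proteingym_substitutions/predictions.tsv",
    "benchmarks/proteingym_substitutions_predicted_by_structguy.tsv.gz",
)
print(chosen)
```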
Go to the StructGuy_evaluation/evaluations/ directory and call:
python generalization_vs_unsupervised_benchmark.py
This will generate the generalization_vs_unsupervised.tsv and generalization_vs_unsupervised_old_pg.tsv results tables.
Note
This section corresponds to the "Application to ProteinGym clinical substitutions" evaluation from the paper.
Perform the steps explained in the Prediction section
Note
If successfully applied, this step generates the StructGuy_evaluation/datasets/ProteinGym_ClinVar/Output/pg_clinvar/predictions.tsv file, which is used as the basis for the evaluation.
If this file is not present, StructGuy_evaluation/benchmarks/proteingym_clinical_substitutions_predicted_by_structguy.tsv.gz will be used automatically.
Go to the StructGuy_evaluation/evaluations/ directory and call:
python evaluate_proteingym_clinvar.py
This will print the average protein-wise AUC value of the StructGuy predictions on the ProteinGym clinical substitutions dataset to the terminal.
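To make the metric concrete: a protein-wise average AUC is obtained by computing one ROC AUC per protein and then averaging those values, so that proteins with many variants do not dominate the result. The sketch below illustrates this with a plain pairwise AUC. It is not the code of evaluate_proteingym_clinvar.py, and the input layout (protein, label, score), the label encoding (1 for the positive class) and the score orientation (higher means more likely positive) are assumptions.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. Returns None when one class is
    missing, because the AUC is undefined in that case.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        return None
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def mean_proteinwise_auc(rows):
    """Average the per-protein AUCs over all proteins with both classes."""
    labels = defaultdict(list)
    scores = defaultdict(list)
    for protein, label, score in rows:
        labels[protein].append(label)
        scores[protein].append(score)
    aucs = [roc_auc(labels[p], scores[p]) for p in labels]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


# Made-up example values: (protein, label, score).
rows = [
    ("P1", 1, 0.9), ("P1", 0, 0.1),
    ("P2", 1, 0.2), ("P2", 0, 0.8), ("P2", 0, 0.1),
]
print(mean_proteinwise_auc(rows))  # P1: 1.0, P2: 0.5 -> 0.75
```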
Note
This section corresponds to the "Comparison to supervised models from the ProteinGym benchmark" evaluation from the paper.
Go to the StructGuy_evaluation/evaluations/ directory and call:
python supervised_benchmark.py [overwrite]
- When called without overwrite, the precalculated results from StructGuy_evaluation/evaluations/supervised_evalution.tsv are taken.
- When called with overwrite, the complete supervised benchmark gets repeated and overwrites StructGuy_evaluation/evaluations/supervised_evalution.tsv in the process.
Warning
Calling this with overwrite will train and test 3255 individual models and is therefore computationally expensive.
The script will generate the StructGuy_evaluation/evaluations/supervised_benchmark_mean_rhos.tsv file that contains the results for the benchmark.
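The file name suggests that the reported values are mean correlation coefficients. Assuming that "rho" refers to Spearman's rank correlation between measured and predicted scores, the sketch below shows how such a mean could be computed. It is not taken from supervised_benchmark.py; the function names and the grouping into (measured, predicted) pairs, for example one pair per protein or per split, are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def mean_rho(pairs):
    """Average Spearman's rho over (measured, predicted) pairs."""
    rhos = [spearman_rho(measured, predicted) for measured, predicted in pairs]
    rhos = [rho for rho in rhos if rho is not None]
    return sum(rhos) / len(rhos)


# Made-up example: one perfectly ranked and one perfectly inverted set.
print(mean_rho([([1, 2, 3], [10, 20, 30]), ([1, 2, 3], [3, 2, 1])]))
```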