FusionProp is a web-based platform designed for rapid and accurate prediction of multiple key protein properties, including solubility, thermostability, and toxicity. It leverages deep learning models and features from advanced Protein Language Models (PLMs) to make predictions directly from amino acid sequences, eliminating the need for 3D protein structures.
- Multi-Task Prediction: Obtain predictions for solubility, thermostability, and toxicity simultaneously in a single run.
- Structure-Free: Predictions are entirely sequence-based, avoiding time-consuming and computationally intensive protein structure prediction steps.
- High Performance: Utilizes optimized models and asynchronous task processing for fast predictions, suitable for high-throughput screening.
- High Accuracy: Models achieve or approach state-of-the-art performance on benchmark datasets.
- User-Friendly: Provides a clean web interface supporting direct sequence pasting or FASTA file uploads, with clear result presentation and download options.
- Modular Design: Includes separate modules for feature extraction, model training, and web application services.
- Backend: Django
- Asynchronous Tasks: Celery, Redis (as message broker and result backend)
- Machine Learning/Deep Learning: PyTorch
- Protein Language Models:
  - ESM (e.g., Facebook's ESM-2 series)
  - ESMC
- Feature Extraction: Uses the `transformers` library (for ESM-2) and the `esm` package (from EvolutionaryScale, for ESMC)
- Database: SQLite (default, configurable to PostgreSQL, etc.)
- Web Server/Deployment: Gunicorn, Docker, Nginx (recommended as a reverse proxy in production)
- Frontend: HTML, CSS, JavaScript, Bootstrap
```
fusionprop/
├── data/ # Stores training, testing datasets, and related raw data
├── extract_features/ # Contains scripts for extracting protein features using different PLMs
│ ├── extract_esm_1.py # Extracts features using ESM-2
│ ├── extract_esmc_1.py # Extracts features using ESMC
│ └── ... # Other feature extraction scripts and related shell scripts
├── train_script/ # Contains scripts for training models for different property predictions
│ ├── solubility/ # Solubility model training scripts (e.g., fusion_5_5_4_2_3.py)
│ ├── thermostability/ # Thermostability model training scripts (e.g., train_22_1_1.py)
│ └── toxicity/ # Toxicity model training scripts (e.g., train_12_2.py, evaluate_model.py)
├── web/ # Core code for the Django web application and API (main body of the original project)
│ ├── Dockerfile
│ ├── docker-compose.yml
│ ├── manage.py
│ ├── predictor/ # Django app handling prediction logic, forms, views, tasks
│ ├── protein_feature_extractor/ # Feature extractor management module
│ ├── protein_predictor/ # Management and implementation of various prediction models
│ ├── protein_webapp/ # Django project configuration (settings, urls, celery)
│ ├── requirements.txt
│ ├── static/
│ └── templates/
├── .gitattributes # Git LFS tracking rules
├── README.md # This file (English)
├── README_zh.md # Chinese version of README
└── web_environment.yml # Conda environment dependency file (generated by this project)
```
- Python 3.11+
- Conda (recommended for environment management)
- Redis Server
- (Optional, if using GPU) NVIDIA graphics card driver and CUDA Toolkit (e.g., 11.8+)
- Git LFS (for handling large data and model files)
a. Clone the repository:
```bash
git clone https://github.com/cihebi2/fusionprop.git
cd fusionprop
```
b. Install Git LFS: (if not already installed)
Follow the instructions on the official Git LFS website. Then initialize it within the repository:
```bash
git lfs install
git lfs pull  # Pull LFS-managed large files
```
c. Create and activate Conda environment:
You can use the provided web_environment.yml file to create the environment (recommended):
```bash
conda env create -f web_environment.yml
conda activate web  # Or the environment name specified in the yml file
```
Alternatively, if you want to create it manually (similar to protein_webapp_env in the original README):
```bash
conda create -n fusionprop_env python=3.11
conda activate fusionprop_env
# Then install dependencies from web/requirements.txt (may need adjustments to match the yml)
# pip install -r web/requirements.txt
```
d. Configure environment variables (if needed):
Based on web/protein_webapp/settings.py, you might need to configure database connections, model paths, etc. You can create a .env file and use python-dotenv to load it, or set system environment variables directly.
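For example, a minimal loading pattern in `web/protein_webapp/settings.py` could look like the sketch below; the variable names (`REDIS_URL`, `MODEL_DIR`) are illustrative and may differ from what the project actually reads:

```python
# settings.py (sketch) -- load optional overrides from a .env file via python-dotenv
import os
from pathlib import Path
from dotenv import load_dotenv

BASE_DIR = Path(__file__).resolve().parent.parent
load_dotenv(BASE_DIR / ".env")  # no-op if the file does not exist

# Illustrative settings read from the environment, with defaults as fallback
CELERY_BROKER_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
MODEL_DIR = os.getenv("MODEL_DIR", str(BASE_DIR / "models"))
```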
e. Database migrations (for the web application):
```bash
cd web
python manage.py migrate
cd ..
```
f. Create a superuser (optional, for accessing Django Admin):
```bash
cd web
python manage.py createsuperuser
cd ..
```
Ensure the Conda environment is activated and the Redis server is running.
- Start the Redis server (according to your Redis installation method, e.g., by running `redis-server` directly).
- Start the Celery worker (in the `fusionprop/web/` directory):
  ```bash
  cd web
  celery -A protein_webapp worker -l info -P gevent
  ```
  If needed, you can start Celery Beat in another terminal:
  ```bash
  celery -A protein_webapp beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
  ```
- Start the Django development server (in the `fusionprop/web/` directory):
  ```bash
  python manage.py runserver
  ```
  By default, the website will run at `http://127.0.0.1:8000/`.
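For reference, the `-A protein_webapp` argument points Celery at the project's Celery application. The conventional Django + Celery wiring that this command assumes looks roughly like the sketch below (a generic pattern, not necessarily the project's exact `protein_webapp/celery.py`):

```python
# protein_webapp/celery.py -- conventional Django + Celery wiring (sketch)
import os
from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "protein_webapp.settings")

app = Celery("protein_webapp")
# Read CELERY_* settings (e.g., the Redis broker URL) from Django's settings.py
app.config_from_object("django.conf:settings", namespace="CELERY")
# Auto-discover tasks.py modules in installed apps such as `predictor`
app.autodiscover_tasks()
```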
The project is configured with Docker and Docker Compose for easy containerized deployment.
- Prerequisites:
  - Docker Desktop (Windows, macOS) or Docker Engine (Linux)
  - Docker Compose V2
  - (Optional, if using GPU) NVIDIA graphics card driver and NVIDIA Container Toolkit
- Configuration Files:
  - `web/Dockerfile`: Defines the steps to build the application image.
  - `web/docker-compose.yml`: Defines and orchestrates the `web` (Django + Gunicorn), `worker` (Celery worker), and `redis` services.
- Running Steps (in the `fusionprop/web/` directory):

  a. Build images (if you modified the Dockerfile or code and are not using a pre-built image):
  ```bash
  cd web
  docker-compose build
  ```
  b. Start services:
  ```bash
  docker-compose up -d
  ```
  (The `-d` flag runs the containers in detached mode in the background.) After the services start, the Django application listens on `http://localhost:8000` (or the port configured in `docker-compose.yml`).

  c. View logs:
  ```bash
  docker-compose logs -f
  docker-compose logs -f web
  docker-compose logs -f worker
  ```
  d. Stop services:
  ```bash
  docker-compose down   # Stop and remove containers
  # docker-compose stop # Stop containers only, do not remove
  ```
  e. Restart services:
  ```bash
  docker-compose restart
  # Or
  # docker-compose down
  # docker-compose up -d
  ```
The extract_features/ directory contains scripts for extracting embedding features from protein sequences. These features can then be used to train predictive models.
- `extract_esm_1.py`: Uses ESM-2 models (e.g., `facebook/esm2_t33_650M_UR50D`) to extract features. It processes input CSV files, generates residue-level embeddings and mean-pooled protein-level representations for each sequence, and saves the results as `.npy` files. The script includes logic for sequence padding and masking.
- `extract_esmc_1.py`: Uses ESMC models (e.g., `esmc_600m`) to extract features. Similar to the ESM-2 script, but uses the CLS token for the protein-level representation.
- Typically, `.sh` scripts are provided to conveniently run these Python scripts and may include parameterization for input files and output directories.
Example Usage (Conceptual):
```bash
cd extract_features
# conda activate <your_env_with_dependencies_like_transformers_esm>
# python extract_esm_1.py --input_csv ../data/your_sequences.csv --output_dir ./esm2_embeddings --model_name facebook/esm2_t33_650M_UR50D
# sh extract_esm_1.sh  # (If the shell script is configured with parameters)
cd ..
```

Please refer to the specific implementations within the scripts and the `if __name__ == "__main__":` section for actual parameters and paths.
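For orientation, the core of such an extraction step with the `transformers` library typically looks like the sketch below. The model name matches the example above; the example sequence, output filename, batching, and CSV handling are illustrative and not taken from the project's scripts.

```python
# Minimal sketch: ESM-2 embeddings via transformers (not the project's exact script)
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino acid sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (seq_len + special tokens, dim)

residue_emb = hidden[1:-1]              # drop BOS/EOS -> residue-level embeddings
protein_emb = residue_emb.mean(dim=0)   # mean-pooled protein-level representation

np.save("example_protein.npy", protein_emb.numpy())
```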
The train_script/ directory contains scripts for training models for different protein property predictions. Each subdirectory corresponds to a specific property.
- `train_script/solubility/`: e.g., `fusion_5_5_4_2_3.py` and the corresponding `.sh` script for training solubility prediction models. These scripts typically load pre-extracted features, define the model architecture (such as a weighted fusion strategy), and run the training and evaluation pipeline.
- `train_script/thermostability/`: e.g., `train_22_1_1.py`, for training thermostability prediction models.
- `train_script/toxicity/`: e.g., `train_12_2.py` (training) and `evaluate_model.py` (evaluation), for toxicity prediction models.
Example Usage (Conceptual):
```bash
cd train_script/toxicity
# conda activate <your_env_with_training_dependencies_like_pytorch_pandas_sklearn>
# python train_12_2.py --feature_path ../../extract_features/esm2_embeddings/ --label_file ../../data/toxicity_labels.csv --save_path ./trained_toxicity_model/
# sh train_12_3_3.sh  # (If the shell script is configured with parameters)
cd ../..
```

Refer to the specific instructions within each training script or its corresponding shell script for the exact commands and required parameters.
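To make the training flow concrete, a stripped-down sketch of such a pipeline is shown below: load precomputed `.npy` features, then train a small classification head with PyTorch. The file names and head architecture are illustrative only and do not reproduce the project's fusion models.

```python
# Toy training sketch on precomputed embeddings (illustrative, not the project's fusion model)
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assume one protein-level embedding per sequence plus a binary label (e.g., toxic / non-toxic)
X = torch.tensor(np.load("esm2_embeddings/protein_embeddings.npy"), dtype=torch.float32)
y = torch.tensor(np.load("esm2_embeddings/labels.npy"), dtype=torch.float32)

loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

head = nn.Sequential(nn.Linear(X.shape[1], 256), nn.ReLU(), nn.Dropout(0.2), nn.Linear(256, 1))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(head(xb).squeeze(-1), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")

torch.save(head.state_dict(), "trained_toxicity_model.pt")
```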
Once the web application is running successfully via local development mode or Docker:
- Open your browser and navigate to `http://localhost:8000` (or the address and port you have configured).
- Navigate to the prediction page (usually a link like "Start Prediction").
- You can paste one or more amino acid sequences directly or upload a FASTA-formatted file (see the format example below).
- After submitting the task, the system will process the request asynchronously. You can check the task status and retrieve the prediction results for solubility, thermostability, and toxicity upon completion.
- The results page typically provides detailed prediction values, confidence scores, and allows downloading the results as a CSV file.
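For the upload option, a FASTA file is plain text with a `>` header line followed by the amino acid sequence, for example (the sequences shown are arbitrary examples):

```
>example_protein_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>example_protein_2
GSHMSLFDFFKNKGSALSAEQ
```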
- GPU/Memory Management: Protein language models and deep learning model training/inference consume significant computational resources. Ensure your environment has sufficient RAM (and VRAM, if using GPU). The model manager in the web application includes some auto-release mechanisms.
- Model Path Configuration: Correct configuration of model file paths is crucial for both local execution and Docker deployment. It's recommended to use environment variables combined with default paths in the code (such as Hugging Face Hub IDs) for flexibility (see the sketch after this list).
- Large Files: This project uses Git LFS to manage large data files and some model files. Ensure you have Git LFS installed and run `git lfs pull` after cloning the repository.
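As an illustration of the environment-variable-with-fallback pattern and of explicitly releasing cached GPU memory, a sketch is shown below; the variable name `SOLUBILITY_MODEL_PATH` and the helper functions are hypothetical, not the project's actual API:

```python
# Sketch only: resolve a model location from an environment variable with a Hub ID fallback,
# and free cached GPU memory once a model is no longer referenced.
import gc
import os
import torch

def resolve_model_path() -> str:
    # Prefer an explicitly configured local path; otherwise fall back to a Hugging Face Hub ID
    return os.getenv("SOLUBILITY_MODEL_PATH", "facebook/esm2_t33_650M_UR50D")

def release_gpu_memory() -> None:
    # Call after setting model references to None so cached GPU blocks can be returned
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```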
Contributions to this project are welcome! Please submit Pull Requests or create Issues to participate.
(Please add your project's license information here, e.g., MIT, Apache 2.0, etc. If undecided, you can leave it blank for now or write "To be determined".)
For the Chinese version, please see README_zh.md