
FusionProp: Protein Property Prediction Platform

FusionProp is a web-based platform designed for rapid and accurate prediction of multiple key protein properties, including solubility, thermostability, and toxicity. It leverages deep learning models and features from advanced Protein Language Models (PLMs) to make predictions directly from amino acid sequences, eliminating the need for 3D protein structures.

Key Features

  • Multi-Task Prediction: Obtain predictions for solubility, thermostability, and toxicity simultaneously in a single run.
  • Structure-Free: Predictions are entirely sequence-based, avoiding time-consuming and computationally intensive protein structure prediction steps.
  • High Performance: Utilizes optimized models and asynchronous task processing for fast predictions, suitable for high-throughput screening.
  • High Accuracy: Models achieve or approach state-of-the-art performance on benchmark datasets.
  • User-Friendly: Provides a clean web interface supporting direct sequence pasting or FASTA file uploads, with clear result presentation and download options.
  • Modular Design: Includes separate modules for feature extraction, model training, and web application services.

Technology Stack

  • Backend: Django
  • Asynchronous Tasks: Celery, Redis (as message broker and result backend)
  • Machine Learning/Deep Learning: PyTorch
  • Protein Language Models:
    • ESM (e.g., Facebook's ESM-2 series)
    • ESMC
  • Feature Extraction: Uses the transformers library (for ESM-2) and the esm package (for ESMC)
  • Database: SQLite (default, configurable to PostgreSQL, etc.)
  • Web Server/Deployment: Gunicorn, Docker, Nginx (recommended for reverse proxy in production)
  • Frontend: HTML, CSS, JavaScript, Bootstrap

Directory Structure

fusionprop/
├── data/                     # Stores training, testing datasets, and related raw data
├── extract_features/         # Contains scripts for extracting protein features using different PLMs
│   ├── extract_esm_1.py      # Extracts features using ESM-2
│   ├── extract_esmc_1.py     # Extracts features using ESMC
│   └── ...                   # Other feature extraction scripts and related shell scripts
├── train_script/             # Contains scripts for training models for different property predictions
│   ├── solubility/           # Solubility model training scripts (e.g., fusion_5_5_4_2_3.py)
│   ├── thermostability/      # Thermostability model training scripts (e.g., train_22_1_1.py)
│   └── toxicity/             # Toxicity model training scripts (e.g., train_12_2.py, evaluate_model.py)
├── web/                      # Core code for the Django web application and API (main body of the original project)
│   ├── Dockerfile
│   ├── docker-compose.yml
│   ├── manage.py
│   ├── predictor/            # Django app handling prediction logic, forms, views, tasks
│   ├── protein_feature_extractor/ # Feature extractor management module
│   ├── protein_predictor/    # Management and implementation of various prediction models
│   ├── protein_webapp/       # Django project configuration (settings, urls, celery)
│   ├── requirements.txt
│   ├── static/
│   └── templates/
├── .gitattributes            # Git LFS tracking rules
├── README.md                 # This file (English)
├── README_zh.md              # Chinese version of README
└── web_environment.yml       # Conda environment dependency file (generated by this project)

Environment Setup and Installation

1. Prerequisites

  • Python 3.11+
  • Conda (recommended for environment management)
  • Redis Server
  • (Optional, if using GPU) NVIDIA graphics card driver and CUDA Toolkit (e.g., 11.8+)
  • Git LFS (for handling large data and model files)

2. Installation Steps

a. Clone the repository:

    git clone https://github.com/cihebi2/fusionprop.git
    cd fusionprop

b. Install Git LFS (if not already installed): follow the instructions on the official Git LFS website, then initialize it within the repository:

    git lfs install
    git lfs pull   # Pull LFS-managed large files

c. Create and activate the Conda environment. Using the provided web_environment.yml file is recommended:

    conda env create -f web_environment.yml
    conda activate web   # Or the environment name specified in the yml file

Alternatively, create the environment manually (similar to protein_webapp_env in the original README):

    conda create -n fusionprop_env python=3.11
    conda activate fusionprop_env
    # Then install dependencies from web/requirements.txt (may need adjustments to match the yml)
    # pip install -r web/requirements.txt

d. Configure environment variables (if needed): based on web/protein_webapp/settings.py, you may need to configure database connections, model paths, etc. You can create a .env file and load it with python-dotenv, or set system environment variables directly (see the sketch after these steps).

e. Run the database migrations (for the web application):

    cd web
    python manage.py migrate
    cd ..

f. Create a superuser (optional, for accessing the Django Admin):

    cd web
    python manage.py createsuperuser
    cd ..
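
As a minimal sketch of step d, loading a .env file with python-dotenv near the top of web/protein_webapp/settings.py could look like the following. The variable names are illustrative assumptions, not the project's actual settings keys:

    # Hypothetical settings.py snippet; the setting names are assumptions.
    import os
    from pathlib import Path
    from dotenv import load_dotenv

    BASE_DIR = Path(__file__).resolve().parent.parent
    load_dotenv(BASE_DIR / ".env")  # Reads KEY=value pairs into os.environ

    # Fall back to safe defaults when a variable is not set.
    SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY", "dev-only-insecure-key")
    CELERY_BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0")
    ESM2_MODEL_NAME = os.environ.get("ESM2_MODEL_NAME", "facebook/esm2_t33_650M_UR50D")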

Running FusionProp

Local Development Mode (Web Application)

Ensure the Conda environment is activated and the Redis server is running.

  1. Start the Redis server: (according to your Redis installation method, e.g., by running redis-server directly)

  2. Start Celery Worker (in the fusionprop/web/ directory):

    cd web
    celery -A protein_webapp worker -l info -P gevent

    (If needed, you can start Celery Beat in another terminal: celery -A protein_webapp beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler)

  3. Start Django development server (in the fusionprop/web/ directory):

    python manage.py runserver

    By default, the website will run at http://127.0.0.1:8000/.

Docker Deployment Mode

The project is configured with Docker and Docker Compose for easy containerized deployment.

  1. Prerequisites:

    • Docker Desktop (Windows, macOS) or Docker Engine (Linux)
    • Docker Compose V2
    • (Optional, if using GPU) NVIDIA graphics card driver and NVIDIA Container Toolkit
  2. Configuration Files:

    • web/Dockerfile: Defines the steps to build the application image.
    • web/docker-compose.yml: Defines and orchestrates the web (Django + Gunicorn), worker (Celery worker), and redis services.
  3. Running Steps (in the fusionprop/web/ directory):

    a. Build images (if you modified the Dockerfile or code and are not using a pre-built image):

        cd web
        docker-compose build

    b. Start services:

        docker-compose up -d

    The -d flag runs the containers detached, in the background. After the services start, the Django application will listen on http://localhost:8000 (or the port configured in docker-compose.yml).

    c. View logs:

        docker-compose logs -f          # All services
        docker-compose logs -f web      # Django only
        docker-compose logs -f worker   # Celery worker only

    d. Stop services:

        docker-compose down   # Stop and remove containers
        # docker-compose stop # Stop containers only, do not remove them

    e. Restart services:

        docker-compose restart
        # Or:
        # docker-compose down
        # docker-compose up -d

Feature Extraction

The extract_features/ directory contains scripts for extracting embedding features from protein sequences. These features can then be used to train predictive models.

  • extract_esm_1.py: Uses ESM-2 models (e.g., facebook/esm2_t33_650M_UR50D) to extract features. It processes input CSV files, generates residue-level embeddings and mean-pooled protein-level representations for each sequence, and saves the results as .npy files. The script includes logic for sequence padding and masking.
  • extract_esmc_1.py: Uses ESMC models (e.g., esmc_600m) to extract features. Similar to the ESM-2 script but uses the CLS token for the protein-level representation.
  • Typically, .sh scripts are provided to conveniently run these Python scripts and may include parameterization for input files and output directories.

Example Usage (Conceptual):

cd extract_features
# conda activate <your_env_with_dependencies_like_transformers_esm>
# python extract_esm_1.py --input_csv ../data/your_sequences.csv --output_dir ./esm2_embeddings --model_name facebook/esm2_t33_650M_UR50D
# sh extract_esm_1.sh # (If the shell script is configured with parameters)
cd ..

For the actual parameters and paths, refer to the implementations within the scripts and their if __name__ == "__main__": sections. A schematic sketch of the embedding extraction follows.
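
For orientation, here is a minimal sketch of the residue-level and mean-pooled protein-level embeddings described above, using the transformers API directly. This is not the project's extraction code, which additionally handles CSV input, batching, padding, and masking:

    # Schematic only: extract_esm_1.py adds CSV I/O, batching, padding, and masking.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "facebook/esm2_t33_650M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        residue_level = model(**inputs).last_hidden_state  # (1, seq_len + 2, hidden)
    # Mean-pool over real residues, skipping the BOS/EOS special tokens.
    protein_level = residue_level[0, 1:-1].mean(dim=0)     # (hidden,)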

Model Training

The train_script/ directory contains scripts for training models for different protein property predictions. Each subdirectory corresponds to a specific property.

  • train_script/solubility/: E.g., fusion_5_5_4_2_3.py and its corresponding .sh script for training solubility prediction models. These scripts typically load pre-extracted features, define the model architecture (such as a weighted fusion strategy, sketched after this section), and run the training and evaluation pipeline.
  • train_script/thermostability/: E.g., train_22_1_1.py, for training thermostability prediction models.
  • train_script/toxicity/: E.g., train_12_2.py (training) and evaluate_model.py (evaluation), for toxicity prediction models.

Example Usage (Conceptual):

cd train_script/toxicity
# conda activate <your_env_with_training_dependencies_like_pytorch_pandas_sklearn>
# python train_12_2.py --feature_path ../../extract_features/esm2_embeddings/ --label_file ../../data/toxicity_labels.csv --save_path ./trained_toxicity_model/
# sh train_12_3_3.sh # (If the shell script is configured with parameters)
cd ../..

Refer to the specific instructions within each training script or its corresponding shell script for the exact commands and required parameters.
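
To picture the "weighted fusion strategy" mentioned above, here is an illustrative PyTorch module that blends two PLM embeddings with learned weights. It is not the architecture in fusion_5_5_4_2_3.py; the dimensions and layer choices are assumptions:

    # Illustrative only: not the architecture in fusion_5_5_4_2_3.py.
    import torch
    import torch.nn as nn

    class WeightedFusionHead(nn.Module):
        """Fuse two PLM embeddings with learned softmax weights, then predict."""

        def __init__(self, esm2_dim: int = 1280, esmc_dim: int = 1152, hidden: int = 256):
            super().__init__()
            self.proj_esm2 = nn.Linear(esm2_dim, hidden)
            self.proj_esmc = nn.Linear(esmc_dim, hidden)
            self.fusion_logits = nn.Parameter(torch.zeros(2))  # learned fusion weights
            self.head = nn.Sequential(nn.ReLU(), nn.Dropout(0.1), nn.Linear(hidden, 1))

        def forward(self, esm2_feat: torch.Tensor, esmc_feat: torch.Tensor) -> torch.Tensor:
            w = torch.softmax(self.fusion_logits, dim=0)
            fused = w[0] * self.proj_esm2(esm2_feat) + w[1] * self.proj_esmc(esmc_feat)
            return self.head(fused)  # one logit/score per protein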

Using the Web Application

Once the web application is running successfully via local development mode or Docker:

  1. Open your browser and navigate to http://localhost:8000 (or the address and port you have configured).
  2. Navigate to the prediction page (usually a link like "Start Prediction").
  3. You can paste one or more amino acid sequences directly or upload a FASTA-formatted file.
  4. After submitting the task, the system processes the request asynchronously. You can check the task status and retrieve the solubility, thermostability, and toxicity predictions once it completes (the general pattern is sketched after this list).
  5. The results page typically provides detailed prediction values, confidence scores, and allows downloading the results as a CSV file.
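
The asynchronous flow in step 4 follows the standard Celery pattern below. This is a generic sketch with hypothetical names; the project's real task and model code live in web/predictor/ and web/protein_predictor/:

    # Generic Celery pattern; the function and task names here are hypothetical.
    from celery import shared_task

    def _predict_one(sequence: str, prop: str) -> float:
        return 0.0  # stand-in for the real models in protein_predictor/

    @shared_task
    def predict_properties(sequence: str) -> dict:
        """One background job returning all three properties for a sequence."""
        return {prop: _predict_one(sequence, prop)
                for prop in ("solubility", "thermostability", "toxicity")}

    # A Django view enqueues the job, and the results page polls by task id:
    # async_result = predict_properties.delay("MKTAYIAK...")
    # if async_result.ready():
    #     payload = async_result.get()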

Important Notes

  • GPU/Memory Management: Protein language models and deep learning training/inference consume significant computational resources. Ensure your environment has sufficient RAM (and VRAM, if using a GPU). The model manager in the web application includes some automatic release mechanisms (illustrated after these notes).
  • Model Path Configuration: Correct configuration of model file paths is crucial for both local execution and Docker deployment. It is recommended to use environment variables combined with default paths in the code (such as Hugging Face Hub IDs) for flexibility.
  • Large Files: This project uses Git LFS to manage large data files and some model files. Ensure you have Git LFS installed and run git lfs pull after cloning the repository.
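
As an illustration of the kind of automatic release mentioned in the first note, the generic PyTorch pattern looks like this (it is not the project's model manager code):

    # Generic PyTorch memory-release pattern; not the project's model manager.
    import gc
    import torch

    def release_model(model: torch.nn.Module) -> None:
        model.cpu()                    # move weights off the GPU first
        del model                      # drop the Python reference
        gc.collect()                   # collect any lingering reference cycles
        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # return cached blocks to the CUDA driver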

Contributing

Contributions to this project are welcome! Please submit Pull Requests or create Issues to participate.

License

(Please add your project's license information here, e.g., MIT, Apache 2.0, etc. If undecided, you can leave it blank for now or write "To be determined".)


For the Chinese version, please see README_zh.md
