PRO-LDM

Implementation of PRO-LDM: A Conditional Latent Diffusion Model for Protein Sequence Design and Functional Optimization by Sitao Zhang*, Zixuan Jiang*, Rundong Huang, Shaoxun Mo, Letao Zhu, Peiheng Li, Ziyi Zhang, Emily Pan, Xi Chen, Yunfei Long, Qi Liang, Jin Tang, Renjing Xu†, and Rui Qing†. Please feel free to reach out to us at zjiang597@connect.hkust-gz.edu.cn or zhangsitao@sjtu.edu.cn with any questions.

Introduction

Here we present PRO-LDM, a modular multi-task framework for protein design that combines design fidelity with computational efficiency by running the diffusion model in latent space.

PRO-LDM

  • learns biological representations of natural proteins at both the local (amino-acid) and global (whole-sequence) levels

  • designs natural-like new sequences with enhanced diversity

  • conditionally designs new proteins with tailored properties or functions

Out-of-distribution (OOD) design is achieved by adjusting the classifier-free guidance, which lets PRO-LDM sample notably different regions of the latent space for functional optimization of natural proteins.
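As a rough illustration of how a guidance weight can shift sampling, one common form of classifier-free guidance blends an unconditional and a conditional noise prediction. The sketch below is not taken from the PRO-LDM code: the function name `cfg_combine`, the parameter `guidance_scale`, and the toy values are assumptions made for illustration only.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float],
                guidance_scale: float) -> List[float]:
    """One common classifier-free guidance blend (illustrative only).

    guidance_scale = 0 -> purely unconditional prediction
    guidance_scale = 1 -> purely conditional prediction
    guidance_scale > 1 -> extrapolates beyond the conditional prediction
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 3-dimensional latent (made-up numbers).
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]

for w in (0.0, 1.0, 3.0):
    print(w, cfg_combine(eps_u, eps_c, w))
```

Larger weights push the blended prediction further from the unconditional one, which is the intuition behind sampling more distant regions of the latent space.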

Environment Setup

# create new environment
conda create -n proldm_env python=3.8
conda activate proldm_env
# installing the dependencies yourself is recommended; alternatively, use the following command
pip3 install -r requirements.txt

How to Run

Pre-train Models

In main.py, set args.mode=train to train models, set args.multi_gpu=True to use DataParallel training, and use args.device_id to select the GPU IDs.

# 4-GPU training
python main.py --mode train --dataset <dataset_name> --multi_gpu True --device_id [0, 1, 2, 3]
# 1-GPU training
python main.py --mode train --dataset <dataset_name> --multi_gpu False --device_id [0]
# available values of args.dataset and the corresponding args.num_labels
# conditional datasets
# n-label: use dif_sample_label=0 for unconditional sampling and dif_sample_label=1..n for conditional sampling
# 5-label: NESP, ube4b
# 8-label: gifford, GFP, TAPE, pab1, bgl3, HIS7, CAPSD, B1LPA6
# unconditional datasets
# 0-label: MSA, MSA_RAW, MDH
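The dataset-to-label mapping above can be captured in a small lookup table to catch invalid label choices before launching a run. This is a minimal sketch, not part of the repository: `NUM_LABELS` and `check_sample_label` are hypothetical names, and the counts are simply copied from the list above.

```python
# Label counts copied from the list above; 0 means the dataset is unconditional.
NUM_LABELS = {
    "NESP": 5, "ube4b": 5,
    "gifford": 8, "GFP": 8, "TAPE": 8, "pab1": 8,
    "bgl3": 8, "HIS7": 8, "CAPSD": 8, "B1LPA6": 8,
    "MSA": 0, "MSA_RAW": 0, "MDH": 0,
}


def check_sample_label(dataset: str, label: int) -> bool:
    """Return True if `label` is a valid dif_sample_label for `dataset`.

    Hypothetical helper: label 0 is always allowed (unconditional sampling),
    and labels 1..n are allowed for an n-label dataset.
    """
    if dataset not in NUM_LABELS:
        raise KeyError(f"unknown dataset: {dataset}")
    return 0 <= label <= NUM_LABELS[dataset]


print(check_sample_label("TAPE", 8))   # valid conditional label for an 8-label dataset
print(check_sample_label("NESP", 6))   # out of range for a 5-label dataset
```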

Unconditional Training

# for example: train on the unconditional dataset MDH
python main.py --mode train --dataset MDH --n_epochs 1000
# checkpoints are saved to: PROLDM_OUTLIER/train_logs/MDH

Conditional Training

# train conditional dataset TAPE
python main.py --mode train --dataset TAPE --n_epochs 1000
# ckpt is saved in path: PROLDM_OUTLIER/train_logs/TAPE
# expected running time: ~5 hours on 4*V100

Evaluate Models

Using the TAPE dataset as an example, the results are written to PROLDM_OUTLIER/test_output/TAPE.

python main.py --mode eval --dataset <dataset_name> --eval_load_epoch <eval_epoch>
# for example
python main.py --mode eval --dataset TAPE --eval_load_epoch 1000

Seq Generation

Using the TAPE dataset as an example, the generated sequences are written to PROLDM_OUTLIER/generated_seq/TAPE.

Label 0 represents unconditional generation, and labels 1-8 represent conditional generation.

python main.py --mode sample --dataset <dataset_name> --dif_sample_label <label> --dif_sample_epoch <epoch>
# for example
python main.py --mode sample --dataset TAPE --dif_sample_label 0 --dif_sample_epoch 1000
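To sample every label of a dataset in one go, the command above can be generated in a loop. The following is a sketch under assumptions: the helper `sample_command` is hypothetical, it uses only the flags shown above, and it prints the commands instead of executing them.

```python
import shlex
from typing import List


def sample_command(dataset: str, label: int, epoch: int) -> List[str]:
    """Build the argument list for one sampling run (flags as shown above)."""
    return [
        "python", "main.py",
        "--mode", "sample",
        "--dataset", dataset,
        "--dif_sample_label", str(label),
        "--dif_sample_epoch", str(epoch),
    ]


# One unconditional run (label 0) plus one run per TAPE label (1-8).
commands = [sample_command("TAPE", label, 1000) for label in range(0, 9)]

for cmd in commands:
    # Print rather than execute; pass `cmd` to subprocess.run(cmd, check=True)
    # from the repository root to actually launch the run.
    print(shlex.join(cmd))
```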

Others

You can also run the code from a .sh script, for example as a batch job on a cluster:

#!/bin/sh
#BSUB -m node01-10
#BSUB -J PROLDM
#BSUB -n 8
#BSUB -W 7200
#BSUB -gpu "num=8"
#BSUB -o out.txt
#BSUB -e err.txt

module load anaconda3
module load cuda/11.6
source activate proldm_env

python main.py --mode train --dataset TAPE

You can download all raw code, checkpoints, and data with the following commands in a terminal.

pip install gdown
gdown --folder https://drive.google.com/drive/folders/1tX9PSrywPhW62HlExn3lWj2IGEsSzKFv

Model Checkpoint

Pre-trained model weights are saved in the folder ./train_logs.

Training Datasets

All training datasets are stored in the folder ./data, and most of them are split into two CSV files: <dataset_name>-train.csv and <dataset_name>-test.csv.
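Given this naming scheme, the expected file paths for a dataset can be derived and checked before training. This is a minimal sketch: `dataset_csv_paths` is a hypothetical helper, not a function from the repository, and it assumes the script runs from the repository root.

```python
from pathlib import Path
from typing import Tuple


def dataset_csv_paths(dataset: str, data_dir: str = "./data") -> Tuple[Path, Path]:
    """Return the (train, test) CSV paths implied by the naming scheme above."""
    root = Path(data_dir)
    return root / f"{dataset}-train.csv", root / f"{dataset}-test.csv"


train_csv, test_csv = dataset_csv_paths("TAPE")
for path in (train_csv, test_csv):
    # Only most datasets follow this layout, so check before reading.
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```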

License

The code and model weights are released under the MIT license. See the LICENSE file for details.

Citation

@article{zhang2025pro,
  title={PRO-LDM: A Conditional Latent Diffusion Model for Protein Sequence Design and Functional Optimization},
  author={Zhang, Sitao and Jiang, Zixuan and Huang, Rundong and Huang, Wenting and Peng, Siyuan and Mo, Shaoxun and Zhu, Letao and Li, Peiheng and Zhang, Ziyi and Pan, Emily and others},
  journal={Advanced Science},
  pages={e02723},
  year={2025},
  publisher={Wiley Online Library}
}

Credits

The code for this work is mainly built on the following repositories (many thanks to their authors); you may also consult them if you encounter any problems:

  1. Egbert Castro's ReLSO (https://github.com/KrishnaswamyLab/ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers)
  2. Phil Wang's PyTorch code for Classifier Free Guidance (https://github.com/lucidrains/classifier-free-guidance-pytorch)

About

[Adv. Sci. 2025] PRO-LDM: A Conditional Latent Diffusion Model for Protein Sequence Design and Functional Optimization
