Implementation of PRO-LDM: A Conditional Latent Diffusion Model for Protein Sequence Design and Functional Optimization by Sitao Zhang*, Zixuan Jiang*, Rundong Huang, Shaoxun Mo, Letao Zhu, Peiheng Li, Ziyi Zhang, Emily Pan, Xi Chen, Yunfei Long, Qi Liang, Jin Tang, Renjing Xu†, and Rui Qing†. Please feel free to reach out to us at zjiang597@connect.hkust-gz.edu.cn or zhangsitao@sjtu.edu.cn with any questions.
Here we present PRO-LDM: a modular multi-tasking framework that combines design fidelity and computational efficiency by integrating a diffusion model in latent space to perform protein design.
PRO-LDM
- learns biological representations from natural proteins at both the amino-acid (local) and whole-sequence (global) levels
- designs natural-like new sequences with enhanced diversity
- conditionally designs new proteins with tailored properties or functions
Out-of-distribution (OOD) design can be implemented by adjusting the classifier-free guidance scale, which enables PRO-LDM to sample notably different regions of the latent space for functional optimization of natural proteins.
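For intuition, the snippet below sketches one denoising step with classifier-free guidance in latent space. The names denoiser, cfg_denoise_step, and the guidance weight w are illustrative only, not PRO-LDM's actual API.

import torch

def cfg_denoise_step(denoiser, z_t, t, label, w):
    # Blend unconditional and label-conditioned noise predictions.
    # w = 1.0 recovers plain conditional sampling; larger w pushes samples
    # further from the unconditional distribution, which is how OOD design is steered.
    eps_uncond = denoiser(z_t, t, label=None)   # unconditional prediction
    eps_cond = denoiser(z_t, t, label=label)    # label-conditioned prediction
    return eps_uncond + w * (eps_cond - eps_uncond)

if __name__ == "__main__":
    # toy stand-in for a trained latent-space denoiser
    def toy_denoiser(z_t, t, label=None):
        shift = 0.0 if label is None else float(label)
        return 0.1 * z_t + shift

    z = torch.randn(4, 16)  # a batch of noisy latent vectors
    eps = cfg_denoise_step(toy_denoiser, z, t=10, label=3, w=2.0)
    print(eps.shape)  # torch.Size([4, 16])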
# create new environment
conda create -n proldm_env python=3.8
conda activate proldm_env
# preparing the new environment yourself is recommended, or you can use the following command
pip3 install -r requirements.txt

In main.py you can set args.mode=train to train models, set args.multi_gpu=True to use DataParallel training, and use args.device_id to choose which GPUs to use.
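As a rough illustration only, these flags could be wired to torch.nn.DataParallel along the following lines. This is a hypothetical sketch, not the actual main.py: the str2bool helper and the stand-in model are assumptions, and the real main.py may parse --device_id differently (e.g. the bracketed list shown in the commands below).

import argparse
import torch

def str2bool(v):
    # parse "True"/"False" strings passed on the command line
    return str(v).lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--mode", choices=["train", "eval", "sample"], default="train")
parser.add_argument("--multi_gpu", type=str2bool, default=False)
parser.add_argument("--device_id", type=int, nargs="+", default=[0])
args = parser.parse_args()

model = torch.nn.Linear(8, 8)  # stand-in for the actual PRO-LDM model
if args.multi_gpu:
    # replicate the model across the listed GPUs for DataParallel training
    model = torch.nn.DataParallel(model, device_ids=args.device_id)
device = f"cuda:{args.device_id[0]}" if torch.cuda.is_available() else "cpu"
model = model.to(device)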
# 4-GPU training
python main.py --mode train --dataset <dataset_name> --multi_gpu True --device_id [0, 1, 2, 3]
# 1-GPU training
python main.py --mode train --dataset <dataset_name> --multi_gpu False --device_id [0]

# available args.dataset and corresponding args.num_labels
# conditional datasets
# n-label means you can choose dif_sample_label=0 for unconditional sampling, and dif_sample_label=1~n for conditional sampling
5-label: NESP, ube4b
8-label: gifford, GFP, TAPE, pab1, bgl3, HIS7, CAPSD, B1LPA6
# unconditional datasets
0-label: MSA, MSA_RAW, MDH

# for example: train the unconditional dataset MDH
python main.py --mode train --dataset MDH --n_epochs 1000
# ckpt is saved in path: PROLDM_OUTLIER/train_logs/MDH

# train the conditional dataset TAPE
python main.py --mode train --dataset TAPE --n_epochs 1000
# ckpt is saved in path: PROLDM_OUTLIER/train_logs/TAPE
# expected running time: ~5 hours on 4*V100

We use dataset TAPE as an example; the evaluation result is generated in path: PROLDM_OUTLIER/test_output/TAPE.
python main.py --mode eval --dataset <dataset_name> --eval_load_epoch <eval_epoch>
# for example
python main.py --mode eval --dataset TAPE --eval_load_epoch 1000

We use dataset TAPE as an example; the generated sequences are saved in path: PROLDM_OUTLIER/generated_seq/TAPE.
Label 0 represents unconditional generation, and labels 1-8 represent conditional generation.
python main.py --mode sample --dataset <dataset_name> --dif_sample_label <label> --dif_sample_epoch <epoch>
# for example
python main.py --mode sample --dataset TAPE --dif_sample_label 0 --dif_sample_epoch 1000
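To sample every TAPE label in one go, a hypothetical convenience sweep (reusing only the documented flags; the epoch value is just the example's) could look like:

import subprocess

# sweep all TAPE labels: 0 = unconditional, 1-8 = conditional
for label in range(9):
    subprocess.run(
        ["python", "main.py", "--mode", "sample", "--dataset", "TAPE",
         "--dif_sample_label", str(label), "--dif_sample_epoch", "1000"],
        check=True,  # stop if any sampling run fails
    )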
You can also use a .sh file to run the code; the example below uses LSF's #BSUB directives and can be submitted with bsub < <script_name>.sh:
#!/bin/sh
#BSUB -m node01-10
#BSUB -J PROLDM
#BSUB -n 8
#BSUB -W 7200
#BSUB -gpu "num=8"
#BSUB -o out.txt
#BSUB -e err.txt
module load anaconda3
module load cuda/11.6
source activate proldm_env
python main.py --mode train --dataset TAPE

# download pre-trained model weights with gdown
pip install gdown
gdown --folder https://drive.google.com/drive/folders/1tX9PSrywPhW62HlExn3lWj2IGEsSzKFv
We save pre-trained model weights in the folder ./train_logs.
All training datasets are preserved in the folder ./data, and most of them are divided into two csv files: <dataset_name>-train.csv and <dataset_name>-test.csv.
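For a quick look at one split, here is a minimal pandas sketch, assuming pandas is installed; TAPE is used as the example, and the column layout varies by dataset:

import pandas as pd

# file names follow the <dataset_name>-train.csv / <dataset_name>-test.csv pattern
train_df = pd.read_csv("data/TAPE-train.csv")
test_df = pd.read_csv("data/TAPE-test.csv")
print(train_df.columns.tolist())    # inspect the available fields
print(len(train_df), len(test_df))  # split sizes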
The code and model weights are released under the MIT license. See the LICENSE file for details.
@article{zhang2025pro,
title={PRO-LDM: A Conditional Latent Diffusion Model for Protein Sequence Design and Functional Optimization},
author={Zhang, Sitao and Jiang, Zixuan and Huang, Rundong and Huang, Wenting and Peng, Siyuan and Mo, Shaoxun and Zhu, Letao and Li, Peiheng and Zhang, Ziyi and Pan, Emily and others},
journal={Advanced Science},
pages={e02723},
year={2025},
publisher={Wiley Online Library}
}
The code of this work is mainly built on the following repositories (many thanks to them); you can also check them if you encounter any problems:
- Egbert Castro's ReLSO (https://github.com/KrishnaswamyLab/ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers)
- Phil Wang's Pytorch code for Classifier Free Guidance (https://github.com/lucidrains/classifier-free-guidance-pytorch)
