THGLab/SmileyLlama

SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

This repository contains code and data used by the SmileyLlama project to train SmileyLlama and its variants, and to produce results used in the paper. The SmileyLlama model is not hosted here; rather, it's hosted on huggingface, along with variants trained for adhering to properties specified in a prompt and for generating binders to SARS-CoV-2 Main Protease (MPro).

For those who want a gentle, yet hands-on introduction to SmileyLlama, download the Demo.ipynb jupyter notebook, which provides a demonstration of SmileyLlama's abilities and a brief tutorial on writing prompts for it and related models.

System Requirements

Supervised fine-tuning and DPO of SmileyLlama are very memory-intensive because of the model's parameter count. To replicate the work in this study, a 4xGPU node with 48GB VRAM per GPU is recommended; this was tested using Nvidia A40 GPUs. Lower VRAM will result in an out-of-memory error. For smaller setups, the gradient_accumulation_steps setting in the relevant axolotl configuration files (sft/8b-lora32/cf_lora.yml, prompt_following/dpo-instr/cf_dpo_lora.yml) should be adjusted such that the overall batch size remains unchanged. In axolotl, the total batch size is the product of the micro batch size, gradient accumulation steps, and number of GPUs.
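
The batch-size rule above is simple arithmetic, and a short Python sketch may help when adapting the configs to a different number of GPUs. The function names and the example numbers below are illustrative assumptions, not values taken from the repository's configuration files.

```python
def total_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Total batch size = micro batch size x accumulation steps x GPUs."""
    return micro_batch_size * grad_accum_steps * num_gpus


def grad_accum_for(target_total: int, micro_batch_size: int, num_gpus: int) -> int:
    """Accumulation steps needed to keep `target_total` on a different setup."""
    per_step = micro_batch_size * num_gpus
    if target_total % per_step != 0:
        raise ValueError(
            f"total batch size {target_total} is not divisible by "
            f"micro_batch_size * num_gpus = {per_step}"
        )
    return target_total // per_step


# Hypothetical reference run: 4 GPUs, micro batch size 2, 4 accumulation steps.
reference = total_batch_size(micro_batch_size=2, grad_accum_steps=4, num_gpus=4)

# The same total batch size on 2 GPUs needs twice the accumulation steps.
print(grad_accum_for(reference, micro_batch_size=2, num_gpus=2))  # prints 8
```

Read the actual micro batch size and accumulation steps from the configuration file you are adapting before applying this.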

Inference on SmileyLlama should not be done on a GPU with less than 16 GB VRAM. Inference can be done using the CPU, but this will be slow.
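
As a rough sanity check on this figure, the memory needed just to hold the weights can be estimated from the parameter count. The sketch below assumes roughly 8 billion parameters (suggested by the "8B" in the model name) stored in a 16-bit format; both are assumptions, and activations and the KV cache need additional memory on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed: ~8e9 parameters stored as 16-bit values (2 bytes each).
print(weight_memory_gb(8e9))                     # prints 16.0
# The same weights in 32-bit precision would need twice as much.
print(weight_memory_gb(8e9, bytes_per_param=4))  # prints 32.0
```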

Tested on Python 3.10.12, gcc 11.4.0 (download, installation instructions), and CUDA 11.8.0 (download; installation instructions appear after selecting an operating system and version). Python 3.10 can be found at Python's website, and version management can be done with pyenv (download and installation instructions). Runs on Linux; tested on Rocky Linux 8.10 (Green Obsidian).

Installation Guide

A few environments are required to be able to replicate the work in SmileyLlama, including finetuning the models. You will also need environments available to perform guacamol analysis, PLIP, and iMiner optimization.

Make sure to have Python 3.10 and CUDA version 11 or 12 (tested on 11.8 and 12.2) installed. Your GPU should be compatible with both, since some environments require CUDA 11.x and others require CUDA 12.x. Python 3.10 can be found at Python's website, and version management can be done with pyenv. You can also download CUDA 11.8.0 here.

axo (use for fine-tuning)

cd envs
python -m venv axo
source axo/bin/activate
pip install packaging wheel psutil
pip install torch==2.3.1
pip install flash-attn==2.6.2 --no-build-isolation
pip install -r axo-requirements.txt

ana-env (main environment for analysis)

cd envs
python -m venv ana-env
source ana-env/bin/activate
pip install packaging wheel
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r ana-env-requirements.txt
cd ../scripts
pip install -e .
python -m ipykernel install --user --name=ana-env

Also, remember to create kernels for use in jupyter notebooks.

mol-benchmark (for guacamol analysis)

Follow steps on https://github.com/BenevolentAI/guacamol

Installing these will take somewhere on the order of 10 minutes on a "normal" desktop computer if all goes well. It can take much longer if flash attention is compiled (on the order of hours) instead of loaded from a prebuilt binary.

Demo

You can use ana-env to run this demo, or any environment with torch, transformers, and rdkit. The demo folder contains a Jupyter notebook that will take you through how to download and use SmileyLlama to generate molecules with some features. SmileyLlama's weights are about 16GB, so the time it takes to download them will vary based on your internet speed. Outside of this, a "normal" desktop will probably take on the order of 5 minutes to run the demo. The outputs of the notebook are already shown, although some parts involve randomness, so your outputs may differ.

Instructions for Use

To download and use SmileyLlama or its derivative models, you can visit this link. All scripts and Jupyter notebooks in this codebase reference either these models or the Llama models by their huggingface identifiers (e.g. "THGLab/Llama-3.1-8B-SmileyLlama-1.1").

We've included code and data required to regenerate the main results of this paper.

The code and data required to generate SMILES strings and run the guacamol benchmark on Llama-3.1-Instruct are in llama_k_shot. Code and data used to compare SmileyLlama and Llama are in mmlu (you need to install lm-evaluation-harness to run the benchmark). The lm-evaluation-harness can also be used to calculate perplexity on wikitext using

lm_eval --model=hf --model_args="pretrained=/path/to/model" --tasks=wikitext

sft

You can download the necessary data for this section from figshare in the sft directory with the following commands:

cd sft
wget -O random_smiles.jsonl https://ndownloader.figshare.com/files/60278828
wget -O chembl_random_smiles.txt https://ndownloader.figshare.com/files/60278825
wget -O chembl_33.csv https://ndownloader.figshare.com/files/60278831

To create new datasets with random SMILES from the chembl_random_smiles.txt and random_smiles.jsonl files, use the make_sft_data.ipynb notebook.

sft/8b-lora32 contains the config file used by axolotl for fine-tuning. To restart fine-tuning, you'll first need to gain access to Llama. You can do this either by requesting access to Llama through your huggingface account and then exporting your huggingface access token, or by acquiring Llama from another source and specifying the path to the Llama weights instead of meta-llama/Llama-3.1-8B-Instruct in the first line of the config file. Then preprocess the data, begin fine-tuning, and merge the LoRA into the weights.

# Export HuggingFace Token to download Llama (or modify the path to point to the weights)
export HF_TOKEN=<Your HuggingFace Token>
# Preprocess the data
CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.preprocess cf_lora.yml
# Begin fine-tuning
srun accelerate launch -m axolotl.cli.train cf_lora.yml
# Merge the LoRA into the weights. The new, fine-tuned models' weights will be in `sft/outputs/merged`
python3 -m axolotl.cli.merge_lora $(pwd)/cf_lora.yml --lora_model_dir="$(pwd)/outputs"
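
If you acquired Llama from another source, the first-line substitution described above can also be scripted. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and the `base_model` key in the stand-in text is an assumption about what the first line of the config looks like.

```python
HUB_ID = "meta-llama/Llama-3.1-8B-Instruct"


def point_config_at_local_weights(config_text: str, local_path: str) -> str:
    """Replace the Hub identifier on the first line with a local weights path."""
    first, sep, rest = config_text.partition("\n")
    if HUB_ID not in first:
        # Refuse to guess if the first line is not what we expect.
        raise ValueError(f"first line does not reference {HUB_ID}")
    return first.replace(HUB_ID, local_path) + sep + rest


# Stand-in config text; the real cf_lora.yml may look different.
example = "base_model: meta-llama/Llama-3.1-8B-Instruct\nadapter: lora\n"
print(point_config_at_local_weights(example, "/path/to/llama"))
```

To apply it to the real file, read cf_lora.yml into a string, pass it through the function, and write the result back, keeping a copy of the original.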

prompt_following

Code and data for analyzing the ability of SmileyLlama to follow instructions in the prompt before and after DPO for instruction following can be found in the prompt_following folder.

As in the previous section, to restart DPO, modify prompt_following/dpo-instr/cf_dpo_lora.yml to use the relevant paths on your system, then run

srun accelerate launch --use-deepspeed -m axolotl.cli.train cf_dpo_lora.yml --dataset_processes=1
python3 -m axolotl.cli.merge_lora $(pwd)/cf_dpo_lora.yml --lora_model_dir="$(pwd)/outputs"

The data required for this can be found in prompt_following/dpo-instr/dpodataset/dpodataset.jsonl.
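
Before starting a long DPO run, it can be worth checking that the dataset file is readable. The sketch below is an illustrative helper (the function name is hypothetical) that counts the records in a JSON Lines file and tallies their top-level keys. It assumes only that each non-empty line is a JSON object and makes no assumption about which fields the dataset contains.

```python
import json
from collections import Counter


def summarize_jsonl(path: str) -> tuple[int, Counter]:
    """Return the number of records and a count of each top-level key."""
    n_records = 0
    key_counts: Counter = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            n_records += 1
            key_counts.update(record.keys())
    return n_records, key_counts
```

For example, `summarize_jsonl("prompt_following/dpo-instr/dpodataset/dpodataset.jsonl")` called from the repository root reports how many records the file holds and which fields they carry.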

mpro

This directory contains all relevant parts of the project used to optimize SmileyLlama for inhibition of SARS-CoV-2 Main Protease (MPro). Analysis of a few sample ligands generated by SmileyLlama after this optimization can be found in mpro/ligand_analysis (the model that generated them is on huggingface as THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro). Files for DPO and outputs of the model throughout the training process can be found in mpro/run/. Outputs from iMiner used for comparison in our paper can be found in mpro/iminer_ref_details. The Jupyter notebooks used to generate figures relating to optimization for MPro inhibition can be found at mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb.

Reproducing the training run

You'll need to download and install ADFR (Downloads and installation instructions can be found here)

Next, you'll need to create a new environment for vina usage (conda and pip are both required for usage here)

# In the main SmileyLlama directory and base conda environment
cd envs
conda env create -f vina.yml -n vina
conda activate vina
cd ..
git clone https://github.com/THGLab/iMiner
cd iMiner
git checkout smileyllama
pip install -e .
cd ..
pip install selfies
pip install openbabel-wheel

Ensure that the ADFR_install_path in line 6 of iMiner/iMiner/pathlib.py points to the compiled ADFR binary.

vina-cpu

Reproducing the mpro optimization using vina (on the CPU rather than GPU) is generally easier and less finicky, though you may still need to troubleshoot depending on your system.

You'll need to change the iMiner config file referenced by the reward function calculator; to do so, change directory to mpro/tools and edit integrated_score.py with sed. You'll also need to specify the path to the SARS-CoV-2 MPro PDB file, either by editing the config file directly or with sed.

cd mpro/tools
sed -i.bak 's|integrated_config.yaml|integrated_config_cpu.yaml|g' integrated_score.py
sed -i.bak "s|/path/to/Mpro_7L11.pdb|$(pwd)/Mpro_7L11.pdb|g" integrated_config_cpu.yaml
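
If sed is unavailable or behaves differently on your system, the same in-place edits can be sketched in Python. This is an illustrative equivalent rather than part of the repository, and the function name is hypothetical. Unlike the sed patterns above, it matches the search string literally instead of as a regular expression.

```python
import shutil
from pathlib import Path


def replace_in_file(path: str, old: str, new: str) -> int:
    """Replace every literal `old` with `new`, keeping a `.bak` copy.

    Returns the number of occurrences replaced, so the caller can
    confirm that the substitution actually matched something.
    """
    target = Path(path)
    text = target.read_text(encoding="utf-8")
    count = text.count(old)
    # Mirror `sed -i.bak`: save the original next to the edited file.
    shutil.copy2(target, target.with_name(target.name + ".bak"))
    target.write_text(text.replace(old, new), encoding="utf-8")
    return count


# Usage, run from mpro/tools (mirrors the two sed commands above):
# replace_in_file("integrated_score.py",
#                 "integrated_config.yaml", "integrated_config_cpu.yaml")
# replace_in_file("integrated_config_cpu.yaml",
#                 "/path/to/Mpro_7L11.pdb", str(Path.cwd() / "Mpro_7L11.pdb"))
```

A return value of 0 means the placeholder was not found, which usually indicates the file was already edited or the command was run from the wrong directory.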

Finally, although there is an autodock vina binary file in iMiner/iMiner/docking/bins, you may need to recompile this for your system. Instructions for doing so can be found here.

vina-gpu

To reproduce our Mpro optimization, you'll need to install iMiner and Vina GPU. However, this will generally involve recompiling Vina GPU.

Create the vina-gpu environment.

# In the main SmileyLlama directory and base conda environment
cd envs
conda env create -f vina-gpu-env.yml -n vina-gpu-env

Change the paths in the YAML file.

cd mpro/tools
sed -i.bak "s|/path/to/Mpro_7L11.pdb|$(pwd)/Mpro_7L11.pdb|g" integrated_config.yaml
cd -

Finally, you'll need to compile and install Vina-GPU in the vina-gpu-env virtual environment, using the instructions on their github page. Add the vina-gpu environment to your LD_LIBRARY_PATH with

LD_LIBRARY_PATH=/path/to/that/environment/lib:$LD_LIBRARY_PATH

Ensure that boost is added to your path with

export LD_LIBRARY_PATH=/path/to/boost_1_83_0/lib:$LD_LIBRARY_PATH

You will additionally need to

Running the optimization

Go to mpro/run and either copy or download the weights into a directory labeled outputs/merged

# download weights (using the ana-env environment)
huggingface-cli download THGLab/Llama-3.1-8B-SmileyLlama-1.1 --local-dir outputs/merged

Modify the prep_iter.py file to point to the vina env's python and the iMiner pathlib.py file to be able to access scripts in the vina environment.

# while in the SmileyLlama main directory and vina conda environment
sed -i "s|/path/to/vina/env/bin|$CONDA_PREFIX/bin/|g" iMiner/iMiner/pathlib.py
sed -i "s|/path/to/vinaenv/bin/python|$CONDA_PREFIX/bin/python|g" mpro/run/prep_iter.py

Finally, start the training run with

cd mpro/run
bash run.sh

which will run the procedure for MPro optimization through DPO as discussed in our paper.

To reproduce key results

To reproduce the Llama 0-shot and 20-shot values in Table 1, use the llama_k_shot/guacamol_analysis.py notebook. This analyzes pre-generated molecules which were produced with the gen_0_shot and gen_20_shot scripts.

To reproduce the SmileyLlama values in Table 1 and Figure S1, use the sft/guacamol_analysis.ipynb Jupyter notebook.

To reproduce the visualizations of properties in Figure 2, use the sft/distribution_vis.ipynb notebook.

To reproduce the SFT and DPO results in Table 2 and Figure 3b, use the prompt_following/prompt_following_analysis.ipynb notebook.

To reproduce Figure 3a, use the prompt_following/figure3a.ipynb notebook.

To reproduce Figure 4, use the mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb notebooks.

To visualize the interactions between selected generations and Mpro as in Figure 5, use the results from the Protein-Ligand Interaction Profiler (PLIP) in mpro/ligand_analysis/plip.
