GPU-Test

The primary function of this repository is to test a GPU machine on an academic, more real-world task with a reasonable workload, using popular machine learning tools from the Natural Language Processing (NLP) field.

The tools the script uses are the following:

  • accelerate>=1.10.1 -- A HuggingFace library that automatically puts models/tensors on to the best available device or a device of your choosing, e.g. GPU/CPU/Distributed/etc.
  • datasets>=4.0.0 -- A HuggingFace library that provides access to a large number of datasets.
  • torch>=2.8.0 -- Machine learning library that is used by accelerate and transformers.
  • transformers>=4.56.1 -- A HuggingFace library that contains many PyTorch models, most of them transformer based.

The script that is used to test the GPU functionality can also test for:

  • CPU or GPU training.
  • Mixed precision training: No mixed precision, BF16, and FP16.
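
To make the difference between the FP16 and BF16 options more concrete, the sketch below rounds a Python float to each format using only the standard library. It is an illustration of the number formats, not of what the script or accelerate does internally. The helper names `to_fp16` and `to_bf16` are hypothetical, and the BF16 conversion truncates rather than rounds to nearest, which is a simplification.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (FP16) and back to a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the FP16 range (largest finite value 65504).
        # Report them as infinity, similar to an overflowing FP16 computation.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate BF16 by keeping the top 16 bits of the float32 encoding.

    Simplification: this truncates the low bits instead of rounding to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (0.1, 70000.0):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

FP16 keeps more mantissa bits, so it represents 0.1 more closely, but it overflows above 65504. BF16 is coarser but keeps the exponent range of float32, so 70000.0 stays finite.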

Installation

pip install -r requirements.txt

Development

For development we use uv. With uv already installed, create the virtual environment with:

uv sync

To format the code:

make format

To lint check:

make check

To export the requirements file:

uv export --no-dev --format requirements.txt > ./requirements.txt

Dev container

There is also a dev container you can use, see the ./.devcontainer folder, which is set up for an NVIDIA GPU machine.

Run GPU test

To run it with the standard configuration, use the command below. Note that the first time this is run on a machine it will download a small model (34 million parameters) and a small dataset:

python main.py

The output will look like:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at BAAI/bge-small-en-v1.5 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:__main__:Training on device: cuda
INFO:__main__:Training with precision level/type: no
INFO:__main__:Training with 1 CPU processes
INFO:__main__:Training for 4 epochs
INFO:__main__:Training for 316 steps
INFO:__main__:--------------------------------------------------                                                                  | 78/316 [00:07<00:13, 17.10it/s]
INFO:__main__:epoch: 0 split: validation metric: accuracy value: 0.5576                                                                                            
INFO:__main__:epoch: 0 split: validation metric: f1 value: 0.0000
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.26it/s]
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 1 split: validation metric: accuracy value: 0.5587                                                                    | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 1 split: validation metric: f1 value: 0.0151                                                                                                  
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.18it/s]
INFO:__main__:--------------------------------------------------                                                                           | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 2 split: validation metric: accuracy value: 0.5749                                                                                            
INFO:__main__:epoch: 2 split: validation metric: f1 value: 0.1912
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.26it/s]
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 3 split: validation metric: accuracy value: 0.5810                                                                    | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 3 split: validation metric: f1 value: 0.3534                                                                                                  
INFO:__main__:--------------------------------------------------
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 3 split: test metric: accuracy value: 0.5744                                                                                                  
INFO:__main__:epoch: 3 split: test metric: f1 value: 0.3596
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 316/316 [00:40<00:00,  7.89it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:05<00:00, 20.94it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:03<00:00, 41.54it/s]
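
If you want to compare runs, the metric lines in this output can be pulled out with a small parser. The sketch below is not part of the repository. It assumes the log format shown above, and `parse_metrics` is a hypothetical helper name. It searches within each line, so progress bar residue on the same line is ignored.

```python
import re

# Matches the metric part of lines such as:
# INFO:__main__:epoch: 3 split: test metric: f1 value: 0.3596
METRIC_LINE = re.compile(
    r"epoch: (?P<epoch>\d+) split: (?P<split>\w+) "
    r"metric: (?P<metric>\w+) value: (?P<value>\d+\.\d+)"
)


def parse_metrics(log_text: str) -> list[dict]:
    """Return one record per metric line found in the log text."""
    records = []
    for match in METRIC_LINE.finditer(log_text):
        records.append(
            {
                "epoch": int(match["epoch"]),
                "split": match["split"],
                "metric": match["metric"],
                "value": float(match["value"]),
            }
        )
    return records


if __name__ == "__main__":
    sample = (
        "INFO:__main__:epoch: 3 split: test metric: accuracy value: 0.5744\n"
        "INFO:__main__:epoch: 3 split: test metric: f1 value: 0.3596\n"
    )
    for record in parse_metrics(sample):
        print(record)
```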

The script has a few configurations that are explained through the help function:

python main.py --help
 Usage: main.py [OPTIONS]                                                                                                                                          
                                                                                                                                                                   
 Trains the base model on ~10% of the Google Research PAWS dataset. The dataset is a paraphrase dataset that contains 66,000 samples of which we only use ~5,000   
 for training and keep the original validation and test sets that contain 8,000 samples each. The dataset can be found here:                                       
 https://huggingface.co/datasets/google-research-datasets/paws and the paper here: https://arxiv.org/pdf/1904.01130. We are going to measure the performance using 
 accuracy and F1.                                                                                                                                                  
                                                                                                                                                                   
 It takes at least 8-15 epochs to start getting F1 scores > 0.6 with the default configuration.                                                                    
 NOTE: If you use the CPU it will be very slow. NOTE: This script will download the model, it's tokenizer, and the data  required to train the model.              
                                                                                                                                                                   
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --base-model                         TEXT            Base model to train. The default (BAAI/bge-small-en-v1.5) model is only 34 million parameters.             │
│                                                      [default: BAAI/bge-small-en-v1.5]                                                                          │
│ --verbose            --no-verbose                    If True logging will be verbose [default: no-verbose]                                                      │
│ --cpu                --no-cpu                        If True train on CPU only even if GPU is available [default: no-cpu]                                       │
│ --mixed-precision                    [no|fp16|bf16]  Whether to use mixed precision training [default: no]                                                      │
│ --epochs                             INTEGER         Number of epochs to train for [default: 4]                                                                 │
│ --batch-size                         INTEGER         Batch size of the model for training and evaluating [default: 64]                                          │
│ --help                                               Show this message and exit.                                                                                │
╰─────────────────────────
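
The step counts in the example output are consistent with the dataset sizes given in the help text. The sketch below is a rough check. It assumes exactly 5,000 training samples (the help text only says ~5,000) and that the last partial batch is kept; `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    batch_size = 64       # default --batch-size
    epochs = 4            # default --epochs
    train_samples = 5000  # assumption: the help text says "~5,000"
    eval_samples = 8000   # validation and test set size from the help text

    # Total training steps, compare with "Training for 316 steps" in the output.
    print(steps_per_epoch(train_samples, batch_size) * epochs)  # 316
    # Evaluation steps per split, compare with the 125/125 progress bars.
    print(steps_per_epoch(eval_samples, batch_size))  # 125
```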

Run GPU test SLURM

First time

  1. Clone this repository: git clone https://github.com/UCREL/gpu_test.git
  2. Move to that directory: cd gpu_test
  3. Create a virtual env: python -m venv ./venv
  4. Source the environment: source ./venv/bin/activate
  5. Download the python packages: pip install -r requirements.txt

Normal usage after first time

To run the test on a GPU machine with either the a2000 or a5000 GPUs:

a2000:

srun --partition=a2000-5m python main.py

a5000:

srun --partition=a5000-5m python main.py

Test for torch compilation

If the model you are using uses torch.compile then additional dependencies need to be installed on HEX for this to work.

To test whether all GPU nodes on HEX support this, run the command below. It runs the main.py Python code with the model jhu-clsp/ettin-encoder-17m, which utilises torch.compile, as 20 array jobs, which should hopefully use all available GPUs from the a2000 and a5000 partitions of HEX (only the 5m partitions are used). The logs from all nodes are saved to the ./log/gpu_test_%A folder, where %A is the array job ID, and you can view each node's log with cat ./log/gpu_test_%A/*. If torch.compile works on HEX, each job should finish without error; if it does not, the job will fail with an error.

sbatch batch_test_all_nodes.sh
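
With 20 log files, reading each one with cat can be tedious. The sketch below flags the logs that look like failures. It is not part of the repository. It assumes that the log folder contains plain text files and that a failed Python job leaves a traceback in its log; `find_failed_logs` is a hypothetical helper name.

```python
import sys
from pathlib import Path

# Assumption: a failed Python job leaves a traceback in its log file.
ERROR_MARKERS = ("Traceback (most recent call last)",)


def find_failed_logs(log_dir: str) -> list[str]:
    """Return the names of log files in log_dir that contain an error marker."""
    failed = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        if any(marker in text for marker in ERROR_MARKERS):
            failed.append(path.name)
    return failed


if __name__ == "__main__":
    # Pass the log folder of one array job as the only argument.
    if len(sys.argv) == 2:
        for name in find_failed_logs(sys.argv[1]):
            print("possible failure:", name)
```

An empty result only means that no traceback was found, so it is still worth checking that every job produced a log.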
