The primary function of this repository is to test a GPU machine on an academic/real-world task with a reasonable workload, using popular machine learning tools from the Natural Language Processing (NLP) field.
The script uses the following tools (a minimal sketch of how they fit together follows this list):
- accelerate>=1.10.1 -- A HuggingFace library that automatically places models/tensors on the best available device, or a device of your choosing, e.g. GPU/CPU/distributed/etc.
- datasets>=4.0.0 -- A HuggingFace library that provides access to a large collection of datasets.
- torch>=2.8.0 -- The PyTorch machine learning library, used by both accelerate and transformers.
- transformers>=4.56.1 -- A HuggingFace library that provides many, mainly Transformer-based, PyTorch models.
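As a rough illustration of how these four libraries fit together, here is a minimal sketch (not the actual main.py; the model and dataset names are the ones this README uses elsewhere):

```python
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# datasets fetches the data, transformers fetches the model and tokenizer,
# accelerate decides which device the model lives on, and torch (used
# internally by both) does the actual tensor maths.
dataset = load_dataset("google-research-datasets/paws", "labeled_final")
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-small-en-v1.5")

accelerator = Accelerator()         # picks the GPU if one is available, else the CPU
model = accelerator.prepare(model)  # moves the model onto that device
print(accelerator.device, dataset["train"].num_rows)
```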
The script used to test the GPU functionality can also test the following (a sketch of how these options map onto accelerate follows the list):
- CPU or GPU training.
- Mixed precision training: No mixed precision, BF16, and FP16.
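Both options map directly onto arguments of accelerate's Accelerator. Assuming the script passes its flags through more or less verbatim (an assumption, not something confirmed here), the mapping looks like:

```python
from accelerate import Accelerator

# cpu=True forces CPU-only training even when a GPU is present;
# mixed_precision is one of "no", "fp16" or "bf16".
accelerator = Accelerator(cpu=False, mixed_precision="bf16")
```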
To install the dependencies:
```
pip install -r requirements.txt
```
For development we use uv. To create a virtual env (with uv already installed):
```
uv sync
```
To lint and format:
```
make format
```
To run the lint checks:
```
make check
```
To export the requirements file:
```
uv export --no-dev --format requirements.txt > ./requirements.txt
```
There is also a dev container you can use (see the ./.devcontainer folder), which is set up for an NVIDIA GPU machine.
To run the script with the standard configuration (note that the first time it is run on a machine it will download a small, 34-million-parameter model and a small dataset):
```
python main.py
```
The output will look like:
```
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at BAAI/bge-small-en-v1.5 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:__main__:Training on device: cuda
INFO:__main__:Training with precision level/type: no
INFO:__main__:Training with 1 CPU processes
INFO:__main__:Training for 4 epochs
INFO:__main__:Training for 316 steps
INFO:__main__:-------------------------------------------------- | 78/316 [00:07<00:13, 17.10it/s]
INFO:__main__:epoch: 0 split: validation metric: accuracy value: 0.5576
INFO:__main__:epoch: 0 split: validation metric: f1 value: 0.0000
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.26it/s]
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 1 split: validation metric: accuracy value: 0.5587 | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 1 split: validation metric: f1 value: 0.0151
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.18it/s]
INFO:__main__:-------------------------------------------------- | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 2 split: validation metric: accuracy value: 0.5749
INFO:__main__:epoch: 2 split: validation metric: f1 value: 0.1912
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:08<00:00, 14.26it/s]
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 3 split: validation metric: accuracy value: 0.5810 | 0/125 [00:00<?, ?it/s]
INFO:__main__:epoch: 3 split: validation metric: f1 value: 0.3534
INFO:__main__:--------------------------------------------------
INFO:__main__:--------------------------------------------------
INFO:__main__:epoch: 3 split: test metric: accuracy value: 0.5744
INFO:__main__:epoch: 3 split: test metric: f1 value: 0.3596
INFO:__main__:--------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 316/316 [00:40<00:00, 7.89it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:05<00:00, 20.94it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [00:03<00:00, 41.54it/s]
```
The script has a few configuration options, which are explained through the help function:
```
python main.py --help

Usage: main.py [OPTIONS]
Trains the base model on ~10% of the Google Research PAWS dataset. The dataset is a paraphrase dataset that contains 66,000 samples of which we only use ~5,000
for training and keep the original validation and test sets that contain 8,000 samples each. The dataset can be found here:
https://huggingface.co/datasets/google-research-datasets/paws and the paper here: https://arxiv.org/pdf/1904.01130. We are going to measure the performance using
accuracy and F1.
It takes at least 8-15 epochs to start getting F1 scores > 0.6 with the default configuration.
NOTE: If you use the CPU it will be very slow. NOTE: This script will download the model, its tokenizer, and the data required to train the model.
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --base-model TEXT Base model to train. The default (BAAI/bge-small-en-v1.5) model is only 34 million parameters. │
│ [default: BAAI/bge-small-en-v1.5] │
│ --verbose --no-verbose If True logging will be verbose [default: no-verbose] │
│ --cpu --no-cpu If True train on CPU only even if GPU is available [default: no-cpu] │
│ --mixed-precision [no|fp16|bf16] Whether to use mixed precision training [default: no] │
│ --epochs INTEGER Number of epochs to train for [default: 4] │
│ --batch-size INTEGER Batch size of the model for training and evaluating [default: 64] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
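For example, to run a longer BF16 mixed-precision run with a smaller batch size (the values here are purely illustrative):

```
python main.py --mixed-precision bf16 --epochs 16 --batch-size 32
```

Per the help text, expect to need at least 8-15 epochs before F1 climbs above 0.6.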
To install and run from a fresh clone:
- Clone this repository:
```
git clone https://github.com/UCREL/gpu_test.git
```
- Move to that directory:
```
cd gpu_test
```
- Create a virtual env:
```
python -m venv ./venv
```
- Source the environment:
```
source ./venv/bin/activate
```
- Download the python packages:
```
pip install -r requirements.txt
```
To run a quick test on a GPU machine, on either the a2000 or a5000 GPUs:
a2000:
```
srun --partition=a2000-5m python main.py
```
a5000:
```
srun --partition=a5000-5m python main.py
```
If the model you are using makes use of torch.compile, then additional dependencies need to be installed on HEX for this to work.
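For reference, torch.compile is the PyTorch API that just-in-time compiles a model into optimised kernels; a model "uses torch.compile" when its modelling code does something along these lines (an illustrative sketch, not code from this repository):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("jhu-clsp/ettin-encoder-17m")
# Compilation happens here; it needs a working compiler toolchain on the
# host, which is why extra dependencies are required on HEX.
model = torch.compile(model)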
To test whether all GPU nodes on HEX support this, run the command below. It runs the main.py code with the model jhu-clsp/ettin-encoder-17m, which utilises torch.compile, across 20 array jobs, hopefully covering all available GPUs from the a2000 and a5000 partitions of HEX (only the 5m partitions are used). Each node's log is saved to the ./log/gpu_test_%A folder, where %A is the array job ID; you can then inspect every node's log with cat ./log/gpu_test_%A/*. If torch.compile works on HEX, each job should finish without error; if it does not, the job will have failed with an error.
```
sbatch batch_test_all_nodes.sh
```
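Based on the description above, such a batch script has roughly the following shape (a sketch only; the directives, log handling, and any module loads in the real batch_test_all_nodes.sh may differ):

```bash
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --array=0-19                       # 20 array jobs
#SBATCH --partition=a2000-5m,a5000-5m      # only the 5m partitions
#SBATCH --output=./log/gpu_test_%A/%a.log  # %A = array job ID, %a = task index

# NOTE: Slurm does not create output directories itself, so the real script
# presumably ensures ./log/gpu_test_<array job ID> exists first.
python main.py --base-model jhu-clsp/ettin-encoder-17m
```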