
Assign GPU to MPI rank #261

@uvilla


Currently, TPS assigns a GPU to each MPI rank as follows (see here):

device_id = mpi_rank % numGpusPerRank

where numGpusPerRank is set from the .ini file.
The default value of this variable is 1; see here. None of the *.ini input files in our testsuite changes the default value, so I am assuming all local jobs are running on a single GPU.
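
For reference, a minimal sketch of the current assignment logic (the MPI call is standard; numGpusPerRank and the HIP device call stand in for TPS's actual config handling and whichever backend TPS is built against):

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>

// Sketch of the current policy: each rank picks a device by rank modulo the
// configured GPU count (default 1, so every rank lands on device 0).
void assignDeviceCurrent(int numGpusPerRank /* from the .ini file, default 1 */) {
  int mpi_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

  int device_id = mpi_rank % numGpusPerRank;
  hipSetDevice(device_id);  // or cudaSetDevice(device_id) on NVIDIA builds
}
```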

This makes TPS hard to port across different clusters and local machines. Some schedulers (e.g. those on TACC) make all GPUs on a node visible to every task on that node, while others (e.g. flux) restrict which GPUs each task can see (through variables such as ROCR_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, or CUDA_VISIBLE_DEVICES).

I propose a more flexible way to handle this by introducing the command-line argument --gpu-affinity (shorthand -ga).

Three affinity policies will be available (a sketch of how each resolves to a device id follows the list):

  • default: Set the deviceID to 0. This is appropriate for local machines with a single GPU, or when the scheduler restricts which devices are visible to each task (as flux does).

  • direct (the default policy): Set the deviceID equal to the MPI rank. This is appropriate on a single node (local or on a cluster) when the number of MPI tasks is less than or equal to the number of GPUs.

  • env-localid: The device id is read from an environment variable named via --localid-varname. Many schedulers export a variable that gives a node-local numbering of the tasks running on that node. In slurm this variable is called SLURM_LOCALID; in flux it is FLUX_TASK_LOCAL_ID. See also: https://docs.nersc.gov/jobs/affinity/#gpus
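
A minimal sketch (not TPS's actual code) of how the proposed policies could resolve to a device id; the enum, function name, and error handling are illustrative assumptions:

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <mpi.h>

enum class GpuAffinity { Default, Direct, EnvLocalId };

// Hypothetical helper: map the selected --gpu-affinity policy to a device id.
int resolveDeviceId(GpuAffinity policy, const std::string &localidVarname) {
  switch (policy) {
    case GpuAffinity::Default:
      // The scheduler (or a single-GPU machine) already restricts visibility.
      return 0;
    case GpuAffinity::Direct: {
      // One task per GPU on a single node: device id == MPI rank.
      int mpi_rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
      return mpi_rank;
    }
    case GpuAffinity::EnvLocalId: {
      // Node-local task index exported by the scheduler, e.g. SLURM_LOCALID
      // or FLUX_TASK_LOCAL_ID, selected via --localid-varname.
      const char *localid = std::getenv(localidVarname.c_str());
      if (!localid)
        throw std::runtime_error("local-id variable not set: " + localidVarname);
      return std::stoi(localid);
    }
  }
  return 0;  // unreachable
}
```

With env-localid, the scheduler's node-local task index drives device selection, so the same binary works under slurm or flux by changing only --localid-varname.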
