FrontierAI

AI related notes on Frontier supercomputer at OLCF

Contents

FlashAttention

Installation

FA2 is supported on Frontier, and the upstream repo can be pip-installed:

    module load PrgEnv-gnu
    module load rocm
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3-latest-Linux-x86_64.sh
    bash ./Miniconda3-latest-Linux-x86_64.sh -b -p $WRKSPC/miniconda
    export PATH=$WRKSPC/miniconda/bin:$PATH
    conda create --prefix $WRKSPC/miniconda/envs/fa2-env -y
    source $WRKSPC/miniconda/etc/profile.d/conda.sh
    conda activate $WRKSPC/miniconda/envs/fa2-env
    git clone https://github.com/Dao-AILab/flash-attention
    pushd flash-attention
    git checkout v2.8.3
    pip install -e .
    popd
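A quick smoke test for the install (a minimal sketch, assuming the fa2-env environment above is active on a compute node with a GPU visible):

    import torch
    import flash_attn
    from flash_attn import flash_attn_func

    print(flash_attn.__version__)  # expect 2.8.3 for the checkout above

    # flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU.
    q, k, v = (torch.randn(2, 512, 16, 64, dtype=torch.float16, device="cuda")
               for _ in range(3))
    out = flash_attn_func(q, k, v, causal=True)
    print(out.shape)  # torch.Size([2, 512, 16, 64])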

For the latest development work, try AMD's fork (https://github.com/ROCm/flash-attention).

Backend

Standalone FA (see the selection sketch after these lists):

  • CK (default)
  • Triton

PyTorch scaled_dot_product_attention (SDPA):

  • Math
  • FA
  • Efficient
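For the standalone build, the CK backend is what a plain import gives you; per the upstream README for the ROCm build, the Triton backend is toggled with an environment variable that must be set before flash_attn is first imported (treat the exact variable name as an assumption if your version differs):

    import os

    # Default (unset): the Composable Kernel (CK) backend.
    # Setting this before the first `import flash_attn` switches to the Triton backend.
    os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"

    import flash_attn  # the smoke test above works unchanged with either backend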

Performance

  • For standalone FA, use the latest ROCm: the build against rocm/6.3 is 1.5x faster than the one against rocm/6.1 for certain inputs (see FA2).

  • For PyTorch SDPA, use the FA or Efficient backend rather than Math (see SDPA and the sketch below).
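A minimal sketch of forcing SDPA onto the FA/Efficient backends, assuming a PyTorch recent enough (2.3+) to provide torch.nn.attention.sdpa_kernel:

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    # SDPA expects (batch, nheads, seqlen, headdim).
    q = torch.randn(2, 16, 512, 64, dtype=torch.float16, device="cuda")
    k, v = torch.randn_like(q), torch.randn_like(q)

    # Allow only the FlashAttention and memory-efficient backends;
    # MATH is the slow reference path and is excluded here.
    with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([2, 16, 512, 64])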

X-MoE

Official X-MoE Code and Documentation

https://github.com/Supercomputing-System-AI-Lab/X-MoE

Preparing Conda Environment

    module reset
    module load cpe/24.11
    module load PrgEnv-gnu/8.6.0
    module load rocm/6.3.1
    module load craype-accel-amd-gfx90a
    module load miniforge3/23.11.0-0

    # cd to your directory of choice;
    # I recommend using /lustre/orion directories as these packages require lots of space.

    conda create -p $PWD/TORCH-ROCM6.3.1_env python=3.11 -c conda-forge -y
    source activate $PWD/TORCH-ROCM6.3.1_env

Installing Torch

    # ROCm 6.3.1
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3
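A quick check that the ROCm wheel sees the GPUs (run on a compute node; the login nodes have none):

    import torch

    print(torch.__version__)          # nightly +rocm build
    print(torch.version.hip)          # HIP/ROCm version the wheel targets
    print(torch.cuda.is_available())  # ROCm devices are exposed through the CUDA API
    print(torch.cuda.device_count())  # 8 GCDs per Frontier node
    print(torch.cuda.get_device_name(0))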

Installing APEX

    git clone https://github.com/ROCm/apex.git
    cd apex
    pip install -r requirements.txt
    python setup.py install --cpp_ext --cuda_ext
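To confirm the extensions actually built, a minimal check on a compute node (FusedLayerNorm is one of the kernels enabled by --cuda_ext):

    import torch
    from apex.normalization import FusedLayerNorm  # fails if the extensions did not build

    layer = FusedLayerNorm(1024).half().to("cuda")
    x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
    print(layer(x).shape)  # torch.Size([4, 1024])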

Installing FlashAttention

Didn't work with pip install, so installed from source instead:

    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention/
    python setup.py install
    pytest -q -s tests/test_flash_attn.py

Installing X-MoE

Worked as is.

    cd ~
    git clone https://github.com/Supercomputing-System-AI-Lab/X-MoE
    cd X-MoE
    git submodule update --init --recursive --remote

    pip install -e .
    cd Megatron-DeepSpeed-X-MoE && pip install -e .

Data Preparation

Recommend removing   from X-MoE's parent directory. Worked as is.

Maybe make the amount of data a variable so that we don't have to go through the entire dataset.

Training on Single Node

Didn't work on the login node; it looks like this depends on having a host list:

    scontrol: error: host list is empty
    first=
    ssh: Could not resolve hostname : Name or service not known
    MASTER_ADDR=

Getting an Interactive Node

    salloc -A PROJID -J RunSim123 -t 0:30:00 -p batch -N 1

Running the Training

    ./X-MoE-Small-node-1.sh 8 1

There is a Frontier PyTorch cpp_extension related problem: the "ninja -v" invocation needs to be replaced with

    ninja --version
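One way to apply that change is to patch the installed torch.utils.cpp_extension in place. A rough sketch, assuming the file still contains the literal ['ninja', '-v'] invocation (the exact string may differ across PyTorch versions, in which case edit the file by hand):

    # patch_ninja.py -- replace `ninja -v` with `ninja --version` in torch's cpp_extension
    import torch.utils.cpp_extension as cpp_ext

    path = cpp_ext.__file__
    with open(path) as f:
        src = f.read()

    patched = src.replace("['ninja', '-v']", "['ninja', '--version']")
    if patched == src:
        print(f"pattern not found in {path}; edit it by hand")
    else:
        with open(path, "w") as f:
            f.write(patched)
        print(f"patched {path}")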

Need to install mpi4py

    MPICC="cc -shared" pip install --no-cache-dir --no-binary=mpi4py mpi4py
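A minimal check that the Cray-MPICH-backed build works under srun (hypothetical file name hello_mpi.py):

    # hello_mpi.py -- each task reports its rank
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}",
          flush=True)

Run it inside the allocation with, e.g., srun -N 1 -n 8 python hello_mpi.py.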

Replaced torchrun with srun

    time srun -u -N 1 -n${NUM_GPUS} -c2 --ntasks-per-node=8 \
        --gpus-per-node=8 --gpu-bind=closest python pretrain_gpt_deepspeed.py ...
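With srun in place of torchrun, the usual RANK/WORLD_SIZE/LOCAL_RANK environment isn't set by a launcher, so the training script (or a small wrapper) has to derive them from SLURM before torch.distributed is initialized. A single-node sketch of that mapping (the X-MoE launch scripts may already handle this; the choices below are the common convention, not taken from the repo):

    import os
    import torch
    import torch.distributed as dist

    # Map SLURM's per-task environment onto what torch.distributed expects.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # Single-node case: every task sees the same node name. For multi-node runs,
    # use the first host from `scontrol show hostnames $SLURM_NODELIST` instead.
    os.environ.setdefault("MASTER_ADDR", os.environ["SLURMD_NODENAME"])
    os.environ.setdefault("MASTER_PORT", "29500")

    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)  # RCCL on ROCm
    print(f"rank {rank}/{world_size} -> GPU {local_rank}", flush=True)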

Error Logs

    ./X-MoE-Small-node-1-srun.sh 8 1

srun_logs_with_jit_error.log

    ImportError: /lustre/orion/stf218/world-shared/sajal/testing_xmoe/X-MoE/Megatron-DeepSpeed-X-MoE/megatron/fused_kernels/build/scaled_upper_triang_masked_softmax_cuda.so: cannot open shared object file: No such file or directory

Using the Correct Compiler Versions

    export CXX=/opt/cray/pe/gcc-native/13/bin/g++
    export CC=/opt/cray/pe/gcc-native/13/bin/gcc
    export PATH=/opt/cray/pe/gcc-native/13/bin:$PATH
