
Problem with GPU docker image - empty libcuda? #12

@philippbayer


Hi

Thanks for the package - I can get the CPU Docker image to run fine, but the GPU Docker image crashes out. I'm using Singularity on a GPU server:

singularity pull deepgs.sif docker://malab/deepgs_gpu
salloc -p gpuq-dev --gres=gpu:1 -t 60:00
srun singularity exec --nv deepgs.sif nvidia-smi # works fine, prints the nvidia GPU info panel
srun singularity exec --nv deepgs.sif R --vanilla
library(DeepGS)
data(wheat_example)
Markers <- wheat_example$Markers
y <- wheat_example$y
cvSampleList <- cvSampleIndex(length(y),10,1)
# cross validation set
cvIdx <- 1
trainIdx <- cvSampleList[[cvIdx]]$trainIdx
testIdx <- cvSampleList[[cvIdx]]$testIdx
trainMat <- Markers[trainIdx,]
trainPheno <- y[trainIdx]
validIdx <- sample(1:length(trainIdx),floor(length(trainIdx)*0.1))
validMat <- trainMat[validIdx,]
validPheno <- trainPheno[validIdx]
trainMat <- trainMat[-validIdx,]
trainPheno <- trainPheno[-validIdx]
conv_kernel <- c("1*18") ## convolution kernel (filter shape)
conv_stride <- c("1*1")
conv_num_filter <- c(8)  ## number of filters
pool_act_type <- c("relu") ## activation function for the next pooling layer
pool_type <- c("max") ## pooling type
pool_kernel <- c("1*4") ## pooling kernel shape
pool_stride <- c("1*4") ## pooling stride
fullayer_num_hidden <- c(32,1)
fullayer_act_type <- c("sigmoid")
drop_float <- c(0.2,0.1,0.05)
cnnFrame <- list(conv_kernel =conv_kernel,conv_num_filter = conv_num_filter,
                 conv_stride = conv_stride,pool_act_type = pool_act_type,
                 pool_type = pool_type,pool_kernel =pool_kernel,
                 pool_stride = pool_stride,fullayer_num_hidden= fullayer_num_hidden,
                 fullayer_act_type = fullayer_act_type,drop_float = drop_float)

markerImage = paste0("1*",ncol(trainMat))
trainGSmodel <- train_deepGSModel(trainMat = trainMat,trainPheno = trainPheno,
                validMat = validMat,validPheno = validPheno, markerImage = markerImage,
                cnnFrame = cnnFrame,device_type = "cpu",gpuNum = 1, eval_metric = "mae",
                num_round = 6000,array_batch_size= 30,learning_rate = 0.01,
                momentum = 0.5,wd = 0.00001, randomseeds = 0,initializer_idx = 0.01,
                verbose = TRUE)

The last call then fails with this error:

Loading required namespace: mxnet
Failed with error:  ‘.onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short’
Loading required package: mxnet
Error: package or namespace load failed for ‘mxnet’:
 .onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
Error in train_deepGSModel(trainMat = trainMat, trainPheno = trainPheno,  :
  object 'mx.metric.mae' not found
Execution halted

I checked: /usr/lib/x86_64-linux-gnu/libcuda.so.1 comes from the Docker image. It is a symbolic link to /usr/lib/x86_64-linux-gnu/libcuda.so.390.59, and that file has size 0. Is the Dockerfile available anywhere? I think I could fix the problem with it.
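For anyone who wants to repeat that check, here is a minimal sketch. The helper name check_lib is hypothetical, and it assumes a Linux shell with GNU readlink available inside the container:

```shell
#!/bin/sh
# Report whether a shared library path resolves to a non-empty file.
# Prints "ok <target>", "empty <target>", or "missing <path>".
# check_lib is a hypothetical helper name, not part of DeepGS or Singularity.
check_lib() {
    lib="$1"
    target=$(readlink -f "$lib")   # follow symlinks to the real file
    if [ ! -e "$target" ]; then
        echo "missing $lib"
    elif [ ! -s "$target" ]; then  # -s is true only for files larger than 0 bytes
        echo "empty $target"
    else
        echo "ok $target"
    fi
}
```

Running check_lib /usr/lib/x86_64-linux-gnu/libcuda.so.1 in a shell inside the image should report "empty" for the broken link described above.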

Edit: Minimum example for me:

srun --pty singularity exec --nv deepgs.sif R --vanilla
> library(mxnet)
Error: package or namespace load failed for ‘mxnet’:
 .onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short

Edit 2: Marco De La Pierre from the Pawsey Supercomputing Centre gave me a great workaround:

Get a GPU:
salloc -p gpuq -t 1:00:00 --gres=gpu:1

Make a fake home:
mkdir fake_home

Then manually give the system's libcuda locations to the Singularity/Docker image:

srun --pty singularity exec --nv \
    -B fake_home:/home/data \
    -B /usr/lib64/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so \
    -B /usr/lib64/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 \
    -B /usr/lib64/libcuda.so.440.33.01:/usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01 \
    deepgs_gpu.sif R --vanilla

After that, library(mxnet) and similar calls work, because the container uses the system's libcuda instead of the empty one shipped in the Docker image. This should work for others too. I reckon people running Docker directly would need -v instead of Singularity's -B.
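The driver version in the file name (440.33.01 here) will differ between systems, so the bind list could be generated rather than typed by hand. This is only a sketch: libcuda_binds is a hypothetical helper, and the host directory /usr/lib64 and container directory /usr/lib/x86_64-linux-gnu are taken from the command above and may be different on your machine.

```shell
#!/bin/sh
# Print one "-B host:container" flag per libcuda file found in a host directory.
# libcuda_binds is a hypothetical helper; both directories are assumptions
# that must match your host system and your image.
libcuda_binds() {
    host_dir="$1"
    container_dir="$2"
    for f in "$host_dir"/libcuda.so*; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        printf '%s %s:%s/%s\n' "-B" "$f" "$container_dir" "$(basename "$f")"
    done
}
```

It could then be used like this, relying on word splitting and therefore on paths without spaces:

srun --pty singularity exec --nv -B fake_home:/home/data $(libcuda_binds /usr/lib64 /usr/lib/x86_64-linux-gnu) deepgs_gpu.sif R --vanilla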
