Hi
Thanks for the package - I can get the CPU Docker image to run fine, but the GPU Docker image crashes out. I'm using Singularity on a GPU server:
singularity pull deepgs.sif docker://malab/deepgs_gpu
salloc -p gpuq-dev --gres=gpu:1 -t 60:00
srun singularity exec --nv deepgs.sif nvidia-smi # works fine, prints the nvidia GPU info panel
srun singularity exec --nv deepgs.sif R --vanilla
library(DeepGS)
data(wheat_example)
Markers <- wheat_example$Markers
y <- wheat_example$y
cvSampleList <- cvSampleIndex(length(y),10,1)
# cross validation set
cvIdx <- 1
trainIdx <- cvSampleList[[cvIdx]]$trainIdx
testIdx <- cvSampleList[[cvIdx]]$testIdx
trainMat <- Markers[trainIdx,]
trainPheno <- y[trainIdx]
validIdx <- sample(1:length(trainIdx),floor(length(trainIdx)*0.1))
validMat <- trainMat[validIdx,]
validPheno <- trainPheno[validIdx]
trainMat <- trainMat[-validIdx,]
trainPheno <- trainPheno[-validIdx]
conv_kernel <- c("1*18") ## convolution kernels (filter shape)
conv_stride <- c("1*1")
conv_num_filter <- c(8) ## number of filters
pool_act_type <- c("relu") ## activation function applied before pooling
pool_type <- c("max") ## pooling type
pool_kernel <- c("1*4") ## pooling kernel shape
pool_stride <- c("1*4") ## pooling stride
fullayer_num_hidden <- c(32,1)
fullayer_act_type <- c("sigmoid")
drop_float <- c(0.2,0.1,0.05)
cnnFrame <- list(conv_kernel =conv_kernel,conv_num_filter = conv_num_filter,
conv_stride = conv_stride,pool_act_type = pool_act_type,
pool_type = pool_type,pool_kernel =pool_kernel,
pool_stride = pool_stride,fullayer_num_hidden= fullayer_num_hidden,
fullayer_act_type = fullayer_act_type,drop_float = drop_float)
markerImage = paste0("1*",ncol(trainMat))
trainGSmodel <- train_deepGSModel(trainMat = trainMat,trainPheno = trainPheno,
validMat = validMat,validPheno = validPheno, markerImage = markerImage,
cnnFrame = cnnFrame,device_type = "cpu",gpuNum = 1, eval_metric = "mae",
num_round = 6000,array_batch_size= 30,learning_rate = 0.01,
momentum = 0.5,wd = 0.00001, randomseeds = 0,initializer_idx = 0.01,
verbose = TRUE)
Then this error happens:
Loading required namespace: mxnet
Failed with error: ‘.onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short’
Loading required package: mxnet
Error: package or namespace load failed for ‘mxnet’:
 .onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
Error in train_deepGSModel(trainMat = trainMat, trainPheno = trainPheno, : object 'mx.metric.mae' not found
Execution halted
I checked: the file /usr/lib/x86_64-linux-gnu/libcuda.so.1 comes from the Docker image. It is a symbolic link to /usr/lib/x86_64-linux-gnu/libcuda.so.390.59, and that file has size 0. Do you have the Dockerfile anywhere? I think I could fix it with that.
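For anyone who wants to confirm the same problem, here is a minimal sketch of that check. The function name is_empty_lib is made up for illustration, and it assumes readlink -f is available (GNU coreutils).

```shell
# Hypothetical helper: succeeds (exit 0) if the path, after following
# symlinks, points to a regular file with size 0.
is_empty_lib() {
    target=$(readlink -f "$1") || return 1
    [ -f "$target" ] && [ ! -s "$target" ]
}

# Possible usage from inside the container (path taken from the error above):
#   is_empty_lib /usr/lib/x86_64-linux-gnu/libcuda.so.1 && echo "libcuda is empty"
# A quick manual alternative is to list the dereferenced file and read its size:
#   singularity exec deepgs.sif ls -lL /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

A size of 0 in the ls -lL output would match the "file too short" message from dyn.load.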
Edit: Minimum example for me:
srun --pty singularity exec --nv deepgs.sif R --vanilla
> library(mxnet)
Error: package or namespace load failed for ‘mxnet’:
 .onLoad failed in loadNamespace() for 'mxnet', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so':
  /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
Edit 2: Marco De La Pierre from the Pawsey Supercomputing Centre gave me a great workaround:
Get a GPU:
salloc -p gpuq -t 1:00:00 --gres=gpu:1
Make a fake home:
mkdir fake_home
Then manually give the system's libcuda locations to the Singularity/Docker image:
srun --pty singularity exec --nv -B fake_home:/home/data -B /usr/lib64/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so,/usr/lib64/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1,/usr/lib64/libcuda.so.440.33.01:/usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01 deepgs_gpu.sif R --vanilla
Then things like library(mxnet) work fine, because R loads the system's libcuda instead of the empty one shipped in the Docker image. This should work for others too. I reckon people using Docker rather than Singularity need to use -v instead of -B.
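The driver version in the file name (440.33.01 above) will differ between systems, so the bind list can be built from whatever libcuda files the host has. This is only a sketch: cuda_bind_args is a hypothetical helper, and the two directories are the ones from the command above, which may not match other clusters.

```shell
# Hypothetical helper: print a comma-separated list of src:dest bind pairs,
# one for every libcuda.so* file found in the host directory.
cuda_bind_args() {
    host_dir=$1     # where the host keeps libcuda, e.g. /usr/lib64 (assumption)
    image_dir=$2    # where the image expects it, e.g. /usr/lib/x86_64-linux-gnu
    binds=""
    for lib in "$host_dir"/libcuda.so*; do
        [ -e "$lib" ] || continue    # skip if the glob matched nothing
        name=$(basename "$lib")
        binds="${binds:+$binds,}$lib:$image_dir/$name"
    done
    printf '%s\n' "$binds"
}

# Possible usage, mirroring the workaround command:
#   srun --pty singularity exec --nv -B fake_home:/home/data \
#       -B "$(cuda_bind_args /usr/lib64 /usr/lib/x86_64-linux-gnu)" \
#       deepgs_gpu.sif R --vanilla
```

If the function prints an empty line, no libcuda files were found in the host directory and a different path needs to be passed.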