Description
@henry090: I am trying to train an xse_resnet50.
During training I got the following error:
R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
It is not a simple out-of-memory error; it seems to be some kind of memory leak related to magma, similar to the one reported here.
However, I did not find any mention of this bug occurring with fastai, which I would have expected if it occurred regularly, except for a message in this thread: https://forums.fast.ai/t/a-walk-with-fastai2-vision-study-group-and-online-lectures-megathread/59929/1293
The only “new” thing I am doing is that I am encapsulating most of my code for training the model in a try/except block in a while loop.
I wonder whether the memory leak is somehow due to wrapping the code in a function, or to reticulate.
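For reference, the try/except-in-a-while-loop pattern mentioned above looks roughly like the sketch below. It is written in Python rather than R, and it is not the code from the Kaggle notebook: `train_with_retries`, `train_once`, `max_attempts` and `cleanup` are hypothetical names used only for illustration. The default cleanup is just `gc.collect`; in a real GPU run one might also call `torch.cuda.empty_cache()` there, which is an assumption about the setup.

```python
import gc


def train_with_retries(train_once, max_attempts=3, cleanup=gc.collect):
    """Call train_once() in a while loop until it succeeds.

    Hypothetical helper: train_once stands in for the real training code.
    Returns (result, list_of_error_messages_from_failed_attempts).
    """
    attempt = 0
    errors = []
    while attempt < max_attempts:
        attempt += 1
        try:
            return train_once(), errors
        except RuntimeError as exc:
            # Keep only the message, not the exception object, so the
            # traceback does not keep the failed attempt's objects alive.
            errors.append(str(exc))
        # The except block has ended, so the exception is released here
        # and cleanup can free what the failed attempt left behind.
        cleanup()
    raise RuntimeError(
        f"training failed after {max_attempts} attempts: {errors}"
    )
```

Note that a loop like this only catches errors raised as exceptions. A failed assertion inside a native library, like the magma one above, would likely abort the whole process before the except branch runs.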
Link to the code and error: https://www.kaggle.com/cdk292/magma-error-xse-resnext50-with-r?scriptVersionId=50229515
The latest version is still compiling, but you can see the error in the execution log of version 4, and it will probably show up again in V6.
PS: Merry Christmas.