Error related to magma during training #101

@Cdk29

@henry090: I am trying to train an xse_resnet50.

During training I got the following error:

R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

It is not a simple out-of-memory error; it seems to be some kind of memory leak related to magma, similar to the one reported here.

However, I did not find any mention of this bug occurring with fastai, which I would have expected if it occurred regularly, except for this message in this thread: https://forums.fast.ai/t/a-walk-with-fastai2-vision-study-group-and-online-lectures-megathread/59929/1293

The only “new” thing I am doing is that I am encapsulating most of my code for training the model in a try/except block in a while loop.
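For readers who want to picture the pattern described above, here is a minimal Python sketch of a retry loop around a training call. It is an illustration under assumptions, not the code from the linked notebook: `train_with_retries` and `train_once` are hypothetical names, and the cleanup step (`gc.collect()` followed by `torch.cuda.empty_cache()` when PyTorch is importable) is one common way to release cached GPU memory between attempts, not something the issue confirms was used.

```python
import gc


def train_with_retries(train_once, max_attempts=3):
    """Call train_once() until it succeeds or max_attempts is reached.

    train_once is a hypothetical zero-argument callable that builds the
    learner and runs the training; it is not part of fastai.
    """
    attempt = 0
    last_message = None
    while attempt < max_attempts:
        attempt += 1
        try:
            return train_once()
        except RuntimeError as err:
            # Keep only the message: holding the exception object would keep
            # its traceback (and any tensors in those frames) alive.
            last_message = repr(err)
        # The exception is out of scope here, so garbage collection can
        # reclaim objects that were only referenced by the failed attempt.
        gc.collect()
        try:
            import torch  # optional: only used when PyTorch is installed
            if torch.cuda.is_available():
                # Return cached, unused GPU blocks to the driver.
                torch.cuda.empty_cache()
        except ImportError:
            pass
    raise RuntimeError(
        f"training failed after {max_attempts} attempts: {last_message}"
    )
```

Note that a loop like this only helps with errors raised as exceptions, such as a CUDA out-of-memory `RuntimeError`. A failed C-level assertion like the one in the log above aborts the whole process, so no try/except block gets a chance to handle it.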

I wonder whether the memory leak is somehow due to using a function as a wrapper, or to reticulate.
Link to the code and error: https://www.kaggle.com/cdk292/magma-error-xse-resnext50-with-r?scriptVersionId=50229515

The latest version is still compiling, but you can see the error in the execution log of version 4, and it will probably show up again in V6.

PS: Merry Christmas.

Labels: bug
