Some of the validators, including the test validator, intermittently hit CUDA out-of-memory (OOM) errors. Example run logs:
https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro
My initial hypothesis is that tensors are accumulating in GPU memory until they reach the device limit. Since a validator is expected to run for days, it would be good to identify potential improvements to GPU memory management so that we avoid reaching the OOM point.
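One way to test this hypothesis is to sample GPU memory once per validator step and check whether it keeps climbing. The snippet below is only a minimal sketch: `MemoryGrowthTracker` and `gpu_bytes_allocated` are hypothetical helpers, not part of the validator code. It assumes the validator uses PyTorch, and relies on `torch.cuda.memory_allocated()` for the readings.

```python
from typing import Callable, List

try:
    import torch
except ImportError:  # allows importing the helper on machines without PyTorch
    torch = None


def gpu_bytes_allocated() -> int:
    """Bytes currently occupied by tensors on the current CUDA device (0 if no GPU)."""
    if torch is None or not torch.cuda.is_available():
        return 0
    return torch.cuda.memory_allocated()


class MemoryGrowthTracker:
    """Hypothetical helper: stores one memory sample per step and flags steady growth."""

    def __init__(self, sampler: Callable[[], int] = gpu_bytes_allocated, window: int = 5):
        self.sampler = sampler      # function returning the current memory usage in bytes
        self.window = window        # number of consecutive samples to inspect
        self.samples: List[int] = []

    def record(self) -> int:
        """Take one sample, store it, and return it."""
        value = self.sampler()
        self.samples.append(value)
        return value

    def is_growing(self) -> bool:
        """True if the last `window` samples are strictly increasing."""
        recent = self.samples[-self.window:]
        if len(recent) < self.window:
            return False
        return all(a < b for a, b in zip(recent, recent[1:]))
```

Calling `tracker.record()` at the end of each step and logging a warning when `tracker.is_growing()` returns `True` would show whether memory rises step after step. If it does, a common cause in PyTorch code is keeping references to GPU tensors across steps (for example appending losses or rewards to a list); storing `.item()` or `.detach().cpu()` copies instead is a typical remedy, though whether that applies here is still to be confirmed.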