[BUG] <Brief Description> #520

@doudouodong

Description

Checkpoint provides 80 classes but module expects 2. Reinitializing denoising class embed.
Training for 3000 steps...
Logging every 100 steps.
Validating every 1000 steps.
Saving checkpoints every 1000 steps.
Train Step 1/3000 | Train Loss: 30.2364
Train Step 100/3000 | Train Loss: 30.6223
Train Step 200/3000 | Train Loss: 24.7357
Train Step 300/3000 | Train Loss: 26.9152
Train Step 400/3000 | Train Loss: 30.5777
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1251, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 440, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 948, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1055) is killed by signal: Killed.