Can't resume training from a checkpoint in distributed mode #146

My dataset consists of 8,000 grayscale images of size 256 × 256. The following is my training script:

```
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000 \
                --noise_schedule cosine \
                --use_kl True"

TRAIN_FLAGS="--lr 1e-4 --batch_size 8"
export OPENAI_LOGDIR=XXXX

# Export so the setting reaches the processes started by mpiexec
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8

MASTER_PORT=$(python -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(('',0)); print(s.getsockname()[1]); s.close()")
export MASTER_ADDR=localhost
export MASTER_PORT=$MASTER_PORT  

NUM_GPUS="2"
mpiexec -n $NUM_GPUS python image_train.py --data_dir ./data/XXX $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS --resume_checkpoint ./training_log/CREMI/model039000.pt
```
Strangely, when I do not specify a checkpoint (i.e., without the `--resume_checkpoint` flag), training runs normally on two V100s. But when I pass a checkpoint to continue training, it fails with the following error:


![image](https://github.com/user-attachments/assets/df714e25-5c58-4483-ac87-1146e94cf546)
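
One thing that may be worth ruling out first is whether the checkpoint path resolves correctly from the directory where `mpiexec` is launched. The following is a minimal sketch of such a pre-flight check. `check_checkpoint` is a hypothetical helper name, not part of the training repository, and it only inspects the file on disk, not its contents.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of the training repository):
# confirm that a checkpoint file exists, is readable, and is non-empty
# before passing it to --resume_checkpoint.
check_checkpoint() {
    ckpt="$1"
    if [ ! -f "$ckpt" ]; then
        echo "checkpoint not found: $ckpt" >&2
        return 1
    fi
    if [ ! -r "$ckpt" ]; then
        echo "checkpoint not readable: $ckpt" >&2
        return 2
    fi
    if [ ! -s "$ckpt" ]; then
        echo "checkpoint is empty: $ckpt" >&2
        return 3
    fi
    # Report the size in bytes so a truncated file is easy to spot.
    size=$(wc -c < "$ckpt" | tr -d ' ')
    echo "checkpoint ok: $ckpt ($size bytes)"
}

# Example call, using the path from the script above:
# check_checkpoint ./training_log/CREMI/model039000.pt || exit 1
```

If the file checks out, launching the same command with `mpiexec -n 1` might help show whether the failure only appears with more than one process.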

