An error is thrown when running run_dlrm_ubench_train_allreduce.sh #61

@liligwu

Description

Running the following command throws an error:

    mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx

The traceback is:

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
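
The error message itself points at a possible workaround: copy the tensors to host memory before handing them to np.array. The snippet below is only a sketch of that idea, not the project's fix. `to_host_array` is a hypothetical helper that does not exist in FAMBench or PARAM, and it assumes `tensorList` in `reportBenchTimeColl` is a list of same-shaped tensors, which the traceback does not confirm.

```python
import numpy as np


def to_host_array(tensor_list):
    """Copy each element to host memory, then build one NumPy array.

    Hypothetical helper for illustration; not part of FAMBench/PARAM.
    """
    host = []
    for t in tensor_list:
        # torch.Tensor provides detach(), cpu() and numpy(); cpu() returns
        # the same tensor when it already lives in host memory.
        if hasattr(t, "detach"):
            t = t.detach().cpu().numpy()
        host.append(np.asarray(t))
    return np.array(host)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # demo is skipped when PyTorch is not installed

    if torch is not None:
        # Use the GPU when one is present so the failing case is exercised.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tensors = [torch.tensor([1.0, 2.0], device=device) for _ in range(3)]
        print(to_host_array(tensors).shape)
```

Under those assumptions, the failing line in comms.py would become something like `latencyAcrossRanks = to_host_array(tensorList)`. Whether that is the right place for the change, or whether the tensors should be moved to the CPU earlier in the benchmark, is for the maintainers to decide.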
