An error is thrown when running run_dlrm_ubench_train_allreduce.sh #61

When running `mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx`, an error is thrown:

```
Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```
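
The error message itself points at a possible workaround: copy each tensor to host memory before handing it to NumPy. The sketch below shows one way to do that. It is not the upstream fix. `tensors_to_numpy` is a hypothetical helper name, and it assumes `tensorList` is a list of tensors (it accepts any object exposing `detach()`, `cpu()` and `numpy()`, such as `torch.Tensor`).

```python
import numpy as np


def tensors_to_numpy(tensor_list):
    """Copy each tensor to host memory, then stack the results into one array.

    Hypothetical helper: works with any object exposing detach(), cpu() and
    numpy(), such as torch.Tensor. For a tensor that already lives on the
    CPU, cpu() simply returns the same tensor.
    """
    host_arrays = []
    for tensor in tensor_list:
        # detach() drops autograd tracking; cpu() copies device memory to host;
        # numpy() then exposes the host data as a NumPy array.
        host_arrays.append(tensor.detach().cpu().numpy())
    return np.array(host_arrays)
```

Under these assumptions, the failing line in `reportBenchTimeColl` (`latencyAcrossRanks = np.array(tensorList)`) might become `latencyAcrossRanks = tensors_to_numpy(tensorList)`. If `tensorList` turns out to be a single tensor rather than a list, `tensorList.cpu().numpy()` would likely be the equivalent change.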
