
An error is thrown when running run_dlrm_ubench_train_allreduce.sh #61

Open
liligwu opened this issue Nov 29, 2021 · 2 comments
@liligwu (Contributor) commented Nov 29, 2021

When running `mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx`, an error is thrown:

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
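
For context, here is a minimal sketch of the failure and a possible workaround, assuming `tensorList` in `reportBenchTimeColl` holds per-rank latency tensors that live on a CUDA device (as the traceback suggests). This is only an illustration, not the repo's actual fix:

```python
import numpy as np
import torch

# Hypothetical stand-in for the tensorList built by the benchmark:
# one latency tensor per rank, each resident on the GPU.
tensorList = [torch.rand(4, device="cuda") for _ in range(8)]

# np.array() cannot read CUDA memory directly, so this raises the
# TypeError shown in the traceback:
#   latencyAcrossRanks = np.array(tensorList)

# Copying each tensor to host memory first avoids the error:
latencyAcrossRanks = np.array([t.cpu() for t in tensorList])
print(latencyAcrossRanks.shape)
```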

@nrsatish (Contributor) commented Dec 7, 2021

@samiwilf can you take a look?

@samiwilf (Contributor) commented Dec 8, 2021

This issue will be resolved by #68.
