
An error is thrown when running run_dlrm_ubench_train_allreduce.sh #61

Open
liligwu opened this issue Nov 29, 2021 · 2 comments
@liligwu (Contributor) commented Nov 29, 2021

When running `mpirun --allow-run-as-root -np 8 -N 8 --bind-to none ./run_dlrm_ubench_train_allreduce.sh -c xxxx`, an error is thrown:

Traceback (most recent call last):
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 133, in <module>
    main()
  File "dlrm/ubench/dlrm_ubench_comms_driver.py", line 106, in main
    comms_main()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1208, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1161, in runBench
    backendObj.benchmark_comms()
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/pytorch_dist_backend.py", line 659, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 1128, in benchTime
    self.reportBenchTime(
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 853, in reportBenchTime
    self.reportBenchTimeColl(commsParams, results, tensorList)
  File "/dockerx/FAMbench/11292021/FAMBench/param/train/comms/pt/comms.py", line 860, in reportBenchTimeColl
    latencyAcrossRanks = np.array(tensorList)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
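
For context, here is a minimal sketch of the failure and a possible workaround, assuming `tensorList` in `reportBenchTimeColl` holds per-rank latency tensors that live on a CUDA device (as the traceback suggests). This is only an illustration, not the repo's actual fix:

```python
import numpy as np
import torch

# Hypothetical stand-in for the tensorList built by the benchmark:
# one latency tensor per rank, each resident on the GPU.
tensorList = [torch.rand(4, device="cuda") for _ in range(8)]

# np.array() cannot read CUDA memory directly, so this raises the
# TypeError shown in the traceback:
#   latencyAcrossRanks = np.array(tensorList)

# Copying each tensor to host memory first avoids the error:
latencyAcrossRanks = np.array([t.cpu() for t in tensorList])
print(latencyAcrossRanks.shape)
```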

@nrsatish (Contributor) commented Dec 7, 2021

@samiwilf can you take a look?

@samiwilf (Contributor) commented Dec 8, 2021

This issue will be resolved by #68.
