Hi,
Inf1 failed to execute the model after a long time. Here are the logs:
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,642][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,642][pid=27][WARNING] re-try times 1
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,661][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,661][pid=27][WARNING] re-try times 1
[2024-11-06 06:40:52,030][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:40:58,041][pid=28][INFO] current sqs message num is 0
[2024-11-06 06:41:08,115][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:41:14,125][pid=28][INFO] current sqs message num is 0
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:notification_consume_error_block Error notifications found on nd0 nc1; action=INFER_ERROR_SUBTYPE_MODEL; error_id=8; error string:Event double set
2024-Nov-06 06:41:16.0646 2895:2895 ERROR NMGR:dlr_infer Inference completed with err: 5
[2024-11-06 06:41:16,650][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
    return self.forward(*xargs, **kwargs)
  File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
    ret = self.model(*args)
  File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
    model = _NeuronGraph_1981.model
    _337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
    _338 = ops.neuron.forward_v2_1(_337, model)
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
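Since the same traceback repeats several times, the runtime lines are easy to lose. Below is a minimal sketch of how the log could be condensed into a tally of runtime errors and a list of timeouts. The regular expressions assume the line layout seen in this log (`<date> <time> <pid>:<tid> ERROR <component>:<function> <message>`); this is not a documented Neuron log format, and `summarize` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Assumed layout of a runtime line, inferred only from the log in this issue:
#   2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:<function> <message>
RUNTIME_ERROR = re.compile(
    r"\d{4}-[A-Z][a-z]{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ "
    r"\d+:\d+ ERROR (?P<component>[A-Z]+):(?P<function>\w+)"
)

# Matches both timeout messages that appear in the log
# ("model start timeout" and "inference timeout").
TIMEOUT = re.compile(
    r"(?P<kind>model start|inference) timeout \((?P<ms>\d+) ms\) "
    r"on Neuron Device (?P<device>\d+) NC (?P<core>\d+)"
)


def summarize(log_text):
    """Return (error counts per component/function, list of timeouts).

    Uses finditer over the whole text, so it works whether or not the
    log still has its original line breaks.
    """
    counts = Counter(
        (m["component"], m["function"]) for m in RUNTIME_ERROR.finditer(log_text)
    )
    timeouts = [
        (m["kind"], int(m["ms"]), int(m["device"]), int(m["core"]))
        for m in TIMEOUT.finditer(log_text)
    ]
    return counts, timeouts


# Two lines copied from the log above, used as a small self-check.
SAMPLE = (
    "2024-Nov-06 06:40:46.0629 2895:2895 ERROR "
    "TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) "
    "model start timeout (2000 ms) on Neuron Device 0 NC 1, "
    "waiting for execution completion notification "
    "2024-Nov-06 06:41:16.0646 2895:2895 ERROR "
    "TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) "
    "inference timeout (30000 ms) on Neuron Device 0 NC 1, "
    "waiting for execution completion notification"
)

if __name__ == "__main__":
    counts, timeouts = summarize(SAMPLE)
    for (component, function), n in counts.most_common():
        print(f"{component}:{function} x{n}")
    for kind, ms, device, core in timeouts:
        print(f"{kind} timeout after {ms} ms on device {device}, NC {core}")
```

Run against the full log, this should surface the two model start timeouts (2000 ms, on NC 1 and NC 0) followed by the inference timeout (30000 ms, on NC 1) without the repeated TorchScript tracebacks in between.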