[Inf1] RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded #1028

Open
PigletOS opened this issue Nov 6, 2024 · 0 comments
Labels
bug Something isn't working Inf1

PigletOS commented Nov 6, 2024

Hi,

Inf1 failed to execute the model after a long time. Here are the logs:

2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0629 2895:2895 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,642][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,642][pid=27][WARNING] re-try times 1
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_kelf_start_no_lock Model (1001) start failed for VNC=0, ret: 5
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:tpbs_infer_lock Failed to start model
2024-Nov-06 06:40:46.0647 2894:2894 ERROR NMGR:dlr_infer Failed to acquire infer locks
[2024-11-06 06:40:46,661][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
[2024-11-06 06:40:46,661][pid=27][WARNING] re-try times 1
[2024-11-06 06:40:52,030][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:40:58,041][pid=28][INFO] current sqs message num is 0
[2024-11-06 06:41:08,115][pid=28][INFO] No message. Put worker to sleep for a while...
[2024-11-06 06:41:14,125][pid=28][INFO] current sqs message num is 0
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (start:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification
2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:notification_consume_error_block Error notifications found on nd0 nc1; action=INFER_ERROR_SUBTYPE_MODEL; error_id=8; error string:Event double set
2024-Nov-06 06:41:16.0646 2895:2895 ERROR NMGR:dlr_infer Inference completed with err: 5
[2024-11-06 06:41:16,650][pid=27][ERROR] error is The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
, trace back log Traceback (most recent call last):
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/engine.py", line 43, in __call__
return self.forward(*xargs, **kwargs)
File "/root/miniconda/lib/python3.9/site-packages/DeepEngine/engines/TorchNeuron.py", line 135, in forward
ret = self.model(*args)
File "/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_3435.py", line 373, in forward
model = _NeuronGraph_1981.model
_337 = [argument_2, argument_3, argument_4, argument_5, argument_6, _306, _321, argument_9, _336, argument_11, argument_12, _216, argument_14, argument_15, _261, _276, argument_18, argument_19, argument_20, argument_21, argument_22, argument_23, argument_24]
_338 = ops.neuron.forward_v2_1(_337, model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _338
Traceback of TorchScript, original code (most recent call last):
/root/miniconda/lib/python3.9/site-packages/torch/_ops.py(442): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(416): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(580): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(209): run_op
/root/miniconda/lib/python3.9/site-packages/torch_neuron/graph.py(198): __call__
/root/miniconda/lib/python3.9/site-packages/torch_neuron/runtime.py(69): forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/root/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py(1194): _call_impl
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(976): trace_module
/root/miniconda/lib/python3.9/site-packages/torch/jit/_trace.py(759): trace
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(324): tb_parse
/root/miniconda/lib/python3.9/site-packages/torch_neuron/tensorboard.py(550): tb_graph
/root/miniconda/lib/python3.9/site-packages/torch_neuron/decorators.py(526): maybe_generate_tb_graph_def
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(580): maybe_determine_names_from_tensorboard
/root/miniconda/lib/python3.9/site-packages/torch_neuron/convert.py(233): trace
/root/miniconda/lib/python3.9/site-packages/DeepEngine/tools/convert_onnx_to_torch_neuron.py(62): convert_onnx2neuron
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(24): run
/data1/zhuxinyu/GitLab/DeepEngine/scripts/convert/convert_neuron_from_s3.py(67): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded
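
For triage, the timeout events in the logs above can be pulled out and counted per NeuronCore. This is a minimal sketch that assumes the log line format shown in this issue. `TIMEOUT_RE` and `summarize_timeouts` are hypothetical helper names, not part of the Neuron SDK or torch_neuron, and the sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Matches the runtime timeout lines as they appear in this issue's logs, e.g.
# "... model start timeout (2000 ms) on Neuron Device 0 NC 1, ..."
# The format is an assumption based only on the lines posted here.
TIMEOUT_RE = re.compile(
    r"(?P<kind>model start|inference) timeout \((?P<ms>\d+) ms\) "
    r"on Neuron Device (?P<device>\d+) NC (?P<nc>\d+)"
)


def summarize_timeouts(log_text):
    """Count timeout events keyed by (kind, timeout_ms, device, neuron_core)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            key = (
                match["kind"],
                int(match["ms"]),
                int(match["device"]),
                int(match["nc"]),
            )
            counts[key] += 1
    return counts


# Three timeout lines copied from the logs in this issue.
sample = "\n".join([
    "2024-Nov-06 06:40:46.0629 2895:2895 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification",
    "2024-Nov-06 06:40:46.0647 2894:2894 ERROR TDRV:consume_model_start_extra_notifications_v1(FATAL-RT-UNDEFINED-STATE) model start timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification",
    "2024-Nov-06 06:41:16.0646 2895:2895 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (30000 ms) on Neuron Device 0 NC 1, waiting for execution completion notification",
])

for key, count in sorted(summarize_timeouts(sample).items()):
    print(key, count)
```

On the lines above this reports two `model start` timeouts (NC 0 and NC 1) and one `inference` timeout on NC 1, which makes it easier to see that both NeuronCores on device 0 are affected.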
aws-taylor added the Inf1 and bug labels on Nov 8, 2024