Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator show zb_assert at ??:? after some time ... (TZ-651) #255

Open
3 tasks done
Zeboo14 opened this issue Feb 21, 2024 · 13 comments
Open
3 tasks done

Coordinator show zb_assert at ??:? after some time ... (TZ-651) #255

Zeboo14 opened this issue Feb 21, 2024 · 13 comments
Labels

Comments

@Zeboo14
Copy link

Zeboo14 commented Feb 21, 2024

Answers checklist.

  • I have read the documentation ESP Zigbee SDK Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) and ESP Zigbee libs (esp-zboss-lib and esp-zigbee-lib) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5,1

esp-zigbee-lib version.

1.1.1

esp-zboss-lib version.

1.1.1

Espressif SoC revision.

ESPC32C6

What is the expected behavior?

We have a coordinator connected to ~25 ZR (router devices). Every 500ms-1s, a small message is sent to custom cluster, both directions, using esp_zb_zcl_custom_cluster_cmd_req(). Can work fine for hours but sometimes, hit a zb_assert (average of 4 times per day). Can you give us indication on the conditions that what would result in this zb_assert()?

What is the actual behavior?

abort() was called at PC 0x4202b5e7 on core 0
0x4202b5e7: zb_assert at ??:?

Core 0 register dump:
MEPC : 0x4080099e RA : 0x408114ec SP : 0x4084e670 GP : 0x408229d0
0x4080099e: panic_abort at D:/software/espressif/esp-idf/components/esp_system/panic.c:452

0x408114ec: __ubsan_include at D:/software/espressif/esp-idf/components/esp_system/ubsan.c:312

TP : 0x40838080 T0 : 0x37363534 T1 : 0x7271706f T2 : 0x33323130
S0/FP : 0x4084e6a0 S1 : 0x00000001 A0 : 0x4084e6f8 A1 : 0x4084e6ea
A2 : 0x00000000 A3 : 0x4084e725 A4 : 0x4084e6f8 A5 : 0x00000000
A6 : 0x00000000 A7 : 0x76757473 S2 : 0x00000009 S3 : 0x00000009
S4 : 0x00000000 S5 : 0x40829706 S6 : 0x00000000 S7 : 0x00000000
S8 : 0x00000000 S9 : 0x00000000 S10 : 0x00000000 S11 : 0x00000000
T3 : 0x6e6d6c6b T4 : 0x6a696867 T5 : 0x66656463 T6 : 0x62613938
MSTATUS : 0x00001881 MTVEC : 0x40800001 MCAUSE : 0x00000007 MTVAL : 0x00000000
0x40800001: _vector_table at ??:?

MHARTID : 0x00000000

Stack memory:
4084e670: 0x6e6d6c6b 0x6a696867 0x66656463 0x4084e6f8 0x00000001 0x00000000 0x33323130 0x9280f6fa
4084e690: 0x62613938 0x66656463 0x6a696867 0x4084e6d0 0x7271706f 0x76757473 0x7a797877 0x4084e6f8
4084e6b0: 0x00000009 0x00000001 0x4084e6e8 0x9280f6fa 0x00000008 0x4084e6f8 0x4084e740 0x4081eed0
0x4081eed0: ra_to_str at D:/software/espressif/esp-idf/components/newlib/assert.c:24

4084e6d0: 0x00000004 0x4084e6f8 0x4082378c 0x4084e6ec 0x408237a8 0x4084e6e8 0x40840030 0x32303234
4084e6f0: 0x37653562 0x00000000 0x726f6261 0x20292874 0x20736177 0x6c6c6163 0x61206465 0x43502074
4084e710: 0x34783020 0x62323032 0x20376535 0x63206e6f 0x2065726f 0x00000030 0x00000000 0x9280f6fa
4084e730: 0x00000000 0x40829706 0x408296a8 0x4202b5ea 0x00000001 0x00000000 0x4084e778 0x4202ae24
0x4202b5ea: zb_globals_init at ??:?

0x4202ae24: zb_bufpool_storage_free at ??:?

4084e750: 0x00000089 0x00000001 0x408296a8 0x4202a1e6 0x00000001 0x00000000 0x408296b4 0x4203d15c
0x4202a1e6: zb_buf_free_func at ??:?

0x4203d15c: zb_mlme_beacon_notify_indication at ??:?

4084e770: 0x00000000 0x00000000 0x4082cd36 0x40816c34 0x00000000 0x00000001 0x00000001 0x00000000
0x40816c34: vPortExitCritical at D:/software/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/riscv/port.c:339

4084e790: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000001 0x42030d28
0x42030d28: zb_sched_loop_iteration at ??:?

4084e7b0: 0x4083547c 0x4203ccac 0x40090012 0x9280f6fa 0x00000000 0x00000000 0x00000000 0x00000000
0x4203ccac: zb_mlme_beacon_notify_indication at ??:?

4084e7d0: 0x00000000 0x00000000 0x4084e880 0x42057562 0x00000000 0x00000084 0x5264c0f3 0x4201e6e2
0x42057562: zboss_main_loop_iteration at ??:?

0x4201e6e2: esp_zb_main_loop_iteration at ??:?

4084e7f0: 0x00000000 0x00000000 0x4084e880 0x420135a0 0x420132f6 0x00000000 0x00000000 0x40832f64
0x420135a0: esp_zb_task_zc at C:/d/software/rfin_pro_git/rfin_pro/main/zb_coord.c:866 (discriminator 1)

0x420132f6: esp_zb_task_zc at C:/d/software/rfin_pro_git/rfin_pro/main/zb_coord.c:768

4084e810: 0x40838080 0x00000000 0x00000000 0x00000000 0x4087ce50 0x40835820 0x40835868 0x40835984
4084e830: 0x420c4ca4 0x00000000 0x40832f64 0x420122f4 0x00000001 0x420122a4 0x00000003 0x00000000
0x420122f4: zbProdTableTask at C:/d/software/rfin_pro_git/rfin_pro/main/zb_coord.c:200

0x420122a4: zbZcConnectInfoTask at C:/d/software/rfin_pro_git/rfin_pro/main/zb_coord.c:162

4084e850: 0x00000000 0x000000ff 0x00000000 0x54535251 0x38751755 0x54535251 0x38751755 0x9280f6fa
4084e870: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
4084e890: 0x00000000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xbaad5678 0x00004010
4084e8b0: 0xabba1234 0x408352ac 0x00004000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e8d0: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e8f0: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e910: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e930: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e950: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e970: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e990: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e9b0: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e9d0: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084e9f0: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084ea10: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084ea30: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
4084ea50: 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5

Steps to reproduce.

See above

More Information.

No response

@Zeboo14 Zeboo14 added the Bug label Feb 21, 2024
@github-actions github-actions bot changed the title Coordinator show zb_assert at ??:? after some time ... Coordinator show zb_assert at ??:? after some time ... (TZ-651) Feb 21, 2024
@Zeboo14
Copy link
Author

Zeboo14 commented Feb 26, 2024

More: with Zb trace enable, we see that the zb_assert is at:
I (3811569) ZBOSS: common/zb_debug.c:64 Assertion failed /home/xiaqilin/esp/esp-zboss/components/zboss_stack/zboss/common/zb_bufpool_mult.c:1278
OR:
I (1583330) ZBOSS: common/zb_debug.c:64 Assertion failed /home/xiaqilin/esp/esp-zboss/components/zboss_stack/zboss/common/zb_bufpool_mult_storage.c:105

Can you share info on what would cause those two asserts?

Something wrong with internal ZBOSS buffer pools?
Something wrong with my config?

@xieqinan
Copy link
Contributor

Hello @Zeboo14,

Could you please provide the trace log with complete information that would be more helpful?

I (3811569) ZBOSS: common/zb_debug.c:64 Assertion failed /home/xiaqilin/esp/esp-zboss/components/zboss_stack/zboss/common/zb_bufpool_mult.c:1278

From this log, I can only deduce that the stack received a malformed frame, causing it to fail to distinguish the layer header or extract the frame payload.

I (1583330) ZBOSS: common/zb_debug.c:64 Assertion failed /home/xiaqilin/esp/esp-zboss/components/zboss_stack/zboss/common/zb_bufpool_mult_storage.c:105

This log may indicate that an illegal buffer is being freed.

Therefore, a complete crashing trace log is necessary.

@xieqinan
Copy link
Contributor

@Zeboo14 ,

Furthmore, if there are other RTOS tasks in your project, please consider using esp_zb_lock_acquire() and esp_zb_lock_release() to ensure thread safety.

@xieqinan
Copy link
Contributor

@Zeboo14

Furthmore, if there are other RTOS tasks in your project, please consider using esp_zb_lock_acquire() and esp_zb_lock_release() to ensure thread safety.

Does this method work for you? Do you have any other updates on this issue?

@janjaeger
Copy link

I see a very similar issue. This happens on both endpoint and router connected to a Philips Hue bridge. There is no apparent activity related to the endpoint, but once so often I get the below, where it looks indeed like some frame was received which ultimately lead to the error

esp32h2> I (766) zigbee.hpp:15: Zigbee stack initialized type(ZR)
W (776) zigbee.hpp:91: Network(0x7322) closed, devices joining not allowed.
I (776) zigbee.hpp:21: Device started up in non factory-reset mode status: ESP_OK
I (776) zigbee.hpp:30: Device NWK(0x7d94) restarted
Zigbee stack assertion failed common/zb_bufpool_mult.c:1321
abort() was called at PC 0x4202b32b on core 0
Backtrace: 0x40800800:0x40820a30 0x40807f30:0x40820a30 0x4080d312:0x40820a40 0x4202b32c:0x40820ab0
Backtrace ended abruptly: cannot find DWARF information for instruction at address 0x4202b32c
ELF file SHA256: 74288643b

4202b32e t zb_aps_multi_pre_link_key_init
4080d330 t ra_to_str
40807f34 t prvGetDisinheritPriorityAfterTimeout
40800806 t panic_enable_cache

@xieqinan
Copy link
Contributor

@janjaeger ,

but once so often I get the below, where it looks indeed like some frame was received which ultimately lead to the error

Based on the log, the crash seems to be triggered by a malformed frame from the air. Could you please provide the reproduction steps or a Wireshark packet capture to help us verify this issue?

@chshu
Copy link
Collaborator

chshu commented Sep 29, 2024

@Zeboo14 @janjaeger There have been several fixes and enhancements in recent releases, could you pelase test again with the latest esp-zigbee-sdk release. Please share the wireshark capture file if the issue still happens.

@janjaeger
Copy link

Hi @chshu , @Zeboo14 ,

The issue is still occurring (see below).
However, I have been able to isolate this to the groups and scene clusters.

I see this problem only when I have a scene or groups cluster (or both groups + scene) defined.
So, there seems to be something that both the groups and scene cluster have in common that causes this panic to occur.

This error does not occur when I have no scene or groups clusters defined

my definition of the scene and groups cluster are below.

Thanks and best regards,

Jan

  esp_zb_groups_cluster_cfg_t group_cfg = { .groups_name_support_id = ESP_ZB_ZCL_GROUPS_NAME_SUPPORT_DEFAULT_VALUE};
   esp_zb_attribute_list_t *groups_cluster = esp_zb_groups_cluster_create(&group_cfg);

   ESP_ERROR_CHECK(esp_zb_cluster_list_add_groups_cluster(cluster_list, groups_cluster, ESP_ZB_ZCL_CLUSTER_SERVER_ROLE));


   esp_zb_scenes_cluster_cfg_t scene_cfg = {
       .scenes_count = ESP_ZB_ZCL_SCENES_SCENE_COUNT_DEFAULT_VALUE,
       .current_scene = ESP_ZB_ZCL_SCENES_CURRENT_SCENE_DEFAULT_VALUE,
       .current_group = ESP_ZB_ZCL_SCENES_CURRENT_GROUP_DEFAULT_VALUE,
       .scene_valid = ESP_ZB_ZCL_SCENES_SCENE_VALID_DEFAULT_VALUE,
       .name_support = ESP_ZB_ZCL_SCENES_NAME_SUPPORT_DEFAULT_VALUE};
   esp_zb_attribute_list_t *scenes_cluster = esp_zb_scenes_cluster_create(&scene_cfg);

   ESP_ERROR_CHECK(esp_zb_cluster_list_add_scenes_cluster(cluster_list, scenes_cluster, ESP_ZB_ZCL_CLUSTER_SERVER_ROLE));

ZBOSS: common/zb_debug.c:64 Assertion failed /builds/thread_zigbee/esp-zboss/components/zboss_stack/zboss/common/zb_bufpool_mult.c:1321
Zigbee stack assertion failed common/zb_bufpool_mult.c:1321
abort() was called at PC 0x4202f731 on core 0
--- 0x4202f731: zb_assert at ??:?
Backtrace: 0x40802952:0x40823410 0x4080b91a:0x40823420 0x40812c90:0x40823440 0x4202f732:0x408234b0
--- 0x40802952: panic_abort at /COMPONENT_ESP_SYSTEM_DIR/panic.c:468
0x4080b91a: esp_system_abort at /COMPONENT_ESP_SYSTEM_DIR/port/esp_system_chip.c:92
0x40812c90: abort at /COMPONENT_NEWLIB_DIR/abort.c:38
0x4202f732: zb_assert at ??:?
Backtrace ended abruptly: cannot find DWARF information for instruction at address 0x4202f732
--- 0x4202f732: zb_assert at ??:?
ELF file SHA256: dcd70e457
Rebooting...

@janjaeger
Copy link

Hi @chshu ,

Some more info, I am trying to capture the packet that causes the error to occur, which is a bit cumbersome because 1) it does not happen that often, and 2) I and up with megabytes of trace data, which is then difficult to correlate

However, I also see the following message:

ZBOSS: nwk/nwk_main.c:2339 Drop packet with improper protocol version: 0x5
yet, I cannot find any protocol 5 in my pcap trace. I assume this is the protocol version as below, but that means that we do have an overlay somewhere

Frame Control Field: 0x1209, Frame Type: Command, Discover Route: Suppress, Security, Extended Source Command
    .... .... .... ..01 = Frame Type: Command (0x1)
    .... .... ..00 10.. = Protocol Version: 2
    .... .... 00.. .... = Discover Route: Suppress (0x0)
    .... ...0 .... .... = Multicast: False
    .... ..1. .... .... = Security: True
    .... .0.. .... .... = Source Route: False
    .... 0... .... .... = Destination: False
    ...1 .... .... .... = Extended Source: True
    ..0. .... .... .... = End Device Initiator: False

@janjaeger
Copy link

Some more analysis - The version 5 error occurs about 75 times in a 14 hour period- I have a full pcap trace from this entire period, and no v5 frame is recorded, all frames are v2

@janjaeger
Copy link

Hi @chshu ,

The panic consistently occurs after receiving either a 'move to color' or 'move to level' frame.
Screenshots of the frames attached.

Best regards,
Jan

move2level move2color

@xieqinan
Copy link
Contributor

xieqinan commented Dec 2, 2024

Hi @janjaeger ,

I have reviewed the information you provided, and the implementation of the group and scene cluster as well as the move to level command looks fine to me. Could you please provide a patch to reproduce this issue based on the example in the esp-zigbee-sdk?

@xieqinan
Copy link
Contributor

xieqinan commented Dec 2, 2024

@janjaeger ,

#255 (comment) , This is a normal log, and I don't think it is the root cause of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants