Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RT1170 enhancements #2865

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

RT1170 enhancements #2865

wants to merge 4 commits into from

Conversation

HiFiPhile
Copy link
Collaborator

@HiFiPhile HiFiPhile commented Nov 2, 2024

Describe the PR

  • Replace cache clean/invalidate by MPU config. Since we can't guarantee buffer sizes are multiple of cache line size, doing cache clean/invalidate can cause data consistency issue, it also hurts performance. Use MPU to set RAM as non-cacheable like mcux-sdk example.
  • Add M4 core ram image build support
    • cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBOARD=mimxrt1170_evkb -DM4=1 -G Ninja -B rt1170_cm4
    • make BOARD=mimxrt1170_evkb M4=1

PS: MCHP has nice write-up on cache https://ww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Coherency-on-Cortex-M7-Based-MCUs-DS90003195A.pdf

CFLAGS += \
-D__STARTUP_CLEAR_BSS \
-DCFG_TUSB_MCU=OPT_MCU_MIMXRT1XXX \
-DCFG_TUSB_MEM_SECTION='__attribute__((section("NonCacheable")))' \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On M7 core NonCacheable is located on DTCM so there is no need to add a if switch.

@HiFiPhile
Copy link
Collaborator Author

Hi @mastupristi,

I've managed to run TinyUSB stack on M4 core. The DMA controller inside USB IP can't access M4 core's TCM so packet buffer must be placed in OCRAM, it's done by CFG_TUSB_MEM_SECTION=__attribute__((section("NonCacheable"))), it is a section defined by default linker script.
Also memory section attribute is configured by BOARD_ConfigMPU();

@mastupristi
Copy link

Hi @HiFiPhile

I've managed to run TinyUSB stack on M4 core. The DMA controller inside USB IP can't access M4 core's TCM so packet buffer must be placed in OCRAM, it's done by CFG_TUSB_MEM_SECTION=__attribute__((section("NonCacheable"))), it is a section defined by default linker script. Also memory section attribute is configured by BOARD_ConfigMPU();

Just wanted to share some great news—I ran an initial test (cdc_msc example) of your branch for the CM4 on the RT1170, and it worked like a charm! 🎉

Here what my kernel say:
image

Thanks so much for putting this together so quickly. We’re thrilled with the progress and super grateful for your help.
My colleagues and I will also try to review the code soon

@hathach
Copy link
Owner

hathach commented Nov 27, 2024

as mentioned in esp32p4 cache, I would still wnat to keep the dcache clean/invidate. IMO, it does not hurt performance but actual improve it (depending on the usage). As one of the main difference between M7 and M3/M4 is actually the cache (data + instruction). M7 can run insanely fast (up to 1Ghz), and can perform lots of computattion on data e.g video before passing it to USB/DMA for transfer.
@HiFiPhile If you are busy, I can make pull and make the change myself later on when I got time.

@HiFiPhile
Copy link
Collaborator Author

as mentioned in esp32p4 cache, I would still want to keep the dcache clean/invidate. IMO, it does not hurt performance but actual improve it (depending on the usage).

I was thinking about add back cache support for M7 core but didn't have the time. Secondary M4 core uses a customized cache controller which is more complicated.

As one of the main difference between M7 and M3/M4 is actually the cache (data + instruction). M7 can run insanely fast (up to 1Ghz), and can perform lots of computation on data e.g video before passing it to USB/DMA for transfer.

I did a little test based on audio_4_channel_mic example:

  • ICache and DCache are default ON
  • Change the linker to put data on OCRAM (default is DTCM)
  • Copy i2s_dummy_buffer to cached location and clean cache
  • Copy i2s_dummy_buffer to non cached location
  • Compare cycle counter
code

volatile uint32_t clock_cycles_counter;
volatile unsigned int *DWT_CYCCNT = (uint32_t *)0xE0001004; //address of the register
volatile unsigned int *DWT_CONTROL = (uint32_t *)0xE0001000; //address of the register
volatile unsigned int *SCB_DEMCR = (uint32_t *)0xE000EDFC; //address of the register

uint16_t i2s_dummy_buffer2[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
CFG_TUSB_MEM_SECTION uint16_t i2s_dummy_buffer3[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];

  // in main()
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
  
  SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer3, i2s_dummy_buffer, sizeof(i2s_dummy_buffer3));
  
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  // Test cached location again to ensure memcpy is in ICACHE
  clock_cycles_counter = 0;
  *SCB_DEMCR = *SCB_DEMCR | 0x01000000;
  *DWT_CYCCNT = 0;
  *DWT_CONTROL |=  1;
  
  memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
  
  SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
  *DWT_CONTROL &= ~1;
  clock_cycles_counter = *DWT_CYCCNT;
  
  asm("bkpt 0x55");

Although OCRAM is only clocked at fCPU/4 (for STM32H7 is fCPU/2), performance on non cached location is higher.

Cached Non-cached Cached 2nd loop
1121 963 1036

Still it is slow to copy 384 bytes, I expect ~500 cycles to copy 96 words counting 4 cycles per word.

@hathach
Copy link
Owner

hathach commented Nov 28, 2024

@HiFiPhile thanks for the detailed test ersult, though this may not reflect all usage. Dcache clean/invalidate as any solution does introduce overhead, in a scenario when user need to do heavy computation on memory such as encrypting a large block of bytes and/or lots of video/dsp processing. It would outweight the overhead. As general rule of thumb for cpu world, I still think the more cache the better/faster in general.

@HiFiPhile
Copy link
Collaborator Author

Anyway for most applications TCM is used as the buffer, as the default linker script.

In a later test when both buffer are inside TCM I got a result of less than 500 cycles, while testing from OCRAM without cache clean still cost 900 cycles.
I'm curious to find out why cache is slower than TCM in this case.

@hathach
Copy link
Owner

hathach commented Nov 28, 2024

Anyway for most applications TCM is used as the buffer, as the default linker script.

In a later test when both buffer are inside TCM I got a result of less than 500 cycles, while testing from OCRAM without cache clean still cost 900 cycles. I'm curious to find out why cache is slower than TCM in this case.

I have no idea though, tbh I am new to these dcache as well. These mpu configuration is also bit complicated for me :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants