-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RT1170 enhancements #2865
base: master
Are you sure you want to change the base?
RT1170 enhancements #2865
Conversation
CFLAGS += \ | ||
-D__STARTUP_CLEAR_BSS \ | ||
-DCFG_TUSB_MCU=OPT_MCU_MIMXRT1XXX \ | ||
-DCFG_TUSB_MEM_SECTION='__attribute__((section("NonCacheable")))' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On M7 core NonCacheable
is located on DTCM so there is no need to add a if switch.
Hi @mastupristi, I've managed to run TinyUSB stack on M4 core. The DMA controller inside USB IP can't access M4 core's TCM so packet buffer must be placed in OCRAM, it's done by |
Hi @HiFiPhile
Just wanted to share some great news—I ran an initial test ( Thanks so much for putting this together so quickly. We’re thrilled with the progress and super grateful for your help. |
as mentioned in esp32p4 cache, I would still wnat to keep the dcache clean/invidate. IMO, it does not hurt performance but actual improve it (depending on the usage). As one of the main difference between M7 and M3/M4 is actually the cache (data + instruction). M7 can run insanely fast (up to 1Ghz), and can perform lots of computattion on data e.g video before passing it to USB/DMA for transfer. |
I was thinking about add back cache support for M7 core but didn't have the time. Secondary M4 core uses a customized cache controller which is more complicated.
I did a little test based on
code
volatile uint32_t clock_cycles_counter;
volatile unsigned int *DWT_CYCCNT = (uint32_t *)0xE0001004; //address of the register
volatile unsigned int *DWT_CONTROL = (uint32_t *)0xE0001000; //address of the register
volatile unsigned int *SCB_DEMCR = (uint32_t *)0xE000EDFC; //address of the register
uint16_t i2s_dummy_buffer2[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
CFG_TUSB_MEM_SECTION uint16_t i2s_dummy_buffer3[CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO][CFG_TUD_AUDIO_FUNC_1_N_CHANNELS_TX*CFG_TUD_AUDIO_FUNC_1_SAMPLE_RATE/1000/CFG_TUD_AUDIO_FUNC_1_N_TX_SUPP_SW_FIFO];
// in main()
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer3, i2s_dummy_buffer, sizeof(i2s_dummy_buffer3));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
// Test cached location again to ensure memcpy is in ICACHE
clock_cycles_counter = 0;
*SCB_DEMCR = *SCB_DEMCR | 0x01000000;
*DWT_CYCCNT = 0;
*DWT_CONTROL |= 1;
memcpy(i2s_dummy_buffer2, i2s_dummy_buffer, sizeof(i2s_dummy_buffer2));
SCB_CleanDCache_by_Addr((uint32_t*)i2s_dummy_buffer2, sizeof(i2s_dummy_buffer2));
*DWT_CONTROL &= ~1;
clock_cycles_counter = *DWT_CYCCNT;
asm("bkpt 0x55"); Although OCRAM is only clocked at fCPU/4 (for STM32H7 is fCPU/2), performance on non cached location is higher.
Still it is slow to copy 384 bytes, I expect ~500 cycles to copy 96 words counting 4 cycles per word. |
@HiFiPhile thanks for the detailed test ersult, though this may not reflect all usage. Dcache clean/invalidate as any solution does introduce overhead, in a scenario when user need to do heavy computation on memory such as encrypting a large block of bytes and/or lots of video/dsp processing. It would outweight the overhead. As general rule of thumb for cpu world, I still think the more cache the better/faster in general. |
Anyway for most applications TCM is used as the buffer, as the default linker script. In a later test when both buffer are inside TCM I got a result of less than 500 cycles, while testing from OCRAM without cache clean still cost 900 cycles. |
I have no idea though, tbh I am new to these dcache as well. These mpu configuration is also bit complicated for me :) |
Describe the PR
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBOARD=mimxrt1170_evkb -DM4=1 -G Ninja -B rt1170_cm4
make BOARD=mimxrt1170_evkb M4=1
PS: MCHP has nice write-up on cache https://ww1.microchip.com/downloads/en/DeviceDoc/Managing-Cache-Coherency-on-Cortex-M7-Based-MCUs-DS90003195A.pdf