Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DCD_DWC2][ESP32P4][HS] Added cache synchronization #2877

Merged

Conversation

roma-jam
Copy link
Contributor

@roma-jam roma-jam commented Nov 14, 2024

Requirements

On ESP32P4 it is important to synchronize cache and memory during DMA transactions.
To use the DMA feature the following should be done:

  • Keep the buffers in section .dram1
  • Align the data or by 0x04 (word, S2/S3) or 0x40 (cache line size, P4)

Thus, the values CFG_TUSB_MEM_SECTION and CFG_TUSB_MEM_ALIGN should be provided, according to the target chip.

Description

Added cache synchronization macroses during xfer preparation/completion:

  • dsync_c2m(_addr, _size) - Synchronizing cache to memory
  • dsync_m2c(_addr, _size) - Synchronizing memory to cache

Related

@roma-jam roma-jam changed the title feature(dcd_dwc2): [ESP32P4] Added cache synchronization [ESP32P4] Added cache synchronization Nov 14, 2024
@roma-jam roma-jam changed the title [ESP32P4] Added cache synchronization [ESP32P4][HS] Added cache synchronization Nov 14, 2024
@roma-jam roma-jam changed the title [ESP32P4][HS] Added cache synchronization [DWC_DWC2][ESP32P4][HS] Added cache synchronization Nov 14, 2024
@roma-jam roma-jam changed the title [DWC_DWC2][ESP32P4][HS] Added cache synchronization [DCD_DWC2][ESP32P4][HS] Added cache synchronization Nov 14, 2024
@roma-jam roma-jam force-pushed the feature/esp32p4_dma_cache_syncronization branch 2 times, most recently from a88dd32 to 2a56780 Compare November 14, 2024 13:07
@roma-jam roma-jam marked this pull request as ready for review November 14, 2024 13:23
@hathach
Copy link
Owner

hathach commented Nov 15, 2024

this is great, give me a bit of time, I am currently in the middle of refactoring dcd dwc2 https://github.com/hathach/tinyusb/tree/enhance-dwc2-dcd

though we should name it dcd_dcache_clean/invalidate() https://github.com/hathach/tinyusb/blob/enhance-dwc2-dcd/src/device/dcd.h#L96 as other ports.

@hathach
Copy link
Owner

hathach commented Nov 18, 2024

@roma-jam just merged #2881 . It mostly clean up and move thing around without behavioral changes, could you mind updating this PR using latest master. Then I will check this out, I am still open to cache sync. FYI, chipidea hs https://github.com/hathach/tinyusb/blob/master/src/portable/chipidea/ci_hs/dcd_ci_hs.c (device) and ehci (host) are one of those port use dcache sync for imxrt (M7) for dtcm. Maybe you can take a peek at for reference.

@HiFiPhile
Copy link
Collaborator

@roma-jam just merged #2881 . It mostly clean up and move thing around without behavioral changes, could you mind updating this PR using latest master. Then I will check this out, I am still open to cache sync. FYI, chipidea hs https://github.com/hathach/tinyusb/blob/master/src/portable/chipidea/ci_hs/dcd_ci_hs.c (device) and ehci (host) are one of those port use dcache sync for imxrt (M7) for dtcm. Maybe you can take a peek at for reference.

You haven't slept yet ? 🤣

I prefer to use non-cacheable memory since it's clear, less dependant to architecture (like RT1170 core M4 has home made cache) and more error proof. Also much easier to setup without alignment and size constraints.

In #2865 I've converted to dcd_ci_hs to use MPU with updated BSP.

@hathach
Copy link
Owner

hathach commented Nov 18, 2024

I have just waked up, I have an early trip to visit relative today :). I still prefer to support both, once we got dcache sync work, there will be an CFG option to disble dcache function, user can then configure their own mpu as they prefer. Otherwise we will impose the mpu requirements on user, some may not be trivial for port such as ehci/ohci since it may also need virtual to physical as well since some of them are cpu

@roma-jam roma-jam force-pushed the feature/esp32p4_dma_cache_syncronization branch from 2a56780 to 8af3402 Compare November 19, 2024 11:34
#endif // DWC2_MEM_CACHE_LINE_SIZE

CFG_TUD_MEM_SECTION struct {
union {
Copy link
Contributor Author

@roma-jam roma-jam Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a manual tweak to protect* memory after the aligned _setup_packet buffer.
I have doubts, that __attribute__((aligned(x)) protects the memory after non-aligned variable, so I decided to solve it this way. At least there is nothing about that in the GCC doc.
Anyway, suggestions are welcome.

I referred to: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-aligned-variable-attribute

For class buffers (such as mscd cbw and csw, cdc epout and epin and so on, it is better to optimize in class driver code and different PR, IMHO). But anyway also should be done.

UPD:
*to protect from being vanished during cache sync operation

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this will be a bit of an headache. gcc attribute alignment only ensure the variale alignement only, its size can be smaler, that would leave rom for other variable/data which can be corrupted when memroy to cache is called. I will try to figure out a way to address this.

@roma-jam
Copy link
Contributor Author

@hathach Thanks, I will check the imxrt (M7) code.
The initial idea I saw in here: 8765568#r1580181750 where is was removed, so I decided to implemented something alike, but just a bit simplified.

@roma-jam roma-jam marked this pull request as draft November 19, 2024 22:23
@hathach
Copy link
Owner

hathach commented Nov 20, 2024

@roma-jam would you giving me the write permission to your fork's branch PR. I have made some updated to the PR to get it compiling and have the DMA somewhat working based on you cach hint. I think the only issue now is the class driver and the m2c that corrupt the data after variable e.g (64-8) after setup when we sync. (these space can be occuiped by other system variable). I think we can introduce an CFG DCACHE LINE SIZE and use that to make sure all buffer (dcd to class driver is place correctly).

tinyusb: ERROR: Permission to roma-jam/tinyusb.git denied to hathach. Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

@roma-jam roma-jam force-pushed the feature/esp32p4_dma_cache_syncronization branch from 8af3402 to b8d31a5 Compare November 20, 2024 12:10
@roma-jam
Copy link
Contributor Author

@hathach,

I have renamed the calls, now they are dcd_dcache_clean/invalidate().
Also, there are several changes in the class driver regarding the buffer optimization, I made them here, you can check the commits from here: espressif#37

Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look good, thank you. However, I need to make a wider code to make dcache more generic and also add more CFG to make it easier for user to override it if needed. Since I cannot push to your fork directly, I push the changes to my repo and make an separate PR here #2883. When this is merge, gh is smaert enough to mark this as merged as well.

@hathach hathach merged commit b8d31a5 into hathach:master Nov 20, 2024
104 of 105 checks passed
@tore-espressif
Copy link
Contributor

tore-espressif commented Nov 22, 2024

Hi all, seems that I'm late for the party, I prepared a sum-up of the discussion, I will leave it here if anyone is ever interested in this thread:

Summary

  1. When the CPU accesses buffers in memory, we must:
    • Either perform explicit cache synchronization.
    • Or access the buffers via non-cacheable addresses (or place them in non-cacheable memory).
  2. Both approaches work for ESPs.
  3. In both cases, the buffer's address and size must be aligned to the cache line size.
  4. The cache line size for ESP32-P4 is configurable. 64 or 128 bytes.
  5. We prefer explicit cache synchronization, but we can accept alternative solutions for the sake of uniformity with other vendors. The reasons are as follows:
    • Code safety:
      If we decide to use non-cacheable addresses for CPU access, we must ensure that no part of the code accesses the same memory through cacheable addresses. This is particularly challenging given that TinyUSB is a third-party component, increasing the risk of regression.
    • Canonical approach:
      Explicit cache synchronization is the standard and expected solution for handling cache coherence. While this might seem like a weak argument, deviating from it could add to our technical debt. Additionally, this approach is more platform-independent, as it relies on standardized synchronization mechanisms provided by most architectures, whereas managing non-cacheable memory regions often involves platform-specific configurations.
    • Future-proofing:
      Accessing memory via non-cacheable addresses only works for internal RAM. External RAM, on the other hand, can only be accessed through the cache. While we currently do not support placing USB buffers in external RAM, this limitation would become a significant blocker in future implementations.

@hathach
Copy link
Owner

hathach commented Nov 22, 2024

Hi all, seems that I'm late for the party, I prepared a sum-up of the discussion, I will leave it here if anyone is ever interested in this thread:

Not at all, there is on-going follow up for dma (device stack), there will be a host PR afterwards

Summary

  1. When the CPU accesses buffers in memory, we must:

    • Either perform explicit cache synchronization.
    • Or access the buffers via non-cacheable addresses (or place them in non-cacheable memory).
  2. Both approaches work for ESPs.

  3. In both cases, the buffer's address and size must be aligned to the cache line size.

  4. The cache line size for ESP32-P4 is configurable. 64 or 128 bytes.

Yeah, we figure that out, it is similar to arm m7 dtcm, therefore we will follow that model. Note DMA is not enabled by default, it must be explicitly set with CFG_TUD_DWC2_DMA_ENABLE=1, CFG_TUD/TUH_MEM_ALIGN must also set accordingly to have correct cache line alignment. CFG_TUD_MEM_DCACHE_LINE_SIZE can also be set by tusb_config.h or CFLAGS, default to 64 for P4

  1. We prefer explicit cache synchronization, but we can accept alternative solutions for the sake of uniformity with other vendors. The reasons are as follows:

    • Code safety:
      If we decide to use non-cacheable addresses for CPU access, we must ensure that no part of the code accesses the same memory through cacheable addresses. This is particularly challenging given that TinyUSB is a third-party component, increasing the risk of regression.
    • Canonical approach:
      Explicit cache synchronization is the standard and expected solution for handling cache coherence. While this might seem like a weak argument, deviating from it could add to our technical debt. Additionally, this approach is more platform-independent, as it relies on standardized synchronization mechanisms provided by most architectures, whereas managing non-cacheable memory regions often involves platform-specific configurations.
    • Future-proofing:
      Accessing memory via non-cacheable addresses only works for internal RAM. External RAM, on the other hand, can only be accessed through the cache. While we currently do not support placing USB buffers in external RAM, this limitation would become a significant blocker in future implementations.

explicit dache clean/invalidate is currently implemented. If user want to use non-cache, can simply override the CFG_TUH_MEM_DCACHE_ENABLE=0 which is enabled when DMA mode is enabled. So I guess we are all good. You wan to review the other PR, note though I made bunch of un-related bsp/freertos changes as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants