Following up from yesterday's discussion,

Overview

Using MTE (https://www.kernel.org/doc/html/latest/arm64/memory-tagging-extension.html), which tracks a 4-bit tag per 16 bytes of memory, we can mark every 16 bytes of memory as either being local to a cpu core, in which case non-tso accesses are fine, or as shared, in which case accesses need to be tso. MTE gives us 4 bits, so we can track up to 15 cpu cores plus 1 id used for shared memory.

The rseq (https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/) kernel feature can be used to detect when a task is switched between cpu cores.
We can then detect non-tso accesses to shared memory and tso accesses to local memory via a synchronous tag mismatch exception, and backpatch the offending instruction to use slower, tso instructions.
The optimisation depends on local/shared memory access being a characteristic of the specific memory operation. While this is likely to be true overall, functions like memcpy will have to deal with both shared and local memory.

There are several limitations to this approach; however, it should still be useful for instrumentation and pgo.
Limitations
Ping-ponging between tso and non-tso access modes. I'm not sure there's a good workaround for that, as I don't think we can allow TSO ops to work on both local and shared memory using MTE
Thunks could have issues
Object code caching would be possible, but backpatching has to be done per vm group, increasing the memory load
Linux doesn't support PROT_MTE on non-RAM file-backed mappings; PROT_MTE is only supported on MAP_ANONYMOUS and RAM-based file mappings (tmpfs, memfd)
Can only support up to 15 host cpu cores
While tso ops only need to be used for shared memory accesses, not all shared memory accesses need to be tso. It is impossible to detect shared accesses that don't need to be tso using this approach.
Variations
The actual memops could be interpreted in the segfault handler, to guarantee forward progress and/or to limit backpatching
Details
Setup
Compute HostCoreId such that the ids are either [0, 14] with 15 reserved for shared memory, or [1, 15] with 0 reserved for shared memory.
Extend the guest cpu frame to have a "current cpu core mte id" field, frame->HostCoreId
Dedicate a caller saved register in the jit to shadow this value, rHostCoreId
Reload rHostCoreId from frame->HostCoreId on every re-entry into the jit abi
Using rseq, keep frame->HostCoreId in sync with the current HostCoreId. An alternative is to read from the rseq core id field.
Using rseq, update rHostCoreId on cpu core migration if the code is in the jit abi
This discussion was converted from issue #1731 on August 10, 2022 17:20.

Assuming 0 is used to indicate shared memory:

local memops are non-tso, tagged with the core id
shared memops are tso, tagged with 0

local -> shared migration & backpatching:
SIGSEGV with .si_code = SEGV_MTESERR where the offending memop is a non-tso memop
Set the .si_addr TAG to 0
Backpatch to a tso memop

shared -> local migration & backpatching:
SIGSEGV with .si_code = SEGV_MTESERR where the offending memop is a tso memop
Set the .si_addr TAG to CoreId
Backpatch to a non-tso memop