In the common case of an AVX instruction operating at 128 bits, we emit a movi+str pair to zero the upper 128-bit lane.
There are a couple of benefits to this:
- We cache this zero register in multi-instruction blocks. This lets multiple 128-bit operations share a single movi, and dead store elimination removes all of the stores except the last one.
- The zero register retains the FPRClass, which keeps register allocation (RA) sane. (This is a note for how an improvement could change the class type.)
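As a rough illustration of the first benefit, here is a minimal sketch of dead store elimination over repeated upper-lane stores. The `Op`, `Inst`, `EliminateDeadContextStores`, and `Count` names are hypothetical and are not FEX's actual IR or passes. The sketch also assumes nothing in the block reads the context between the stores.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, heavily simplified IR. This is NOT FEX's real IR.
enum class Op { LoadZeroVector, StoreContext, VectorAdd };

struct Inst {
  Op op;
  uint32_t contextOffset; // only meaningful for StoreContext
};

// Drops every StoreContext that is overwritten by a later store to the same
// offset in the block. Assumes no instruction in the block reads the context
// in between, which is what makes the earlier stores dead.
std::vector<Inst> EliminateDeadContextStores(const std::vector<Inst>& block) {
  std::unordered_set<uint32_t> seenOffsets;
  std::vector<Inst> kept;
  // Walk backwards so the last store to each offset is encountered first.
  for (auto it = block.rbegin(); it != block.rend(); ++it) {
    if (it->op == Op::StoreContext && !seenOffsets.insert(it->contextOffset).second) {
      continue; // a later store already covers this offset
    }
    kept.push_back(*it);
  }
  std::reverse(kept.begin(), kept.end()); // restore program order
  return kept;
}

// Counts instructions of one kind, to compare a block before and after.
std::size_t Count(const std::vector<Inst>& block, Op op) {
  return static_cast<std::size_t>(std::count_if(
      block.begin(), block.end(), [op](const Inst& i) { return i.op == op; }));
}
```

With one cached `LoadZeroVector` followed by three 128-bit operations that each store zero to offset 16, only the final store survives, so the block pays for one move and one store.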
An annoyance with this approach is that if we know we are storing zero to the upper 128-bit lane, and then storing that zero to the context, it would be better to remove the movi and do an `stp xzr, xzr, [x28, #16]` instead:
- This removes the movi instruction.
- This uses a GPR stp instruction, which is one cycle on Cortex.
- This removes the movi->str dependency.
The downside to this approach is that it gets confusing if that zero actually needs to be in a vector register. Think about a 128-bit AVX instruction zeroing the upper bits, followed by a 256-bit AVX instruction doing a vector OR or something similar, which merges the bottom 128-bit lane but is effectively a move into the top 128 bits. We need to be careful not to make the intermixed version worse. It would likely degrade into a context store followed by a load to avoid the RA class mishmash, when today the value could have just stayed live without hitting the context.
Will need some noodling to figure out what is best here. Maybe we need to pump the `IsInlineConstant` value through from OpcodeDispatcher to the backend for vector sources, but only support the source being `_LoadNamedVectorConstant(Size, IR::NamedVectorConstant::NAMED_VECTOR_ZERO)`?
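One way to picture that idea is the sketch below, which is only an assumption about how a backend might use such a flag and not FEX's actual emitter. `UpperLaneStore`, `EmitUpperLaneZero`, and the scratch register `v2` are hypothetical. `x28` as the context pointer is taken from the `stp` example above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a store to the upper 128-bit lane.
struct UpperLaneStore {
  bool sourceIsInlineZero; // models an IsInlineConstant-style flag for NAMED_VECTOR_ZERO
  bool zeroNeededAsVector; // true when a later vector op also consumes the zero
  unsigned contextOffset;  // byte offset from the context pointer (x28)
};

// Returns the AArch64 instructions a toy backend would emit, as text.
std::vector<std::string> EmitUpperLaneZero(const UpperLaneStore& s) {
  // A 128-bit str with an immediate offset needs a 16-byte multiple here.
  assert(s.contextOffset % 16 == 0);
  const std::string addr = "[x28, #" + std::to_string(s.contextOffset) + "]";

  if (s.sourceIsInlineZero && !s.zeroNeededAsVector) {
    // Zero only ever reaches the context: store it straight from xzr.
    return {"stp xzr, xzr, " + addr};
  }
  // Intermixed case: keep the zero live in a vector register so a following
  // 256-bit operation can use it without a context store and reload.
  return {"movi v2.2d, #0", "str q2, " + addr};
}
```

The second branch is there so the intermixed 128-bit and 256-bit case keeps today's behaviour instead of getting worse.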
Example of what the previous example could turn into
Example of a test case that we don't want to make worse