AVX128: Zeroing of upper bits can be more optimal with 128-bit operation #3886

Sonicadvance1 · 2024-07-21T20:18:39Z

    "vfmadd231ss xmm0, xmm1, xmm2": {
      "ExpectedInstructionCount": 3,
      "Comment": [
        "Map 2 0b01 0xb9 128-bit"
      ],
      "ExpectedArm64ASM": [
        "fmadd s16, s17, s18, s16",
        "movi v2.2d, #0x0",
        "str q2, [x28, #16]"
      ]
    },

In the common case of an AVX instruction operating at 128-bits we have a movi+str pair to zero the upper 128-bit lane.
There's a couple benefits of this,

We cache this zero register in multiple instruction blocks. This allows multiple 128-bit operations to eat a single move and dead store elimination will eliminate all the stores aside from the last one.
The zero register retains the FPRClass to make RA sane. (This is a note for how an improvement can change the class type)

An annoyance with this approach is that if we know we are storing zero to the upper 128-bit lane, and then storing that zero to the context, it would be more optimal to remove the movi and do an stp xzr, xzr, [x28, #16]

This removes the movi instruction
This uses a GPR stp instruction which is one cycle on Cortex
Removes the movi->stp dependency

The downside to this approach is if that zero register actually needs to be in a vector register, it gets confusing. Think about a 128-bit AVX instruction zeroing the upper bits, and then a 256-bit AVX instruction doing a vector or or something, merging the bottom 128-bit lane but effectively doing a move to the top 128-bits. Need to be careful not to make the intermixed version worse. Which would likely degrade in to a context store, then load to avoid the RA class mishmash when today it could have just stayed as a living value without hitting context.

Will need some noodling to figure out what is best here. Maybe we need to pump the IsInlineConstant value through from OpcodeDispatcher to backend for vector sources but supporting only the source being _LoadNamedVectorConstant(Size, IR::NamedVectorConstant::NAMED_VECTOR_ZERO)?

example of what the previous example could turn in to

    "vfmadd231ss xmm0, xmm1, xmm2": {
      "ExpectedInstructionCount": 2,
      "Comment": [
        "Map 2 0b01 0xb9 128-bit"
      ],
      "ExpectedArm64ASM": [
        "fmadd s16, s17, s18, s16",
        "stp xzr, xzr, [x28, #16]"
      ]
    },

Example of a test case that we don't want to make worse

    "128-bit test": {
      "ExpectedInstructionCount": 6,
      "x86Insts": [
        "vaddps xmm0, xmm1, xmm2",
        "vorps ymm0, ymm3"
      ],
      "ExpectedArm64ASM": [
        "fadd v16.4s, v17.4s, v18.4s",
        "movi v2.2d, #0x0",
        "ldr q3, [x28, #64]",
        "orr v16.16b, v16.16b, v19.16b",
        "orr v2.16b, v2.16b, v3.16b",
        "str q2, [x28, #16]"
      ]
    },

The text was updated successfully, but these errors were encountered:

Sonicadvance1 added the Performance Things that will make FEX faster label Jul 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVX128: Zeroing of upper bits can be more optimal with 128-bit operation #3886

AVX128: Zeroing of upper bits can be more optimal with 128-bit operation #3886

Sonicadvance1 commented Jul 21, 2024 •

edited

Loading

AVX128: Zeroing of upper bits can be more optimal with 128-bit operation #3886

AVX128: Zeroing of upper bits can be more optimal with 128-bit operation #3886

Comments

Sonicadvance1 commented Jul 21, 2024 • edited Loading

Sonicadvance1 commented Jul 21, 2024 •

edited

Loading