InvMixColumn() optimization #159

dmitrystu · 2020-03-08T13:05:48Z

This will optimize GF(2^8) math in InvMixColumns().

With arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release)

arm-none-eabi-gcc -mcpu=cortex-m3 --function-sections -Os -c aes.c -o aes.o
arm-none-eabi-nm -S aes.o
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 0) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 000001ac t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 1) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000164 t InvCipher
00000000 0000007c t KeyExpansion
00000000 00000032 t Multiply
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- AFTER ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000140 t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime

kokke · 2021-12-20T20:18:20Z

Hi @dmitrystu and thanks for contributing this optimization with great comments! Sorry for keeping you waiting so long.

I think the original purpose of the multiply-as-a-function was for 8-bit devices. But I can't remember anymore. @DamonHD do you have any insights into the implications for the 8-bit crowd?

dmitrystu · 2021-12-21T01:17:47Z

That's not a problem. I'll try to explain this commit to you.
No matter what's the reason to use Multiply as a function or as a macro. This commit optimizes a number of the math operations.
Each Multiply(x,y) over GF8 generally uses 10 multiply ⊗2 and 4 ⊕. In some cases, the number of operations will be reduced by the compiler.
So, originally A' = (A ⊗ 0x0e) ⊕ (B ⊗ 0x0b) ⊕ (C ⊗ 0x0d) ⊕ (D ⊗ 0x09)
Since 0x0E = 8 ⊕ 4 ⊕ 2 -> A⊗0x0E = (A⊗8) ⊕ (A⊗4) ⊕ (A⊗2)
Therefore
A' = (A⊗8 ⊕ A⊗4 ⊕ A⊗2) ⊕ (B⊗8 ⊕ B⊗2 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C) ⊕ (D⊗8 ⊕ D)
B' = (A⊗8 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B⊗2) ⊕ (C⊗8 ⊕ C⊗2 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D)
C' = (A⊗8 ⊕ A⊗4 ⊕ A) ⊕ (B⊗8 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C⊗2) ⊕ (D⊗8 ⊕ D⊗2 ⊕ D)
D' = (A⊗8 ⊕ A⊗2 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B) ⊕ (C⊗8 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D⊗2)
You will get this if you expand a Multiply(x,y) macro for the A', B', C', D'
We have a 16 X⊗8, 8 X⊗4, 8 X⊗2, and 40 ⊕ ops.
Let's optimize something
Since (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C) = A ⊕ B ⊕ C and A⊗(B ⊕ C) = (A⊗B) ⊕ (A⊗C) and A ⊕ A = 0
Let T = (A ⊕ B ⊕ C ⊕ D)⊗8 ⊕ (A ⊕ B ⊕ C ⊕ D)
Then
A' = T ⊕ (A ⊕ C)⊗4 ⊕ (A ⊕ B)⊗2 ⊕ A
B' = T ⊕ (B ⊕ D)⊗4 ⊕ (B ⊕ C)⊗2 ⊕ B
C' = T ⊕ (A ⊕ C)⊗4 ⊕ (C ⊕ D)⊗2 ⊕ C
D' = T ⊕ (B ⊕ D)⊗4 ⊕ (D ⊕ A)⊗2 ⊕ D
Let calculate T as
T = (A ⊕ B ⊕ C ⊕ D); T = T ⊕ T⊗8
And reuse (A ⊕ C)⊗4 and (B ⊕ D)⊗4

Num of ops	Old	New	Remark
⊗8	16	1	xtime(xtime(xtime(X)))
⊗4	8	2	xtime(xtime(X))
⊗2	8	4	xtime(X)
⊕	40	18	XOR

Not bad, huh?

kokke · 2021-12-21T15:49:15Z

Thanks for the explanation - I was also able to follow the comments you've added in the code :) Good job

Not bad, huh?

No it's quite brilliant - thank you for contributing :)

I would still like to hear from @DamonHD w.r.t. implications for the 8-bit guys, but I think this change will be an improvement for those chips as well.

DamonHD · 2021-12-21T17:21:48Z

I'd say that whatever we have in our lib was what we found worked well enough*, but note that we adopted this code (the original is also checked in). I think that this is the relevant file from ours?

https://github.com/opentrv/OTAESGCM/blob/master/content/OTAESGCM/utility/OTAESGCM_OTAESGCM.cpp

Rgds

Damon

*Full packaging and encryption (or decode) of a ~63 byte frame on an ATmega328p running ~1MHz CPU is ~0.5s IIRC. We only needed to do one every couple of minutes in the most common case, and CPU was already only a tiny fraction of our energy budget, so we didn't fiddle further. Other systems' constraints will be different.

PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)

dmitrystu · 2021-12-22T01:13:56Z

IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.

kokke · 2021-12-22T08:50:39Z

IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.

and

PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)

That boat sailed long ago :P
Many microprocessors can't execute multiplication-instructions in constant time. On small ARM CPUs the execution time (number of cycles) for multiplication depends on how many bits are set in the operands/arguments.

From https://developer.arm.com/documentation/ddi0180/a/instruction-cycle-summary-and-interlocks/instruction-cycle-times/multiplier-cycle-counts

The number of cycles that a multiply instruction takes to complete depends on which instruction it is, and on the value of the multiplier-operand.

See https://www.bearssl.org/ctmul.html (look a bit down the page) for a more general discussion.

BUT I've read a lot on proof-of-concept timing attacks. To the best of my knowledge however, there haven't been (m)any practical attacks in the wild using this technique.

If your hardware/system mostly handles the crypto and not much else (e.g. imagine a MCU in a safe), then it could be vulnerable to having the timing and/or power measured.
If you do handle other stuff (e.g. field-bus communication etc.) it's my experience that you usually introduce enough jitter to make the jitter indistinguishable from white noise -- the timing differences drown in variability from other parts of the software.

It's a super valid point though! And on "real" hardware (e.g. x86 or ARM64) you should definitely use something else than this library to get constant time primitives.

kokke · 2021-12-22T09:11:51Z

Hmm, I just tried comparing the size of the code compiled using this optimization.

It looks like the binary size increases on AVR and stays the same on ARM/Thumb.

So on AVR it doesn't look like a good optimization w.r.t. size (I would need to benchmark to test w.r.t. speed) and for ARM it doesn't make any difference. I will have to look more into this.

For AVR:

/tmp/tiny-AES-c git:(master) 
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1157       0       0    1157     485 aes.o

/tmp/tiny-AES-c git:(master) 
$ cd ../tiny-AES-c-opt/

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix) 
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1193       0       0    1193     4a9 aes.o

For ARM/Thumb

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
    883       0       0     883     373 aes.o

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ cd ../cd tiny-AES-c

/tmp/tiny-AES-c git:(master)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
    883       0       0     883     373 aes.o

I tested with these versions of gcc

$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:10.3-2021.07-1) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

dmitrystu · 2021-12-22T13:29:57Z

It's so strange you got a different code size for CTR on AVR. CTR doesn't use decipher. Perhaps you forgot to rebase a feature branch onto the latest master. Here is what I got with the latest master and ECB.

$ arm-none-eabi-gcc --version
rm-none-eabi-gcc (GNU Arm Embedded Toolchain 10.3-2021.07) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
   1399       0       0    1399     577 aes.o

$ git checkout feat/optimize_inv_mix 
Switched to branch 'feat/optimize_inv_mix'
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
   1299       0       0    1299     513 aes.o

$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1593       0       0    1593     639 aes.o

$ git checkout feat/optimize_inv_mix 
Switched to branch 'feat/optimize_inv_mix'
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1471       0       0    1471     5bf aes.o

kokke · 2021-12-22T13:51:15Z

I meant to compare github.com/kokke/tiny-AES-c (branch=main) with github.com/dmitrystu/tiny-AES-c (branch=feat/optimize_inv_mix) but I might have messed up - I'll check when I get home from work ..

Thanks for checking up 👍

dmitrystu · 2021-12-22T14:44:42Z

FYI. This commit slightly reduces code size for AVR, because size_t on AVR is equal to uint16_t.

dmitrystu · 2021-12-22T15:09:33Z

BTW. Found another possible optimization for the xtime(x). This also guarantees constant execution time (no multiplication, nor branches).

static uint8_t xtime(uint8_t x)
{
  // extend MSB to other bits using arithmetical shift right
  int8_t mask = (int8_t)x >> 7;
  return ((x<<1) ^ ((uint8_t)mask & 0x1b));
}

AVR:

00000058 <xtime>:
  58:   98 2f           mov     r25, r24
  5a:   99 0f           add     r25, r25
  5c:   99 0b           sbc     r25, r25
  5e:   9b 71           andi    r25, 0x1B       ; 27
  60:   88 0f           add     r24, r24
  62:   89 27           eor     r24, r25
  64:   08 95           ret

cortex-m0:

00000034 <xtime>:
  34:   0003            movs    r3, r0
  36:   221b            movs    r2, #27
  38:   b240            sxtb    r0, r0
  3a:   11c0            asrs    r0, r0, #7
  3c:   4010            ands    r0, r2
  3e:   005b            lsls    r3, r3, #1
  40:   4058            eors    r0, r3
  42:   b2c0            uxtb    r0, r0
  44:   4770            bx      lr

It seems that the AVR compiler knows this template.

InvMixColumn() optimization

51013b3

Update feat/optimize_inv_mix from master

ea77d6f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

InvMixColumn() optimization #159

InvMixColumn() optimization #159

dmitrystu commented Mar 8, 2020

kokke commented Dec 20, 2021 •

edited

Loading

dmitrystu commented Dec 21, 2021 •

edited

Loading

kokke commented Dec 21, 2021

DamonHD commented Dec 21, 2021 •

edited

Loading

dmitrystu commented Dec 22, 2021

kokke commented Dec 22, 2021

kokke commented Dec 22, 2021

dmitrystu commented Dec 22, 2021

kokke commented Dec 22, 2021 •

edited

Loading

dmitrystu commented Dec 22, 2021

dmitrystu commented Dec 22, 2021 •

edited

Loading

InvMixColumn() optimization #159

Are you sure you want to change the base?

InvMixColumn() optimization #159

Conversation

dmitrystu commented Mar 8, 2020

kokke commented Dec 20, 2021 • edited Loading

dmitrystu commented Dec 21, 2021 • edited Loading

kokke commented Dec 21, 2021

DamonHD commented Dec 21, 2021 • edited Loading

dmitrystu commented Dec 22, 2021

kokke commented Dec 22, 2021

kokke commented Dec 22, 2021

dmitrystu commented Dec 22, 2021

kokke commented Dec 22, 2021 • edited Loading

dmitrystu commented Dec 22, 2021

dmitrystu commented Dec 22, 2021 • edited Loading

kokke commented Dec 20, 2021 •

edited

Loading

dmitrystu commented Dec 21, 2021 •

edited

Loading

DamonHD commented Dec 21, 2021 •

edited

Loading

kokke commented Dec 22, 2021 •

edited

Loading

dmitrystu commented Dec 22, 2021 •

edited

Loading