InvMixColumn() optimization #159
Hi @dmitrystu and thanks for contributing this optimization with great comments! Sorry for keeping you waiting so long. I think the original purpose of the multiply-as-a-function was for 8-bit devices, but I can't remember anymore. @DamonHD do you have any insights into the implications for the 8-bit crowd?
That's not a problem. I'll try to explain this commit to you.
Not bad, huh?
Thanks for the explanation - I was also able to follow the comments you've added in the code :) Good job
No, it's quite brilliant - thank you for contributing :) I would still like to hear from @DamonHD w.r.t. implications for the 8-bit guys, but I think this change will be an improvement for those chips as well.
I'd say that whatever we have in our lib was what we found worked well enough*, but note that we adopted this code (the original is also checked in).
I think that this is the relevant file from ours? https://github.com/opentrv/OTAESGCM/blob/master/content/OTAESGCM/utility/OTAESGCM_OTAESGCM.cpp
Rgds
Damon
*Full packaging and encryption (or decode) of a ~63 byte frame on an ATmega328p running ~1MHz CPU is ~0.5s IIRC. We only needed to do one every couple of minutes in the most common case, and CPU was already only a tiny fraction of our energy budget, so we didn't fiddle further. Other systems' constraints will be different.
PS. I had a thought: be careful of optimising the computation if it is data- or key-dependent, else you may leak bits through different run-times, power usage, etc. I undid a few such reflexive optimisations once I'd thought about this issue...
(Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)
IMO this code shouldn't be vulnerable to timing-based or power-consumption-based attacks because it contains no extra branches, just some precomputations.
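To make the branch concern concrete, here is a minimal sketch (not code from this PR or from the library) contrasting a GF(2^8) multiply that branches on its operands with one that runs a fixed eight iterations and selects with masks. The names gf_mul_branchy, gf_mul_masked and gf_mul_self_check are hypothetical. Avoiding branches in the C source removes one obvious source of data-dependent timing, but it is not a guarantee: the compiler may still emit branches, so the generated assembly would need checking on each target.

```c
#include <assert.h>
#include <stdint.h>

/* Branching version: both the loop count and the taken branches depend
 * on the operand bits, so run time can vary with the data. */
static uint8_t gf_mul_branchy(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        if (a & 0x80)
            a = (uint8_t)((a << 1) ^ 0x1b);
        else
            a = (uint8_t)(a << 1);
        b >>= 1;
    }
    return r;
}

/* Masked version: always eight iterations, no data-dependent branches
 * in the source. Each mask is 0x00 or 0xFF. */
static uint8_t gf_mul_masked(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t take  = (uint8_t)(0u - (b & 1u));   /* add a into r?    */
        uint8_t carry = (uint8_t)(0u - (a >> 7));   /* reduce after x2? */
        r ^= (uint8_t)(a & take);
        a = (uint8_t)((a << 1) ^ (carry & 0x1b));
        b >>= 1;
    }
    return r;
}

/* Exhaustive check that both versions agree on all 65536 input pairs. */
static void gf_mul_self_check(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            assert(gf_mul_masked((uint8_t)a, (uint8_t)b) ==
                   gf_mul_branchy((uint8_t)a, (uint8_t)b));
}
```

Calling gf_mul_self_check() once, for example from a test harness, confirms the two forms compute the same product; it says nothing about timing, which has to be measured on the target.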
and
That boat sailed long ago :P
See https://www.bearssl.org/ctmul.html (look a bit down the page) for a more general discussion.
BUT I've read a lot on proof-of-concept timing attacks. To the best of my knowledge, however, there haven't been (m)any practical attacks in the wild using this technique. If your hardware/system mostly handles the crypto and not much else (e.g. imagine an MCU in a safe), then it could be vulnerable to having the timing and/or power measured.
It's a super valid point though! And on "real" hardware (e.g. x86 or ARM64) you should definitely use something other than this library to get constant-time primitives.
Hmm, I just tried comparing the size of the code compiled using this optimization. It looks like the binary size increases on AVR and stays the same on ARM/Thumb. So on AVR it doesn't look like a good optimization w.r.t. size (I would need to benchmark to test w.r.t. speed) and for ARM it doesn't make any difference. I will have to look more into this.
For AVR:
For ARM/Thumb:
I tested with these versions of gcc:
It's strange that you got a different code size for CTR on AVR. CTR doesn't use decipher. Perhaps you forgot to rebase the feature branch onto the latest master. Here is what I got with the latest master and ECB.
I meant to compare github.com/kokke/tiny-AES-c (branch=main) with github.com/dmitrystu/tiny-AES-c (branch=feat/optimize_inv_mix) but I might have messed up - I'll check when I get home from work. Thanks for checking up 👍
FYI. This commit slightly reduces code size for AVR, because
BTW. Found another possible optimization for xtime():
static uint8_t xtime(uint8_t x)
{
    // extend the MSB to all other bits using an arithmetic shift right
    int8_t mask = (int8_t)x >> 7;
    return ((x<<1) ^ ((uint8_t)mask & 0x1b));
}
AVR:
cortex-m0:
It seems that the AVR compiler recognizes this pattern.
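A quick way to gain confidence in the mask-based xtime() is to compare it exhaustively against the more familiar form that multiplies the top bit by 0x1b. The sketch below does that for all 256 inputs. The names xtime_ref, xtime_mask, xtime_portable and xtime_mismatches are hypothetical. One caveat: converting a value above 127 to int8_t and right-shifting a negative signed value are both implementation-defined in C; GCC uses wrap-around and an arithmetic shift, which is what the mask trick relies on. xtime_portable builds the same 0x00/0xFF mask using only unsigned arithmetic, as an assumption-free alternative.

```c
#include <assert.h>
#include <stdint.h>

/* Familiar form: multiply the MSB (0 or 1) by the reduction constant. */
static uint8_t xtime_ref(uint8_t x)
{
    return (uint8_t)((x << 1) ^ (((x >> 7) & 1) * 0x1b));
}

/* Mask form from the comment above; relies on the signed shift being
 * arithmetic (implementation-defined, true for GCC). */
static uint8_t xtime_mask(uint8_t x)
{
    int8_t mask = (int8_t)x >> 7;            /* 0x00 or 0xFF (-1) */
    return (uint8_t)((x << 1) ^ ((uint8_t)mask & 0x1b));
}

/* Same idea with unsigned arithmetic only: 0 - 1 wraps to all ones. */
static uint8_t xtime_portable(uint8_t x)
{
    uint8_t mask = (uint8_t)(0u - (x >> 7)); /* 0x00 or 0xFF */
    return (uint8_t)((x << 1) ^ (mask & 0x1b));
}

/* Count the inputs on which any variant disagrees with the reference. */
static int xtime_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 256; v++) {
        uint8_t x = (uint8_t)v;
        if (xtime_mask(x) != xtime_ref(x) ||
            xtime_portable(x) != xtime_ref(x))
            bad++;
    }
    /* Sanity-check the reference itself on two known values. */
    assert(xtime_ref(0x80) == 0x1b);
    assert(xtime_ref(0x57) == 0xae);
    return bad;
}
```

If xtime_mismatches() returns 0 on a given toolchain, the three forms are interchangeable there, and the choice comes down to which one the compiler turns into the smallest or fastest code.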
This will optimize GF(2^8) math in InvMixColumns().
With arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release)
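For readers who have not seen the idea before, the following is a minimal sketch of how the GF(2^8) math in InvMixColumns() can be done with xtime() alone: compute x*2, x*4 and x*8 once per byte, then XOR the pieces that make up the constants 0x0e, 0x0b, 0x0d and 0x09. This illustrates the general technique and is not necessarily how this PR implements it; inv_mix_column and inv_mix_column_self_check are hypothetical names, and the function handles a single 4-byte column rather than the whole state.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply by x (i.e. by 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ (((x >> 7) & 1) * 0x1b));
}

/* Inverse MixColumns on one column, in place.
 * Each output byte i is
 *   0x0e*col[i] ^ 0x0b*col[i+1] ^ 0x0d*col[i+2] ^ 0x09*col[i+3]
 * with indices taken modulo 4. */
static void inv_mix_column(uint8_t col[4])
{
    uint8_t x1[4], x2[4], x4[4], x8[4];

    /* Precompute x*1, x*2, x*4, x*8 for every byte of the column. */
    for (int i = 0; i < 4; i++) {
        x1[i] = col[i];
        x2[i] = xtime(x1[i]);
        x4[i] = xtime(x2[i]);
        x8[i] = xtime(x4[i]);
    }

    for (int i = 0; i < 4; i++) {
        int b = (i + 1) & 3, c = (i + 2) & 3, d = (i + 3) & 3;
        col[i] = (uint8_t)((x8[i] ^ x4[i] ^ x2[i])   /* 0x0e = 8 ^ 4 ^ 2 */
                         ^ (x8[b] ^ x2[b] ^ x1[b])   /* 0x0b = 8 ^ 2 ^ 1 */
                         ^ (x8[c] ^ x4[c] ^ x1[c])   /* 0x0d = 8 ^ 4 ^ 1 */
                         ^ (x8[d] ^ x1[d]));         /* 0x09 = 8 ^ 1     */
    }
}

/* Check against a commonly quoted MixColumns test column:
 * MixColumns(db 13 53 45) = 8e 4d a1 bc, so the inverse must undo it. */
static void inv_mix_column_self_check(void)
{
    uint8_t col[4] = { 0x8e, 0x4d, 0xa1, 0xbc };
    const uint8_t expect[4] = { 0xdb, 0x13, 0x53, 0x45 };
    inv_mix_column(col);
    assert(memcmp(col, expect, sizeof col) == 0);
}
```

This form needs twelve xtime() calls per column and no general multiply routine, which is the kind of trade-off the size comparisons above are measuring; whether it wins on a given target still has to be measured.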