internal/field: Think about using fiat-crypto #78

Yawning · 2021-07-28T07:11:42Z

fiat-crypto has neat formally verified/auto-generated field arithmetic that can be used to replace the fiddly bits of the internal/field package. It would be nice to be able to use it, since it would let me feel less scared of exposing the package, and it's a nice checkbox to have.

Actually doing the switch requires the fiat-crypto performance to not be horrifically bad, since the current code works (and has been reviewed externally for correctness). Benchmarks from Novi's curve25519-dalek fork (that has been upstreamed) documentation indicate that a 8-15% performance degradation is to be expected, but the actual observed results are significantly worse.

Write an initial naive experimental branch to get ballpark benchmarks.
Be really really really sad about performance.
Try to figure out why performance is, significantly worse, when the rust code is just "slightly worse".
- CarryMul is ridiculously slow because of addcarryxU64
- CarrySquare is ridiculously slow because of addcarryxU64
- Go's shit inliner strikes again. The rust code #[inline]s fiat_25519_carry_mul among other things. Carry is over the inliner budget.
Try to reduce the performance penalty further
- Switch CarrySquare over to CarryPow2k
- Add Add/Sub/Opp + Carry
- Add Add/Sub + CarryMul/CarrySquare
- Fix X25519 performance
Benchmark on 32-bit ARM or something.

Current status (2021/08/09)

The upstream fiat-crypto developers were kind enough to fold in some performance related changes (CarryMul/CarrySquare performance fixed, Add/Sub/Opp + Carry added).

My branch still adds CarryPow2k and Add/Sub + CarryMul/CarrySquare, which I suppose I can ask about.

Since the branch is close enough to what I envision a switch would look like, these are the current rough benchmarks:

With AVX2 used where it exists:

name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    103µs ± 0%     109µs ± 0%   +5.67%
VerifyBatchOnly/2-8                    146µs ± 0%     157µs ± 0%   +7.56%
VerifyBatchOnly/4-8                    232µs ± 0%     252µs ± 0%   +8.59%
VerifyBatchOnly/8-8                    402µs ± 0%     444µs ± 0%  +10.49%
VerifyBatchOnly/16-8                   742µs ± 0%     826µs ± 0%  +11.31%
VerifyBatchOnly/32-8                  1.43ms ± 0%    1.58ms ± 0%  +10.91%
VerifyBatchOnly/64-8                  2.76ms ± 0%    3.11ms ± 0%  +12.55%
VerifyBatchOnly/128-8                 5.24ms ± 0%    5.95ms ± 0%  +13.61%
VerifyBatchOnly/256-8                 9.52ms ± 0%   10.91ms ± 0%  +14.57%
VerifyBatchOnly/384-8                 13.5ms ± 0%    15.6ms ± 0%  +15.35%
VerifyBatchOnly/512-8                 17.7ms ± 0%    20.4ms ± 0%  +15.28%
VerifyBatchOnly/768-8                 25.4ms ± 0%    29.6ms ± 0%  +16.32%
VerifyBatchOnly/1024-8                32.8ms ± 0%    38.4ms ± 0%  +17.11%
GenerateKey/voi-8                     25.5µs ± 0%    27.9µs ± 0%   +9.33%
GenerateKey/stdlib-8                  43.3µs ± 0%    43.1µs ± 0%   -0.41%
NewKeyFromSeed/voi-8                  25.3µs ± 0%    27.7µs ± 0%   +9.43%
NewKeyFromSeed/stdlib-8               42.7µs ± 0%    42.8µs ± 0%   +0.07%
Signing/voi-8                         28.0µs ± 0%    30.3µs ± 0%   +8.21%
Signing/stdlib-8                      52.2µs ± 0%    52.2µs ± 0%   -0.04%
Verification/voi-8                    72.9µs ± 0%    79.4µs ± 0%   +8.94%
Verification/voi_stdlib-8             83.3µs ± 0%    90.9µs ± 0%   +9.13%
Verification/stdlib-8                  124µs ± 0%     124µs ± 0%   +0.15%
Expanded/NewExpandedPublicKey-8       11.4µs ± 0%    14.2µs ± 0%  +24.98%
Expanded/Verification/voi-8           62.6µs ± 0%    65.5µs ± 0%   +4.65%
Expanded/Verification/voi_stdlib-8    74.9µs ± 0%    78.9µs ± 0%   +5.39%
Expanded/VerifyBatchOnly/1-8          91.9µs ± 0%    96.7µs ± 0%   +5.18%
Expanded/VerifyBatchOnly/2-8           123µs ± 0%     128µs ± 0%   +4.02%
Expanded/VerifyBatchOnly/4-8           184µs ± 0%     197µs ± 0%   +6.76%
Expanded/VerifyBatchOnly/8-8           308µs ± 0%     331µs ± 0%   +7.54%
Expanded/VerifyBatchOnly/16-8          556µs ± 0%     597µs ± 0%   +7.37%
Expanded/VerifyBatchOnly/32-8         1.05ms ± 0%    1.13ms ± 0%   +7.70%
Expanded/VerifyBatchOnly/64-8         2.02ms ± 0%    2.20ms ± 0%   +8.56%
Expanded/VerifyBatchOnly/128-8        3.87ms ± 0%    4.24ms ± 0%   +9.47%
Expanded/VerifyBatchOnly/256-8        7.03ms ± 0%    7.77ms ± 0%  +10.38%
Expanded/VerifyBatchOnly/384-8        10.0ms ± 0%    11.1ms ± 0%  +10.93%
Expanded/VerifyBatchOnly/512-8        13.1ms ± 0%    14.5ms ± 0%  +11.00%
Expanded/VerifyBatchOnly/768-8        18.6ms ± 0%    20.9ms ± 0%  +12.10%
Expanded/VerifyBatchOnly/1024-8       24.0ms ± 0%    26.9ms ± 0%  +11.68%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        24.5µs ± 0%    26.9µs ± 0%   +9.54%
ScalarMult/voi-8            80.2µs ± 0%   113.0µs ± 0%  +40.98%

Note: The massive regression for X25519 ScalarMult is due to the removal of assembly. Sufficiently recent x/crypto/curve25519 does away with it as well, and clocks in at ~107us/op (~161us/op purego).

purego:

name                                old time/op    new time/op    delta
VerifyBatchOnly/1-8                    164µs ± 0%     179µs ± 0%   +9.15%
VerifyBatchOnly/2-8                    238µs ± 0%     247µs ± 0%   +3.79%
VerifyBatchOnly/4-8                    351µs ± 0%     383µs ± 0%   +9.18%
VerifyBatchOnly/8-8                    603µs ± 0%     657µs ± 0%   +9.03%
VerifyBatchOnly/16-8                  1.10ms ± 0%    1.21ms ± 0%   +9.29%
VerifyBatchOnly/32-8                  2.09ms ± 0%    2.29ms ± 0%   +9.60%
VerifyBatchOnly/64-8                  4.09ms ± 0%    4.48ms ± 0%   +9.36%
VerifyBatchOnly/128-8                 8.04ms ± 0%    8.93ms ± 0%  +11.05%
VerifyBatchOnly/256-8                 14.3ms ± 0%    15.9ms ± 0%  +11.09%
VerifyBatchOnly/384-8                 20.2ms ± 0%    22.5ms ± 0%  +11.43%
VerifyBatchOnly/512-8                 26.3ms ± 0%    29.3ms ± 0%  +11.46%
VerifyBatchOnly/768-8                 37.3ms ± 0%    41.4ms ± 0%  +11.10%
VerifyBatchOnly/1024-8                48.1ms ± 0%    53.4ms ± 0%  +11.20%
GenerateKey/voi-8                     48.8µs ± 0%    52.6µs ± 0%   +7.80%
GenerateKey/stdlib-8                  58.3µs ± 0%    58.2µs ± 0%   -0.02%
NewKeyFromSeed/voi-8                  48.5µs ± 0%    52.8µs ± 0%   +8.84%
NewKeyFromSeed/stdlib-8               58.2µs ± 0%    58.1µs ± 0%   -0.12%
Signing/voi-8                         51.3µs ± 0%    55.0µs ± 0%   +7.20%
Signing/stdlib-8                      72.4µs ± 0%    72.2µs ± 0%   -0.27%
Verification/voi-8                     108µs ± 0%     117µs ± 0%   +8.55%
Verification/voi_stdlib-8              131µs ± 0%     145µs ± 0%  +10.17%
Verification/stdlib-8                  181µs ± 0%     181µs ± 0%   -0.22%
Expanded/NewExpandedPublicKey-8       14.5µs ± 0%    16.0µs ± 0%  +10.85%
Expanded/Verification/voi-8           93.4µs ± 0%   101.9µs ± 0%   +9.16%
Expanded/Verification/voi_stdlib-8     119µs ± 0%     132µs ± 0%  +11.36%
Expanded/VerifyBatchOnly/1-8           149µs ± 0%     163µs ± 0%   +9.48%
Expanded/VerifyBatchOnly/2-8           196µs ± 0%     215µs ± 0%   +9.80%
Expanded/VerifyBatchOnly/4-8           293µs ± 0%     319µs ± 0%   +9.03%
Expanded/VerifyBatchOnly/8-8           484µs ± 0%     530µs ± 0%   +9.44%
Expanded/VerifyBatchOnly/16-8          868µs ± 0%     943µs ± 0%   +8.67%
Expanded/VerifyBatchOnly/32-8         1.62ms ± 0%    1.77ms ± 0%   +9.44%
Expanded/VerifyBatchOnly/64-8         3.13ms ± 0%    3.43ms ± 0%   +9.75%
Expanded/VerifyBatchOnly/128-8        6.33ms ± 0%    7.00ms ± 0%  +10.72%
Expanded/VerifyBatchOnly/256-8        11.3ms ± 0%    12.6ms ± 0%  +11.41%
Expanded/VerifyBatchOnly/384-8        15.9ms ± 0%    17.7ms ± 0%  +11.16%
Expanded/VerifyBatchOnly/512-8        20.7ms ± 0%    23.0ms ± 0%  +11.17%
Expanded/VerifyBatchOnly/768-8        29.1ms ± 0%    32.4ms ± 0%  +11.34%
Expanded/VerifyBatchOnly/1024-8       37.4ms ± 0%    41.5ms ± 0%  +10.98%

name                      old time/op    new time/op    delta
ScalarBaseMult/voi-8        47.8µs ± 0%    51.7µs ± 0%  +8.18%
ScalarMult/voi-8             118µs ± 0%     113µs ± 0%  -4.05%

This is getting to "an acceptable slowdown" if people think that using the fiat code is better over the code that came from dalek, but the issue of having to ship modified routines from the 64/curve25519 implementation remains.

The text was updated successfully, but these errors were encountered:

Yawning mentioned this issue Jul 28, 2021

DO NOT MERGE: internal/field fiat-crypto playground #80

Draft

4 tasks

Yawning added the enhancement New feature or request label Jul 29, 2021

Yawning changed the title ~~internal/field: Keep an eye on fiat-crypto~~ internal/field: Think about using fiat-crypto Jul 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internal/field: Think about using fiat-crypto #78

internal/field: Think about using fiat-crypto #78

Yawning commented Jul 28, 2021 •

edited

Loading

internal/field: Think about using fiat-crypto #78

internal/field: Think about using fiat-crypto #78

Comments

Yawning commented Jul 28, 2021 • edited Loading

Current status (2021/08/09)

Yawning commented Jul 28, 2021 •

edited

Loading