Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize signature aggregation #1753

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

jagerman
Copy link
Member

@jagerman jagerman commented Nov 7, 2024

Currently when an RPC node does a network fanout to get a reward or exit
signature, it verifies each signature as it returns before adding it to
the aggregate.

This ends up being quite slow: on my desktop machine on current
stagenet, in a release build, with 240 stagenet nodeus nodes, just these
verifications calls taking a bit more than a full second in total (while
the aggregation itself takes only a tenth of a second). While still
reasonable on stagenet, on mainnet it's going to be 10s of CPU time per
signature request, which is too much.

This commit rewrites it to speed it up considerably in the normal (i.e.
no bad signature) case:

  • the aggregator aggregates returning pubkeys and signatures without
    individual verification.
  • once we have all results, we then perform a single verification of the
    aggregated signature against the aggregated pubkey. If it succeeds
    there's no other needed checks.
  • if that check fails then we fall back to doing a one-by-one
    verification on all the individual signatures, removing them from the
    aggregates if we find any failures.

In the normal case (where we don't get any failing signatures) this
speeds up aggregate processing by more than 10x by only needing one
signature verification.

A nice side effect of this is that because we always know the aggregate
pubkey now, we can include that in debug logs (previously it was only
available in debug logs in debug builds), and in the RPC result.

(This builds on top of #1751)

This adds a generic interface for making asychronous RPC request
handlers, and applies these to the reward and exit signature endpoints.

The RPC interface adds an alternative the `invoke(...)` method to
optionally take a new, third argument of a `shared_ptr<response>`: when
such a version of the invoke method is present then response is not send
until the destruction of the shared_ptr, and so an asychronous request
uses these, keeps the shared_ptr alive until the response is available,
then lets it destruct which then triggers the response.

This then pushes the async approach through the reward and exit
signature requests, and gets everything working through asychronous
callbacks rather than blocking requests.

This also makes some small changes to signature handling:

- Allow exit/liquidation RPC requests to be made by either SN pubkey (as
  currently) or BLS pubkey (new).

- Allow liquidation of oxend-non-existent BLS pubkeys (i.e. liquidating
  a BLS pubkey that doesn't match any SNs oxen knows of); without this
  it isn't possible to remove a "bad" contract node that oxend didn't
  accept the registration of for whatever reason.

- Add code to remove unwanted node extra signatures from produced
  signatures.  Previously our concept of "non-signers" only included IDs
  to pass to the contract, but there is also a reverse failure were we
  collect a signature from a SN that isn't in the contract anymore (for
  example: a recently removed node with an incoming, but unconfirmed,
  exit event). This amends the signing code to detect any such signers
  and subtract the signatures of any such SNs from the aggregate before
  returning it.

- Removes the walk-the-snode-linked-list contract handling code as it is
  not used anymore.
There are (currently) two places invoking the callback, one of which
wasn't catching exceptions; this genericizes the callback invocation to
fix it, and adds a logging try/catch around the final_callback
invocation as well.
Currently when an RPC node does a network fanout to get a reward or exit
signature, it verifies each signature as it returns before adding it to
the aggregate.

This ends up being quite slow: on my desktop machine on current
stagenet, in a release build, with 240 stagenet nodeus nodes, just these
verifications calls taking a bit more than a full second in total (while
the aggregation itself takes only a tenth of a second).  While still
reasonable on stagenet, on mainnet it's going to be 10s of CPU time per
signature request, which is too much.

This commit rewrites it to speed it up considerably in the normal (i.e.
no bad signature) case:
- the aggregator aggregates returning pubkeys and signatures without
  individual verification.
- once we have all results, we then perform a single verification of the
  aggregated signature against the aggregated pubkey.  If it succeeds
  there's no other needed checks.
- if that check *fails* then we fall back to doing a one-by-one
  verification on all the individual signatures, removing them from the
  aggregates if we find any failures.

In the normal case (where we don't get any failing signatures) this
speeds up aggregate processing by more than 10x by only needing one
signature verification.

A nice side effect of this is that because we always know the aggregate
pubkey now, we can include that in debug logs (previously it was only
available in debug logs *in debug builds*), and in the RPC result.
Switches the aggregate signature caches to a shared_ptr to make the
value thread-safe for storage and use across different threads when
combined with the need to capture the data across async requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant