Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double exit payments on snl detach #1760

Open
wants to merge 31 commits into
base: dev
Choose a base branch
from

Conversation

Doy-lee
Copy link
Collaborator

@Doy-lee Doy-lee commented Nov 21, 2024

The general premise here is to ensure that blockchain detach, detaches the SNL and SQL DB to the exact same height. This ensures that the systems rewind back and resync from the same starting point. Then, if the act of syncing the SNL affects the SQL DB like inserting new rows, this is handled correctly as the SQL DB is also at the exact same height as the SNL as to when the row was produced and any logic in the SQL DB that relies on that information is handled correctly.

The best way to do this is to just embed the SQL DB into the SNL and lock-step them together because they are inherently linked to each other (if you add a block to the SNL and not the SQL DB then the ledger becomes incorrect). This is a hard requirement in the sequence of API calls you need to do when a block is added for correctness. That'd be a very large and disruptive change that I've not done. More reasonably I've deleted some public APIs on the SQL DB and merged them into the SNL to ensure the SQL is updated atomically with the SNL.

This leads to the following changes in the PR:

  • SQL pop_blocks is deleted because the SNL does not support pop_blocks. However both SNL and SQL support blockchain detach by storing state at particular checkpoints so that is the method they use to handle pop blocks.
  • The SQL and SNL backup exactly the same number of blocks for handling pop_blocks and reorgs. They use the shared variables from network_config.h
    • HISTORY_ARCHIVE_INTERVAL: The number of blocks between that we store a backup of the state, set to 10k blocks. This is the same value as we had before. These are stored for a rolling window of HISTORY_ARCHIVE_KEEP_WINDOW blocks. Archived state is for big re-orgs (typically invoked by using pop blocks manually).
    • HISTORY_ARCHIVE_KEEP_WINDOW: The rolling window of how many archives we keep. This is set to 2 years. Before this PR we kept them in perpetuity. Realistically this is not necessary and I've added this to bound the size of this over time.
    • HISTORY_RECENT_KEEP_WINDOW: How many blocks worth of state from the tip should be kept. This is for short re-orgs. This was previously a local variable in the SNL. This was lifted to be global so that it can be shared by the SQL DB.

Now both SNL and SQL DB use these variables to determine when they take a snapshot of their current state and in doing so both store the exact same number of backups that they can then detach to when requested and hence resync from the same starting point. The calculation and backup of state is done in the SNL and SQL individually which is ripe for bugs as the SNL has to backup using the binary serialisation system and the SQL DB using SQL queries. To guard against bugs and in this and futurebugs I've added a bunch of sanity checks and asserts in various places to try and catch and recover if they go out of sync.

  • When we commit rows from the SNL into the SQL DB (processing exit stakes) the row has a unique clause on the (block_height, tx_index, contributor_index) which outright prevents the SQL from inserting payment for a node twice if it somehow was made to process a block twice. This blocks the issue we're encountering now on stagenet where a rescan was re-processing exits into the SQL DB.
  • Asserts to check that on rescan and block add, ONS, SQL and SNL are at the same height.
  • On SNL detach, before detaching to the desired height it checks that the SNL and SQL DB both have backups of the height to detach to. If it doesn't, it'll walk back to find the height that they can rewind to before proceeding. If they can't agree then they both resync from scratch. This is also run on startup (we call exec_detach_hooks on Blockchain::init) which means that if at any point the daemon crashes, systems go out of sync, a restart should detect this and get them to resync so it's fairly more resilient over time to bugs.

Part of this change required was figuring out how the old code worked, what it did, e.t.c and so bug fixes and improvements were made in the pre-existing process that I believe make it more robust and understandable.

  • Fixed the backups of recent SNL state only storing quorums. We were calculating
uint64_t const max_short_term_height =
            short_term_state_cull_height(hf_version, (m_state.height - 1)) + VOTE_LIFETIME +
            VOTE_OR_TX_VERIFY_HEIGHT_BUFFER;

Which was offsetting the height, and then again in the function offsetting the height argument to something OOB which would trigger the serialiser to only store quorums. That's fixed now having reviewed the serialisation routines. Removed most of the branches and the scary height calculations and made it just read top-to-bottom with explanations of each step.

  • Removed various conversions back and forth from uint64_t -> int64_t in the oxen rescanning loop.
  • Removed a bunch of SQL and SNL height checks at the start of the oxen rescanning loop. SNL and SQL accept heights below the hardfork that they're activated which means you can check that the heights match between the two and avoid having logic to clamp heights to the lowest common denominator if they are below the hardforks e.t.c
  • Add loading metrics to the rescan which reveals that we're rescanning at an abysmal speed of less than 1 MiB/s. Room for improvement down the line now that the metrics are available.
  • Remove STATE_CHANGE_TX_LIFETIME_IN_BLOCKS with VOTE_LIFETIME, they are the same value but also mean the same thing.
  • Fix the filter_and_sort sorting by the contents of pubkey_and_info instead of just pubkey

- Store hardfork heights at the top of the function to reduce line
length and minimise the amount of line-breaking enforced by the
formatter. This breaks up the text-heavy noise in the height enumeration step.

- Separate the step where we calculate the minimum height of all the
subsystems which is the height to start loading blocks from to after
enumerating all the heights. This makes it very clear what heights we
need to run the min() function on and find the earliest height.

- Remove the vector accumulating heights. It's clearer now what heights to min()
now that the enumerating height and processing the height is separated.

- Avoid signed int math. Unsigned works fine by relying on well-defined
wrapping behaviour which removes the noisy static_casts in the main loop.

- Remove early exit checking on there being no blocks to check. The
bottom of the function had the exact same check. The loop is well
behaved for (total_blocks == 0) so we can just let the function proceed
rather than add another branch.
In practice this never became a problem because our list is a unique set
of Ed25519 keys but for correctness we correctly compare _just_ the
Ed25519 contents for sorting.
We do a dupe insertion check against the {block,tx,contrib} index which
is guaranteed to be unique. Note we do _not_ check/store block hash
because our threat model here is that the block hash is valid by the
time this part of the code is met (e.g. it passed all of blockchain and
SNL validation essentially).

To do this we have to pipe through some new information from the SNL
when it calls into SQL DB to insert a delayed payment (i.e. return of
stake). This makes it so that essentially there is 1 row for each
returned stake as they are uniquely identified by the {block,tx,contrib}
index.
…_ARCHIVE_INTERVAL/HISTORY_KEEP_RECENT_WINDOW
Incremental steps to getting the the SQL and SNL to handle
blockchain_detached in the same manner. They both need to detach to
exactly the same height so they must also store backups to the same
degree on both sides so that one subsystem doesn't desync on detach.

If they desync, we risk the 2 systems producing inconsistent rewards
which breaks BLS aggregation.
- Short term serialisation was broken because the height that was
calculated as the cutoff height to serialise was incorrectly adding a
offset to the current height before being subtracted again such that all
the SNL states that were serialised only serialised their quorums.

This means that every time we reorg, 1 or 2 blocks, this triggers a full
rescan from the last 10k-th block interval. This was hidden in the code
because the logic was hard to read and modify.

- The serialisation routine for some reason was also trying to serialise
the quorums at a cut-off height, but, when you process a block into the
SNL you _already_ cull history that is too old. Hence when you go to
serialise, the list of states has already been pruned, there's no need
to have logic to prune it again. All we have to do is serialise it to
disk. That simplifies the serialisation routine which unveiled the bug
in the above bulletpoint.

- Remove tricky logic that migrated the short-term window into the
archive window. This had a bunch of checks embedded in a loop with some
arithmetic to calculate some offsets from the height. This is not
necessary if we just serialise the state on block add by determining if
the current state falls under the criteria to be serialised into the
archive or not. This simplifies the loop, de-nests the algorithm and
hence reduces the cognitive load required to grok what's happening.

- Fix an issue where SNL::load() calls reset(false). Reset was clearing
out the SQL DB causing a full-rescan on startup every time even if the
SNL and SQL DB's were in sync.
Unifies the amount of archive/recent state kept by the SQL and SNL DB.
This means that when the systems detach, they are able to detach to the
same height as each other and then able to resync the chain from the
same starting point.

Syncing from the same starting point is paramount to ensuring that the
same results are derived as loading blocks into the SNL has side effects
on the SQL DB.
This ensures that the first SNL state we store in our recent history
matches the first SQL state that the SQL DB stores. This means a slight
adjustment in the SNL serialisation code in that it has to serialise
itself, 'm_state' in tandem with 'm_transient.state_history'.
The wrapper allows us to check and assert if the detach popped us to the
right block for testing and verifying that the detach is working.

Small rewrite of the detach logic in SNL to read top-to-bottom with
minimal branching or early returns.
We detach to a height that is supported by both SNL and SQL by checking
the backups available on both before proceeding. This ensures in
exceptional cases if the SQL side is missing a backup, the SNL doesn't
go ahead and rewind itself. This would cause a desync that can only be
remedied by noticing it, and then restarting the daemon.

Instead the 2 double check what is a safe height they both support and
do that patching up that particular avenue of breakage.
Seems to run fine without the FAKECHAIN check. I checked the unit tests
and they use a real SQL DB instance so there should be no issue _not_
gating the code from executing the code.
Otherwise we store the latest height twice, instead of once
This matches behaviour with the other detach calls that detach after all
blocks have been popped. This gets rid of the detach function in ONS
which just called prune_db which we do so directly in the detach hook.
@Doy-lee Doy-lee force-pushed the doyle-fix-double-exit-payments-on-snl-detach branch from 334adf7 to f8a6cdb Compare November 21, 2024 22:30
@Doy-lee Doy-lee changed the title Doyle fix double exit payments on snl detach Fix double exit payments on snl detach Nov 21, 2024
…s problem

The devnet version of the script deploys a contract that inherits the
mainnet ServiceNodeRewards contract but instead overrides the minimum
exit time. The contract currently rejects exits that occur within the
first 2hrs of the node registering.

We don't have an easy way to get the nodes to fast-fwd their clocks by
2hrs to sign a signature a time in the future.
We don't support popping a block from the SQL DB anymore, only detach to
a recent height which is supported by the SNL, ONS and SQL DB. This is
required for deterministic behaviour on rewind. Since that's the case we
can remove the subtract code from the DB and simplify.
This function always returns true meaning downstream checks on this call
always succeed. This function can fail at the SQL insertion step.
Predominantly, the pattern in the SQL DB is to use exceptions as the
error mechanism and the one function that does call `reward_handler` is
wrapped in a try-catch block.

I've opted to remove the boolean return and instead annotate that the
function throws. This simplifies the calling code in that it no longer
has to do a success check (which was redundant since true was always
returned) and instead optimistically executes and defers to the
try-catch to handle errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant