Fix double exit payments on snl detach #1760

Doy-lee · 2024-11-21T07:21:30Z

The general premise here is to ensure that a blockchain detach rewinds the SNL and the SQL DB to exactly the same height, so that both systems resync from the same starting point. Then, if syncing the SNL has side effects on the SQL DB (such as inserting new rows), they are handled correctly: the SQL DB is at the same height as the SNL when the row is produced, so any logic in the SQL DB that relies on that height behaves as intended.

The best way to do this would be to embed the SQL DB into the SNL and lock-step them together, because they are inherently linked to each other (if you add a block to the SNL and not the SQL DB then the ledger becomes incorrect). Keeping them in step is currently a hard requirement on the sequence of API calls made when a block is added. Embedding would be a very large and disruptive change that I've not done. More reasonably, I've deleted some public APIs on the SQL DB and merged them into the SNL to ensure the SQL DB is updated atomically with the SNL.

This leads to the following changes in the PR:

- SQL pop_blocks is deleted because the SNL does not support pop_blocks. However, both the SNL and SQL DB support blockchain detach by storing state at particular checkpoints, so that is the method they use to handle popped blocks.
- The SQL DB and SNL back up exactly the same number of blocks for handling pop_blocks and reorgs. They use the shared variables from network_config.h:
  - HISTORY_ARCHIVE_INTERVAL: The number of blocks between each stored backup of the state, set to 10k blocks. This is the same value as we had before. Backups are stored for a rolling window of HISTORY_ARCHIVE_KEEP_WINDOW blocks. Archived state is for big re-orgs (typically invoked by using pop blocks manually).
  - HISTORY_ARCHIVE_KEEP_WINDOW: The rolling window of how many archives we keep. This is set to 2 years. Before this PR we kept them in perpetuity. Realistically this is not necessary, and I've added this to bound their size over time.
  - HISTORY_RECENT_KEEP_WINDOW: How many blocks' worth of state from the tip should be kept. This is for short re-orgs. This was previously a local variable in the SNL and was lifted to be global so that it can be shared by the SQL DB.

Now both the SNL and SQL DB use these variables to determine when they take a snapshot of their current state. In doing so, both store exactly the same number of backups that they can detach to when requested, and hence resync from the same starting point. The calculation and backup of state is done in the SNL and SQL DB individually, which is ripe for bugs as the SNL has to back up using the binary serialisation system and the SQL DB using SQL queries. To guard against bugs, both in this change and in the future, I've added a bunch of sanity checks and asserts in various places to try to catch and recover if they go out of sync.
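As a rough illustration of the shared snapshot rule, here is a minimal sketch of how one predicate could be used by both subsystems so they keep identical heights. Only the 10k interval comes from the description above. The numeric values of the two windows, the `snapshot_kind` enum and `classify_snapshot` are assumptions made for this example and are not taken from network_config.h.

```cpp
#include <cstdint>

// Illustrative values only. The 10k interval is from the PR description; the
// two windows below are assumptions (2 years at an assumed 2-minute block time).
constexpr uint64_t HISTORY_ARCHIVE_INTERVAL    = 10'000;
constexpr uint64_t HISTORY_ARCHIVE_KEEP_WINDOW = 525'600;
constexpr uint64_t HISTORY_RECENT_KEEP_WINDOW  = 360;

// Hypothetical classification of a stored state at `height` given the chain tip.
enum class snapshot_kind { none, recent, archive };

constexpr snapshot_kind classify_snapshot(uint64_t height, uint64_t tip) {
    if (height > tip)
        return snapshot_kind::none;  // state from the future is never kept
    uint64_t const age = tip - height;
    if (age < HISTORY_RECENT_KEEP_WINDOW)
        return snapshot_kind::recent;  // short re-orgs
    if (height % HISTORY_ARCHIVE_INTERVAL == 0 && age < HISTORY_ARCHIVE_KEEP_WINDOW)
        return snapshot_kind::archive;  // big re-orgs, bounded rolling window
    return snapshot_kind::none;  // culled
}
```

If both the SNL and the SQL DB evaluate the same predicate with the same constants, the set of heights they can detach to should be identical by construction.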

- When we commit rows from the SNL into the SQL DB (processing exit stakes), the row has a unique clause on (block_height, tx_index, contributor_index), which outright prevents the SQL DB from inserting a payment for a node twice if it is somehow made to process a block twice. This blocks the issue we're encountering now on stagenet, where a rescan was re-processing exits into the SQL DB.
- Added asserts to check that, on rescan and block add, the ONS, SQL DB and SNL are at the same height.
- On SNL detach, before detaching to the desired height, it checks that the SNL and SQL DB both have backups of the height to detach to. If they don't, it walks back to find a height that both can rewind to before proceeding. If they can't agree, they both resync from scratch. This also runs on startup (we call exec_detach_hooks in Blockchain::init), which means that if at any point the daemon crashes or the systems go out of sync, a restart should detect this and get them to resync, so it should be more resilient to bugs over time.
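The walk-back described in the last point can be sketched as follows. This is a simplified model, not the code in the PR: the backup heights are represented as plain sets, and `common_detach_height` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <optional>
#include <set>

// Hypothetical helper: find the highest height at or below `desired` for which
// both subsystems hold a backup. Returns std::nullopt when they cannot agree,
// in which case the caller would resync both from scratch.
std::optional<uint64_t> common_detach_height(
        const std::set<uint64_t>& snl_backups,
        const std::set<uint64_t>& sql_backups,
        uint64_t desired) {
    // Start just past the desired height and walk backwards through the SNL backups.
    auto it = snl_backups.upper_bound(desired);
    while (it != snl_backups.begin()) {
        --it;
        if (sql_backups.count(*it))
            return *it;  // both sides can rewind to this height
    }
    return std::nullopt;
}
```

For example, with SNL backups at {100, 200, 300} and SQL backups at {100, 300}, a request to detach to 250 would resolve to 100, because 200 is missing on the SQL side.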

Part of this change required figuring out how the old code worked and what it did, so bug fixes and improvements were also made to the pre-existing process that I believe make it more robust and understandable.

Fixed the backups of recent SNL state only storing quorums. We were calculating

```cpp
uint64_t const max_short_term_height =
        short_term_state_cull_height(hf_version, (m_state.height - 1)) + VOTE_LIFETIME +
        VOTE_OR_TX_VERIFY_HEIGHT_BUFFER;
```

This offset the height once here, and the function it was passed to then offset the height argument again, producing an out-of-bounds value that caused the serialiser to store only quorums. That's fixed now, having reviewed the serialisation routines. I removed most of the branches and the scary height calculations and made it read top-to-bottom with explanations of each step.

- Removed various conversions back and forth between uint64_t and int64_t in the oxen rescanning loop.
- Removed a bunch of SQL and SNL height checks at the start of the oxen rescanning loop. The SNL and SQL DB accept heights below the hardfork at which they're activated, which means you can check that the heights match between the two and avoid having logic to clamp heights to the lowest common denominator when they are below the hardforks.
- Added loading metrics to the rescan, which reveal that we're rescanning at an abysmal speed of less than 1 MiB/s. There's room for improvement down the line now that the metrics are available.
- Replaced STATE_CHANGE_TX_LIFETIME_IN_BLOCKS with VOTE_LIFETIME; they have the same value and also mean the same thing.
- Fixed filter_and_sort sorting by the contents of pubkey_and_info instead of just the pubkey.
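To illustrate the sorting fix in the last point, here is a minimal sketch that assumes pubkey_and_info can be modelled as a pair of a key and a shared pointer to node info. The type definitions and `sort_by_pubkey` are stand-ins for this example, not the real declarations. With such a pair, the default `operator<` compares the key first and then falls back to comparing the `shared_ptr` members, so an explicit comparator keeps the ordering dependent on the key alone.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in types for illustration only.
using pubkey = std::array<uint8_t, 32>;
struct node_info {
    uint64_t staking_requirement;
};
using pubkey_and_info = std::pair<pubkey, std::shared_ptr<const node_info>>;

// Sort by the key only; the info pointer never takes part in the comparison.
void sort_by_pubkey(std::vector<pubkey_and_info>& nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const pubkey_and_info& a, const pubkey_and_info& b) {
                  return a.first < b.first;
              });
}
```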

- Store hardfork heights at the top of the function to reduce line length and minimise the amount of line-breaking enforced by the formatter. This breaks up the text-heavy noise in the height enumeration step.
- Separate the step where we calculate the minimum height of all the subsystems, which is the height to start loading blocks from, to after enumerating all the heights. This makes it very clear what heights we need to run the min() function on to find the earliest height.
- Remove the vector accumulating heights. It's clearer what heights to min() now that enumerating the heights and processing the height are separated.
- Avoid signed int math. Unsigned works fine by relying on well-defined wrapping behaviour, which removes the noisy static_casts in the main loop.
- Remove the early exit that checked for there being no blocks to check. The bottom of the function had the exact same check. The loop is well behaved for (total_blocks == 0), so we can just let the function proceed rather than add another branch.
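The shape of the rescan loop after these changes might look roughly like the sketch below. It is a simplified model under stated assumptions: `subsystem_heights`, `rescan_start_height` and `rescan` are hypothetical names, and the loop body only counts blocks where the real code would load and process them. The sketch uses a guarded subtraction to stay in unsigned arithmetic, which is one way to avoid signed math and may differ from what the PR does.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for the heights enumerated up front.
struct subsystem_heights {
    uint64_t ons;
    uint64_t snl;
    uint64_t sql;
};

// The earliest height across all subsystems is where block loading starts.
uint64_t rescan_start_height(const subsystem_heights& h) {
    return std::min({h.ons, h.snl, h.sql});
}

// Returns the number of blocks visited. All math is unsigned and the loop
// simply does nothing when total_blocks == 0, so no early exit is needed.
uint64_t rescan(const subsystem_heights& h, uint64_t chain_height) {
    uint64_t const start        = rescan_start_height(h);
    uint64_t const total_blocks = chain_height > start ? chain_height - start : 0;
    uint64_t processed          = 0;
    for (uint64_t i = 0; i < total_blocks; i++)
        processed++;  // the real loop would process block (start + i) here
    return processed;
}
```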


We do a duplicate insertion check against the {block,tx,contrib} index, which is guaranteed to be unique. Note we do _not_ check or store the block hash, because our threat model here is that the block hash is valid by the time this part of the code is reached (i.e. it has passed all of the blockchain and SNL validation). To do this we have to pipe through some new information from the SNL when it calls into the SQL DB to insert a delayed payment (i.e. return of stake). This makes it so that there is exactly 1 row for each returned stake, as they are uniquely identified by the {block,tx,contrib} index.
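The effect of that uniqueness guarantee can be modelled in memory as below. This is only an analogy for the SQL constraint: `delayed_payment_table` and its members are hypothetical, and the real check lives in the SQL DB rather than in a C++ container.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

// {block_height, tx_index, contributor_index} uniquely identifies a returned stake.
using payment_key = std::tuple<uint64_t, uint32_t, uint32_t>;

// Hypothetical in-memory stand-in for the delayed payments table.
class delayed_payment_table {
  public:
    // Returns false, leaving the table unchanged, if the key already exists.
    bool insert(uint64_t block_height, uint32_t tx_index, uint32_t contributor_index,
                uint64_t amount) {
        payment_key key{block_height, tx_index, contributor_index};
        return rows_.emplace(key, amount).second;
    }

    std::size_t size() const { return rows_.size(); }

  private:
    std::map<payment_key, uint64_t> rows_;
};
```

Processing the same block twice then attempts to insert the same keys again, which is rejected, so a stake cannot be paid out twice.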


Incremental steps towards getting the SQL DB and SNL to handle blockchain_detached in the same manner. They both need to detach to exactly the same height, so they must also store backups to the same degree on both sides so that one subsystem doesn't desync on detach. If they desync, we risk the 2 systems producing inconsistent rewards, which breaks BLS aggregation.

- Short-term serialisation was broken because the height that was calculated as the cutoff height to serialise was incorrectly adding an offset to the current height before it was subtracted again, such that all the SNL states that were serialised only serialised their quorums. This means that every time we reorg by 1 or 2 blocks, a full rescan is triggered from the last 10k-th block interval. This was hidden in the code because the logic was hard to read and modify.
- The serialisation routine was also, for some reason, trying to serialise the quorums at a cut-off height, but when you process a block into the SNL you _already_ cull history that is too old. Hence, when you go to serialise, the list of states has already been pruned and there's no need for logic to prune it again. All we have to do is serialise it to disk. That simplifies the serialisation routine, which unveiled the bug in the bullet point above.
- Remove tricky logic that migrated the short-term window into the archive window. This had a bunch of checks embedded in a loop with some arithmetic to calculate some offsets from the height. This is not necessary if we just serialise the state on block add by determining whether the current state meets the criteria to be serialised into the archive. This simplifies the loop, de-nests the algorithm and hence reduces the cognitive load required to grok what's happening.
- Fix an issue where SNL::load() calls reset(false). Reset was clearing out the SQL DB, causing a full rescan on startup every time even if the SNL and SQL DB were in sync.

Unifies the amount of archive/recent state kept by the SQL and SNL DB. This means that when the systems detach, they are able to detach to the same height as each other and then able to resync the chain from the same starting point. Syncing from the same starting point is paramount to ensuring that the same results are derived as loading blocks into the SNL has side effects on the SQL DB.

This ensures that the first SNL state we store in our recent history matches the first state that the SQL DB stores. This means a slight adjustment in the SNL serialisation code, in that it has to serialise its own current state, 'm_state', in tandem with 'm_transient.state_history'.

The wrapper allows us to check and assert if the detach popped us to the right block for testing and verifying that the detach is working. Small rewrite of the detach logic in SNL to read top-to-bottom with minimal branching or early returns.


We detach to a height that is supported by both the SNL and SQL DB by checking the backups available on both before proceeding. This ensures that in exceptional cases where the SQL side is missing a backup, the SNL doesn't go ahead and rewind itself. That would cause a desync that can only be remedied by noticing it and then restarting the daemon. Instead, the two double-check which height they both safely support and detach to that, patching up that particular avenue of breakage.



…s problem

The devnet version of the script deploys a contract that inherits the mainnet ServiceNodeRewards contract but overrides the minimum exit time. The contract currently rejects exits that occur within the first 2hrs of the node registering. We don't have an easy way to get the nodes to fast-forward their clocks by 2hrs to produce a signature at a time in the future.

We don't support popping a block from the SQL DB anymore, only detach to a recent height which is supported by the SNL, ONS and SQL DB. This is required for deterministic behaviour on rewind. Since that's the case we can remove the subtract code from the DB and simplify.

This function always returns true meaning downstream checks on this call always succeed. This function can fail at the SQL insertion step. Predominantly, the pattern in the SQL DB is to use exceptions as the error mechanism and the one function that does call `reward_handler` is wrapped in a try-catch block. I've opted to remove the boolean return and instead annotate that the function throws. This simplifies the calling code in that it no longer has to do a success check (which was redundant since true was always returned) and instead optimistically executes and defers to the try-catch to handle errors.
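The before and after of that change can be sketched as follows. The names here are hypothetical stand-ins (the source only names `reward_handler`), and the `db_ok` flag simulates whether the SQL insertion step succeeds.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the SQL insertion step. It reports failure by
// throwing rather than through a boolean return that was always true.
void insert_reward_rows(bool db_ok) {
    if (!db_ok)
        throw std::runtime_error("SQL insertion failed");
}

// Hypothetical caller: it executes optimistically and defers to the
// surrounding try-catch for error handling, with no per-call success check.
bool add_block(bool db_ok) {
    try {
        insert_reward_rows(db_ok);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

With this shape a failed insertion can no longer be silently ignored by a caller that forgets to check a return value.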

Doy-lee added 26 commits November 22, 2024 09:30

Fix SN sort_and_filter being able to sort by shared_ptr contents

018f417

In practice this never became a problem because our list is a unique set of Ed25519 keys, but for correctness we now compare _just_ the Ed25519 contents for sorting.

Move repeated parameters in process_confirmed_event into struct

75b0273

Correctly share STORE_LONG_TERM_STATE_INTERVAL in SNL and SQL

ed49bdc

Rename STORE_LONG_TERM_STATE_INTERVAL/STORE_RECENT_REWARDS -> HISTORY_ARCHIVE_INTERVAL/HISTORY_KEEP_RECENT_WINDOW

e023bc0

Add throughput stats to subsystem rescan

8f8aab2

Initiate one SQL transaction at start of upgrade_schema

2c097a4

Ensure SNL and SQL cull/keep the same heights

f04c450

Remove pop_blocks API from SQLite and correctly restore from archive/recent table

7e4bf14

Make SNL detach call directly into SQL detach to lockstep the detach step

b6a4bdc

Format non-complying code

8352356

Cleanup pass over PR, fix comments/grammar, remove fakechain SQL check

a91835b

Seems to run fine without the FAKECHAIN check. I checked the unit tests and they use a real SQL DB instance, so there should be no issue with _not_ gating this code from executing.

Serialize SNL state before updating

805abe1

Otherwise we store the latest height twice, instead of once

Detach ONS in the detach hook after all blocks popped

23d7c42

This matches the behaviour of the other detach calls, which detach after all blocks have been popped. This gets rid of the detach function in ONS, which just called prune_db; we now call that directly in the detach hook.

Ensure that SQL block is added instead of early exiting in SNL block add

5263911

Remove outdated comment about SQL DB not being present in FAKECHAIN

e2bce93

ONS check for nullptr, unit tests invokes w/o initialising DB

abd4b3f

Handle unit-tests asserting due to missing txin_gen in miner_tx w/ FAKECHAIN on startup

9a2c77d

Check if the SQL DB is present for unit tests

f8a6cdb

Doy-lee force-pushed the doyle-fix-double-exit-payments-on-snl-detach branch from 334adf7 to f8a6cdb Compare November 21, 2024 22:30

Doy-lee changed the title ~~Doyle fix double exit payments on snl detach~~ Fix double exit payments on snl detach Nov 21, 2024

Doy-lee added 2 commits November 26, 2024 16:42

Add integration test for delayed_payments rollback

4cf9e1a

Verify batch db info height was undone as well in integration tests

8582024

Doy-lee added 3 commits November 27, 2024 13:54
