Fix double exit payments on snl detach #1760
Open
Doy-lee wants to merge 31 commits into oxen-io:dev from Doy-lee:doyle-fix-double-exit-payments-on-snl-detach
Conversation
- Store hardfork heights at the top of the function to reduce line length and minimise the amount of line-breaking enforced by the formatter. This breaks up the text-heavy noise in the height enumeration step.
- Separate the step where we calculate the minimum height of all the subsystems (which is the height to start loading blocks from) so that it comes after enumerating all the heights. This makes it very clear which heights we need to run min() on to find the earliest height (see the sketch after this list).
- Remove the vector accumulating heights. It's clearer which heights to min() now that enumerating the heights and processing them are separated.
- Avoid signed int math. Unsigned works fine by relying on well-defined wrapping behaviour, which removes the noisy static_casts in the main loop.
- Remove the early exit check for there being no blocks to check. The bottom of the function had the exact same check, and the loop is well behaved for total_blocks == 0, so we can just let the function proceed rather than add another branch.
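Roughly, the intent looks like this. A minimal sketch only; the function and parameter names are illustrative, not the actual Oxen identifiers:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

uint64_t start_height_for_rescan(uint64_t blockchain_height,
                                 uint64_t snl_height,
                                 uint64_t sql_height,
                                 uint64_t ons_height)
{
    // Enumerate all the heights up front so it is obvious what feeds into min().
    uint64_t const heights[] = {snl_height, sql_height, ons_height};

    // The earliest subsystem height is where we must start loading blocks from.
    uint64_t start = *std::min_element(std::begin(heights), std::end(heights));

    // Unsigned arithmetic only; if start == blockchain_height then
    // total_blocks is 0 and the caller's loop simply does no work.
    uint64_t total_blocks = blockchain_height - start;
    (void)total_blocks;
    return start;
}
```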
In practice this never became a problem because our list is a unique set of Ed25519 keys, but for correctness we now compare _just_ the Ed25519 contents when sorting (see the sketch below).
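A hypothetical sketch of the intent: sort node entries by the Ed25519 pubkey alone rather than by the whole {pubkey, info} pair. The type and field names here are assumptions for illustration:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

struct node_entry
{
    std::array<uint8_t, 32> ed25519_pubkey;
    // ... other per-node info that must NOT influence the sort order
};

void sort_by_pubkey_only(std::vector<node_entry>& nodes)
{
    std::sort(nodes.begin(), nodes.end(), [](node_entry const& a, node_entry const& b) {
        // Compare only the Ed25519 key contents, ignoring the rest of the info.
        return a.ed25519_pubkey < b.ed25519_pubkey;
    });
}
```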
We do a dupe insertion check against the {block, tx, contrib} index, which is guaranteed to be unique. Note we do _not_ check/store the block hash because our threat model here is that the block hash is already valid by the time this part of the code is reached (i.e. it has passed all blockchain and SNL validation). To do this we have to pipe some new information through from the SNL when it calls into the SQL DB to insert a delayed payment (i.e. a returned stake). This means there is essentially one row per returned stake, as each is uniquely identified by the {block, tx, contrib} index; a sketch of the constraint is below.
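A hedged sketch of the idea using the plain sqlite3 C API; the real table and column names in the Oxen SQL DB may differ:

```cpp
#include <sqlite3.h>

void ensure_unique_payment_index(sqlite3* db)
{
    // One row per returned stake: a payment is uniquely identified by the
    // block height, the tx index within that block, and the contributor index
    // within that tx. A rescan that re-processes the same block can therefore
    // not insert the same payment twice.
    char const* ddl =
        "CREATE UNIQUE INDEX IF NOT EXISTS delayed_payments_unique "
        "ON delayed_payments (block_height, tx_index, contributor_index)";
    sqlite3_exec(db, ddl, nullptr, nullptr, nullptr);
}
```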
…_ARCHIVE_INTERVAL/HISTORY_KEEP_RECENT_WINDOW
Incremental steps towards getting the SQL and SNL to handle blockchain_detached in the same manner. They both need to detach to exactly the same height, so they must also store backups to the same degree on both sides so that neither subsystem desyncs on detach. If they desync, we risk the two systems producing inconsistent rewards, which breaks BLS aggregation.
- Short-term serialisation was broken because the cutoff height to serialise from was incorrectly adding an offset to the current height before subtracting it again, such that all the SNL states that were serialised only serialised their quorums. This meant that every reorg, even of 1 or 2 blocks, triggered a full rescan from the last 10k-block interval. This was hidden in the code because the logic was hard to read and modify.
- The serialisation routine was also trying to prune the quorums at a cut-off height, but when you process a block into the SNL you _already_ cull history that is too old. Hence when you go to serialise, the list of states has already been pruned and there's no need for logic to prune it again; all we have to do is serialise it to disk. That simplifies the serialisation routine, which unveiled the bug in the bullet point above.
- Remove the tricky logic that migrated the short-term window into the archive window. This had a bunch of checks embedded in a loop with some arithmetic to calculate offsets from the height. None of that is necessary if we just serialise the state on block add by checking whether the current state meets the criteria to be serialised into the archive (see the sketch after this list). This simplifies the loop, de-nests the algorithm, and hence reduces the cognitive load required to grok what's happening.
- Fix an issue where SNL::load() called reset(false). Reset was clearing out the SQL DB, causing a full rescan on startup every time even if the SNL and SQL DB were in sync.
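A minimal sketch of the "decide on block add" idea, assuming the HISTORY_* constants described later in this PR; the helper name is illustrative:

```cpp
#include <cstdint>

constexpr uint64_t HISTORY_ARCHIVE_INTERVAL = 10'000; // blocks between archive snapshots

bool should_archive_state(uint64_t height)
{
    // Snapshot exactly on the 10k-block boundaries; no loops or offset
    // arithmetic over the existing history is required.
    return height % HISTORY_ARCHIVE_INTERVAL == 0;
}
```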
Unifies the amount of archive/recent state kept by the SQL and SNL DBs. This means that when the systems detach, they are able to detach to the same height as each other and then resync the chain from the same starting point. Syncing from the same starting point is paramount to ensuring that the same results are derived, as loading blocks into the SNL has side effects on the SQL DB.
This ensures that the first SNL state we store in our recent history matches the first state that the SQL DB stores. This means a slight adjustment to the SNL serialisation code, in that it has to serialise itself, 'm_state', in tandem with 'm_transient.state_history'.
The wrapper allows us to check and assert that the detach popped us to the right block, for testing and verifying that the detach is working. Also a small rewrite of the detach logic in the SNL so it reads top-to-bottom with minimal branching or early returns.
We detach to a height that is supported by both the SNL and SQL DB by checking the backups available on both sides before proceeding. This ensures that in exceptional cases where the SQL side is missing a backup, the SNL doesn't go ahead and rewind itself anyway, which would cause a desync that can only be remedied by noticing it and then restarting the daemon. Instead the two double-check which height they both support and detach to that, patching up that particular avenue of breakage; a sketch of the idea follows.
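A hedged sketch of picking a detach height that both subsystems can actually restore to. The function names and the placeholder lookups are not the real Oxen API:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Placeholder lookups: in the real code these would consult the SNL's
// serialised state history and the SQL DB's snapshot table respectively.
std::optional<uint64_t> highest_snl_backup_at_or_below(uint64_t height)
{
    return height - (height % 10'000); // pretend we only keep 10k-interval archives
}
std::optional<uint64_t> highest_sql_backup_at_or_below(uint64_t height)
{
    return height - (height % 10'000);
}

std::optional<uint64_t> safe_detach_height(uint64_t requested_height)
{
    auto snl = highest_snl_backup_at_or_below(requested_height);
    auto sql = highest_sql_backup_at_or_below(requested_height);
    if (!snl || !sql)
        return std::nullopt; // one side has no usable backup: don't rewind just the SNL

    // Detach to the highest height that both subsystems can restore.
    return std::min(*snl, *sql);
}
```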
Seems to run fine without the FAKECHAIN check. I checked the unit tests and they use a real SQL DB instance, so there should be no issue with _not_ gating this code from executing.
Otherwise we store the latest height twice, instead of once
This matches behaviour with the other detach calls that detach after all blocks have been popped. It also gets rid of the detach function in ONS, which just called prune_db; we now do that directly in the detach hook.
…KECHAIN on startup
Doy-lee force-pushed the doyle-fix-double-exit-payments-on-snl-detach branch from 334adf7 to f8a6cdb on November 21, 2024 22:30
Doy-lee changed the title from "Doyle fix double exit payments on snl detach" to "Fix double exit payments on snl detach" on Nov 21, 2024
…s problem The devnet version of the script deploys a contract that inherits the mainnet ServiceNodeRewards contract but overrides the minimum exit time. The contract currently rejects exits that occur within the first 2 hrs of the node registering, and we don't have an easy way to get the nodes to fast-forward their clocks by 2 hrs to sign a signature for a time in the future.
We don't support popping a block from the SQL DB anymore, only detaching to a recent height that is supported by the SNL, ONS and SQL DB. This is required for deterministic behaviour on rewind. Since that's the case, we can remove the subtract code from the DB and simplify.
This function always returned true, meaning downstream checks on this call always succeeded, yet the function can fail at the SQL insertion step. Predominantly, the pattern in the SQL DB is to use exceptions as the error mechanism, and the one function that does call `reward_handler` is wrapped in a try-catch block. I've opted to remove the boolean return and instead annotate that the function throws. This simplifies the calling code in that it no longer has to do a success check (which was redundant since true was always returned) and instead optimistically executes and defers to the try-catch to handle errors. A sketch of the change is below.
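An illustrative before/after of the error-handling change; the name reward_handler comes from the description above, but the signature and body here are assumptions for the sketch:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Before: returned bool, but every path returned true, so callers' success
// checks were dead code.
//   bool reward_handler(...);

// After: the function is documented to throw on SQL insertion failure and the
// existing try/catch around the caller handles it.
void reward_handler(/* block/tx details */)  // throws std::runtime_error on failure
{
    bool inserted = /* attempt the SQL insertion */ true;
    if (!inserted)
        throw std::runtime_error{"failed to insert reward row"};
}

void process_block_example()
{
    try
    {
        reward_handler();
    }
    catch (std::exception const& e)
    {
        // The one call site was already wrapped in a try/catch like this.
        std::string err = e.what();
        (void)err;
    }
}
```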
The general premise here is to ensure that a blockchain detach detaches the SNL and SQL DB to the exact same height. This ensures that the systems rewind and resync from the same starting point. Then, if the act of syncing the SNL affects the SQL DB (like inserting new rows), this is handled correctly, because the SQL DB is at the exact same height the SNL was at when the row was produced, and any logic in the SQL DB that relies on that information behaves correctly.
The best way to do this would be to just embed the SQL DB into the SNL and lock-step them together, because they are inherently linked to each other (if you add a block to the SNL and not the SQL DB then the ledger becomes incorrect); correctness imposes a hard requirement on the sequence of API calls you need to make when a block is added. That would be a very large and disruptive change that I've not done. More reasonably, I've deleted some public APIs on the SQL DB and merged them into the SNL to ensure the SQL is updated atomically with the SNL (sketched below).
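A minimal sketch, under assumption, of what "merged into the SNL" means here: the SNL owns the hook that mutates the SQL DB, so a block can't be applied to one without the other. All names and types below are placeholders, not the real Oxen classes:

```cpp
#include <cstdint>
#include <stdexcept>

struct sql_db
{
    uint64_t m_height = 0;
    void add_block(uint64_t height) { m_height = height; }
    void detach(uint64_t height)    { m_height = height; }
    uint64_t height() const         { return m_height; }
};

struct snl_state
{
    uint64_t m_height = 0;
    void add_block(uint64_t height) { m_height = height; }
    void detach(uint64_t height)    { m_height = height; }
    uint64_t height() const         { return m_height; }
};

class service_node_list
{
    snl_state m_state;
    sql_db&   m_sql;

public:
    explicit service_node_list(sql_db& sql) : m_sql{sql} {}

    void block_added(uint64_t height)
    {
        // SNL and SQL are updated in one call; callers have no public SQL API
        // that would let them update one side without the other.
        m_state.add_block(height);
        m_sql.add_block(height);
        if (m_state.height() != m_sql.height())
            throw std::runtime_error{"SNL and SQL DB heights diverged"};
    }

    void blockchain_detached(uint64_t height)
    {
        // Both subsystems detach to exactly the same height.
        m_state.detach(height);
        m_sql.detach(height);
    }
};
```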
This leads to the following changes in the PR:
In network_config.h (sketched below):
- HISTORY_ARCHIVE_INTERVAL: The number of blocks between state backups, set to 10k blocks (the same value as we had before). These are stored for a rolling window of HISTORY_ARCHIVE_KEEP_WINDOW blocks. Archived state is for big re-orgs (typically invoked by using pop blocks manually).
- HISTORY_ARCHIVE_KEEP_WINDOW: The rolling window of how many archives we keep, set to 2 years. Before this PR we kept them in perpetuity; realistically this is not necessary, and I've added this to bound the size over time.
- HISTORY_RECENT_KEEP_WINDOW: How many blocks worth of state from the tip should be kept. This is for short re-orgs. It was previously a local variable in the SNL and was lifted to be global so that it can be shared by the SQL DB.

Now both the SNL and SQL DB use these variables to determine when they take a snapshot of their current state, and in doing so both store the exact same set of backups that they can then detach to when requested and hence resync from the same starting point. The calculation and backup of state is done in the SNL and SQL individually, which is ripe for bugs as the SNL has to back up using the binary serialisation system and the SQL DB using SQL queries. To guard against this and future bugs I've added a bunch of sanity checks and asserts in various places to try to catch and recover if they go out of sync.
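A sketch of how the three constants might look in network_config.h. The 10k-block interval and 2-year archive window come from the description above; the block-time arithmetic and the recent-window value are placeholders:

```cpp
#include <cstdint>

// Blocks between archived state snapshots (used for big re-orgs).
constexpr uint64_t HISTORY_ARCHIVE_INTERVAL = 10'000;

// Keep archived snapshots for a rolling window of roughly 2 years of blocks
// (placeholder arithmetic assuming a 2-minute block time).
constexpr uint64_t BLOCKS_PER_DAY              = 24 * 60 / 2;
constexpr uint64_t HISTORY_ARCHIVE_KEEP_WINDOW = BLOCKS_PER_DAY * 365 * 2;

// Blocks of state kept from the tip for short re-orgs; shared by SNL and SQL DB.
constexpr uint64_t HISTORY_RECENT_KEEP_WINDOW = /* placeholder */ 720;
```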
- A unique index on (block_height, tx_index, contributor_index), which outright prevents the SQL DB from inserting a payment for a node twice if it somehow was made to process a block twice. This blocks the issue we're encountering now on stagenet where a rescan was re-processing exits into the SQL DB.
- exec_detach_hooks is now run on Blockchain::init, which means that if at any point the daemon crashes and the systems go out of sync, a restart should detect this and get them to resync, so it's fairly more resilient to bugs over time.

Part of this change required figuring out how the old code worked, what it did, etc., and so bug fixes and improvements were made in the pre-existing process that I believe make it more robust and understandable.
The height was being offset, and then the function offset the height argument again to something out-of-bounds, which would trigger the serialiser to only store quorums. That's fixed now, having reviewed the serialisation routines: most of the branches and the scary height calculations were removed and the code now reads top-to-bottom with explanations of each step.
Fixed filter_and_sort sorting by the contents of pubkey_and_info instead of just pubkey.