
discussion: ways to improve Ethereum block proposer duty flow #1829

Open · iurii-ssv opened this issue Oct 29, 2024 · 2 comments
Labels: enhancement (New feature or request)

iurii-ssv (Contributor) commented Oct 29, 2024

I'll post the findings from the recent Discord discussion(s) here so we have them documented and can revisit them later (it seems important).

The problem

  1. It seems relays are blocking any proposals past the 4s mark of the current slot. Would it be possible to add a bypass mechanism after 4s, so that a non-MEV block is submitted (instead of a complete miss)?

  2. Iurii mentioned our rounds 1 and 2 of the proposal duty allocate the 4s unevenly/sub-optimally. It might make sense to review this mechanism and its timings.

  3. Another point that's not 100% clear to me is why we start the proposer duty exactly at the start of the targeted slot. I understand the currently written code works this way, but is there some fundamental limitation (perhaps DVT-related) that requires it? (Answering myself: probably not.)

From this article - https://www.blocknative.com/blog/anatomy-of-a-slot - it seems the start of the targeted slot is the time by which most blocks have already been proposed (so they have enough time to propagate through the Ethereum network).

4s after slot start (call it the "soft limit") also seems quite late/risky (otherwise the chart below would look different, I think); we might want to limit it to 2.5s instead, for example - and that would render round 2 useless/unnecessary, btw:

Before t = 0 there are safe proposals that ensure propagation happens and so you don't miss out on the slot. There are also a lot of actors that wait until very late, 2.5 seconds in, who are taking their chances with propagation delays. So that's the general distribution of block availability over time relative to that slot boundary. You can call it the t = 0 or t=12.

[figure: distribution of block availability over time relative to the slot boundary, from the Blocknative article]

Potential solution - Proposal 1

There is a trade-off between picking/broadcasting the proposed block sooner vs. later:

  • we do want to propose a block built at the latest possible moment (e.g. one that becomes available 1s after slot start) - call it the profitable block
  • yet we can't risk waiting too long (especially because DVT adds extra time to do its thing, compared to a single-node Ethereum setup), so we want to have a backup block

I'll record another round of thoughts I have here on findings 1-3 from above. I think we can do something like this (a rough sketch of the timing logic follows the list):

  • let's forget for a second how hard it is to implement in code, and how many additional resources (cpu/memory/bandwidth) it will consume, just to get to the theoretical best solution; then we can improve upon it further, making compromises and such
  • the backup block should be built early enough that we never really have any "delay" issues with it (it doesn't matter how we get it; let's say we expect to get it 4s before the target slot start time - call that moment Tb); we want to reach SSV QBFT consensus on the backup block as soon as possible, but not sign it (post-consensus phase) just yet, because we don't want to double-sign (and get slashed)
  • then comes the time to build a profitable block (it doesn't matter how we get it; let's say we expect to get it 1s after the target slot start time - call that Tp); we want to reach SSV QBFT consensus on the profitable block as soon as possible, but unlike for the backup block there is a high likelihood we won't manage it in time (let's call this deadline Td; it could be 2.5s after the target slot start time, for example); the profitable block either gets decided by QBFT before the deadline Td or it doesn't
  • so each SSV node either waits until Td passes and enters the post-consensus phase for the backup block (to sign it and broadcast it to Ethereum), or it reaches QBFT consensus on the profitable block before Td and ditches the backup block altogether, moving to the post-consensus phase
  • additionally, to get the most out of the time the block proposer duty runner has (especially for the profitable block) we want to gather some production data (perhaps we already have it) on how much time round 1, round 2, ... typically take - from this data we can estimate the best QBFT configuration (how many rounds we want to fit into that short timespan, and what the timeout for each round should be)
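To make the timing above concrete, here is a minimal Go sketch of the flow (assuming the Tb consensus on the backup block already happened). Everything in it - Block, fetchBlock, runQBFT - is a hypothetical placeholder rather than the actual SSV node API; it only illustrates racing the profitable-block consensus against the deadline Td:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Block struct{ ID string }

// Hypothetical placeholders standing in for the real block-fetching
// and QBFT machinery of the SSV node.
func fetchBlock(ctx context.Context) Block               { return Block{ID: "profitable"} }
func runQBFT(ctx context.Context, b Block) (Block, bool) { return b, true }

// decideBlock races the profitable-block QBFT instance against the
// deadline Td. The backup block was decided (but deliberately not
// signed) before the slot started, so falling back to it is safe.
func decideBlock(ctx context.Context, slotStart time.Time, backup Block) Block {
	const td = 2500 * time.Millisecond // Td: deadline for profitable consensus

	profitable := make(chan Block, 1)
	go func() {
		// Around Tp (~1s after slot start) we fetch the profitable
		// block and try to reach QBFT consensus on it.
		if decided, ok := runQBFT(ctx, fetchBlock(ctx)); ok {
			profitable <- decided
		}
	}()

	select {
	case decided := <-profitable:
		return decided // consensus in time: ditch the backup block
	case <-time.After(time.Until(slotStart.Add(td))):
		return backup // Td passed: sign the pre-decided backup block
	}
}

func main() {
	block := decideBlock(context.Background(), time.Now(), Block{ID: "backup"})
	fmt.Println("signing:", block.ID)
}
```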

Will this work, or am I missing something? If something like this could work, I guess it's not easy to implement straight away, but we can slowly progress towards it. It seems like an important problem to solve.
Regarding additional resource (cpu/mem/...) consumption - sure, it will cost some, but I think we won't have to query the Beacon node for 2 blocks at the same time much (if at all), and block proposals seem rare enough for this not to matter much anyway.

Regarding external factors (beacon node, relay):

  • we have no guarantees on how much time we will spend building a block (because we request it from an external source that could delay forever); I guess we just need to pick appropriate values for Tb, Tp and Td to fit the most common scenarios (and maybe add some Beacon node request-retries, if we don't have them already)
  • even if SSV nodes do their job on time, we might get unexpected delay(s) from beacon node(s) when broadcasting the signed block (although it helps that multiple SSV nodes are broadcasting it - 1 or 2 failed beacon nodes might be fine); we can adjust the Tp and Td values to get the best results

@iurii I think what might be a problem is that an operator doesn't know if others successfully submitted. If they have then it will lead to slashing. And I think there's no reliable way to find this out unless we add another consensus layer on whether they submitted

Are we talking about the slashable Ethereum offense known as double signing?

I believe for block proposals (for attestations it's similar but somewhat different) it means Ethereum can punish a validator if it has signed 2 different blocks (headers) and both of these blocks were observed by somebody (I think that role might be called a watchtower or fisherman), and that somebody created a proof of it and sent it out to Ethereum nodes to verify.

So, if that's how Ethereum slashing for block production works - in the approach I outlined above we never actually sign 2 different blocks; only 1 block will ever be signed. (Of course, we'll probably need to add/adjust some logic so that an SSV node only ever signs 1 block in the post-consensus phase, even though it might have 2 blocks at hand after finishing the 2 QBFT consensuses - thus, the post-consensus quorum needed to reconstruct the validator signature can only be reached for at most 1 of the 2 blocks each SSV node prepared for the target slot.) A minimal sketch of such a guard follows below.
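To illustrate that invariant, here is a sketch of a per-slot signing guard (names are illustrative, not the actual SSV node API):

```go
package guard

import (
	"errors"
	"sync"
)

type (
	Slot uint64
	Root [32]byte
)

// SigningGuard enforces the invariant described above: per slot, a
// node contributes a partial signature to at most one block root,
// even if it holds two decided blocks after the two QBFT instances.
type SigningGuard struct {
	mu     sync.Mutex
	signed map[Slot]Root
}

func NewSigningGuard() *SigningGuard {
	return &SigningGuard{signed: make(map[Slot]Root)}
}

// Approve returns nil only for the first root signed in a slot (or a
// repeat of that same root); any second, different root is rejected.
func (g *SigningGuard) Approve(slot Slot, root Root) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if prev, ok := g.signed[slot]; ok && prev != root {
		return errors.New("already signed a different block root for this slot")
	}
	g.signed[slot] = root
	return nil
}
```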

And as for who submits the signed block to the Ethereum network and how - I believe there aren't any issues with broadcasting such a block from multiple different Beacon nodes at the same time (or at different times); in fact, it is probably better if we can do multiple such broadcasts, because then the block will reach all Ethereum nodes sooner (for Ethereum validators to be able to attest to it). Something like the fan-out sketched below.
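For illustration, a minimal Go sketch of such a fan-out; publish stands in for a real beacon API client call, and nothing here is the actual SSV node code:

```go
package broadcast

import (
	"context"
	"errors"
)

// BroadcastAll publishes the one fully signed block through several
// beacon nodes concurrently and returns as soon as any of them
// accepts it, so 1 or 2 failed beacon nodes are indeed fine.
func BroadcastAll(ctx context.Context, nodes []string,
	publish func(ctx context.Context, node string) error) error {

	results := make(chan error, len(nodes))
	for _, n := range nodes {
		go func(node string) { results <- publish(ctx, node) }(n)
	}

	var lastErr error
	for range nodes {
		err := <-results
		if err == nil {
			return nil // one successful broadcast is enough
		}
		lastErr = err
	}
	return errors.Join(errors.New("all beacon node broadcasts failed"), lastErr)
}
```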

iurii-ssv (Contributor, Author) commented Oct 30, 2024

Another note: the profitable block's QBFT (call it the 2nd QBFT, or QBFT 2) should probably have exactly 1 round (not 2 or even more), because otherwise we are not as profitable as we could be.

Proposal 2 (pushing the approach from above to its limits)

To take a step back and explain why we do 2 QBFT consensuses: ideally, when proposing Ethereum blocks we want to have these 2 properties:

  1. "proposing operator" rotation (important for preventing any particular operator from conducting funny business, like selling his privilege "to convince the validator cluster to propose certain block(s), or kinds of blocks" through a side channel); note that this is different from QBFT instance leader rotation, which exists only to ensure QBFT liveness - "proposing operator" rotation is, in a sense, derived from it
  2. every operator wants to know whether he needs to sign the profitable block or the backup block to successfully finish the duty - this is why we do the 2nd QBFT (on the profitable block, in the algorithm described above): upon commit-quorum every operator knows what the other operators (honest operators, to be precise) are going to do, and can broadcast the correct block of the 2 for BLS-signing

Property 1) is nice to have, but perhaps it can be somewhat relaxed if we want to get the most out of MEV; I mention it just for completeness (for further considerations we might have on it) - the approach described below doesn't compromise on it.

Property 2), however, is a binary thing (either we have it or we don't), and not having it means operators will miss proposing the Ethereum block if they couldn't correctly guess & agree on which of backup/profitable to propose. But (again, to get the most out of MEV) we might consider an approach where we do the QBFT 1 consensus on the backup block first (like in the approach above), but then throw the backup block away, take a gamble, and propose the profitable block every time without doing a full QBFT 2 consensus on it - rather, just hoping that the profitable block will spread to enough operators for them to sign it with their validator share. This seems to make sense for 2 reasons:

  • if we can arrange it this way, it would be best if the resulting leader of QBFT 1 (which can have several rounds) is also chosen to lead round 1 of QBFT 2 (which is the only round) - this is because the same leader reaching QBFT consensus on the backup block means he is likely going to be online and well connected to other peers (and to his beacon node) in the next ~5 seconds, which means he is very likely to succeed with proposing the profitable block; hence we can optimistically/prematurely terminate the 2nd QBFT right after the proposal step (without doing prepare/commit confirmations), finishing much faster as a result (good for MEV); we can't claim the same "liveness" heuristic for other operators - hence re-using the leader from QBFT 1 is preferable
  • we do the final BLS-signing to reconstruct the validator signature at time Td (or sooner, if we got an agreement on the profitable block sooner); this involves everyone broadcasting their partial signature + receiving everyone else's; this could fail due to p2p networking issues (and nothing else) - which happens to be the only reason for the QBFT 2 proposal phase to fail; hence doing the QBFT 2 proposal phase + BLS-signing has roughly the same success chance as doing just the BLS-signing, so we might as well omit the QBFT 2 prepare/commit phases altogether

So we don't really need the backup block at all (hence we throw it away) - what we really need is to "check networking conditions right before we are about to propose the profitable block" (and to select the leader best suited to do that - the leader proven by the QBFT 1 consensus). The short-circuit could look something like the sketch below.
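A minimal sketch of what this proposal-phase-only short-circuit could look like on the follower side; all types and helpers here are hypothetical placeholders, not real SSV node code:

```go
package optimistic

import (
	"context"
	"time"
)

type Block struct{ ID string }

// AwaitLeaderProposal waits (until deadline Td) for the profitable
// block broadcast by the leader that QBFT 1 just proved live; on
// receipt the node signs immediately, skipping prepare/commit. There
// is no backup block in Proposal 2: if nothing arrives in time, the
// slot is missed - that's the accepted gamble.
func AwaitLeaderProposal(ctx context.Context, fromLeader <-chan Block, td time.Time) (Block, bool) {
	select {
	case b := <-fromLeader:
		return b, true // proposal-phase-only "consensus": sign right away
	case <-time.After(time.Until(td)):
		return Block{}, false // deadline Td passed: missed slot
	case <-ctx.Done():
		return Block{}, false
	}
}
```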

This approach relies on a couple of hypotheses that need to be verified against real-world cluster data (preferably production data, which we could maybe gather by implementing a "dry-run" version of this approach on prod cluster(s)). Even though this approach seems riskier (compared to the approach with the backup block, there will most likely be more missed blocks), it might yield a higher expected reward for the Ethereum validator(s) over time (because of MEV).

iurii-ssv (Contributor, Author) commented Nov 2, 2024

Side-note: so far (in Proposal 1 and Proposal 2 above) we've treated the Beacon (and Relay) node as a "black box", assuming we can't expect a timely response from it along the lines of "give me the best you can in 2s". It's reasonable not to rely on assumptions like that, of course, but perhaps we could emulate this functionality by sending multiple sequential requests to the Beacon node at Td - 3, Td - 2, Td - 1 (what we are really targeting is the Relay infrastructure, actually, because operators likely run the Beacon node itself close to the SSV node and monitor its health closely) and take the latest successful response, cancelling the rest. This needs looking into; a sketch of the idea follows.
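A minimal Go sketch of that staggered-request idea; fetch is a placeholder for the actual beacon/relay client call:

```go
package staggered

import (
	"context"
	"time"
)

type Block struct{ ID string }

// LatestBlockBefore fires block requests at Td-3s, Td-2s and Td-1s,
// keeps the latest response that arrives before Td, and cancels
// whatever is still in flight at Td.
func LatestBlockBefore(ctx context.Context, td time.Time,
	fetch func(context.Context) (Block, error)) (Block, bool) {

	ctx, cancel := context.WithDeadline(ctx, td)
	defer cancel() // cancels any request still in flight at Td

	results := make(chan Block, 3)
	for _, lead := range []time.Duration{3 * time.Second, 2 * time.Second, time.Second} {
		go func(start time.Time) {
			time.Sleep(time.Until(start)) // wait for this request's send time
			if b, err := fetch(ctx); err == nil {
				results <- b
			}
		}(td.Add(-lead))
	}

	var latest Block
	var ok bool
	for {
		select {
		case b := <-results:
			latest, ok = b, true // later responses overwrite earlier ones
		case <-ctx.Done():
			return latest, ok // Td reached: return the freshest response
		}
	}
}
```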

Proposal 3 (best overall, complex? cryptography, maybe dead end)

I'm no cryptography expert, but I think some form of threshold/conditional cryptography should exist that makes the following Proposal 3 viable.

Let's say we have a cluster of 4 operators (servicing the same validator V). Each of them in parallel (like we currently do) starts requesting a block from its Beacon node and tries to broadcast it to the rest of the network (not signing it with the V-share yet - spreading it throughout the p2p network only); let's say this process starts at time Td - 1s (1s is just an arbitrary number).

At time Td this broadcasting phase ends - every node will have 0-4 blocks (from itself, and from every other node).

Let's say for the target slot the QBFT leader ordering would be (2->3->4->1) - if we were to start doing QBFT... but we won't; instead, every operator signs ALL 4 blocks with his V-share, but in a smart way - encrypting every V-share signature, building a 4-layer onion-like object that is broadcast to every other operator; operators collect as many of these objects as they can until another deadline at Td + 1s (again, 1s is just an arbitrary number).

Thus, at Td + 1s every operator has 0 or 1 objects from each other operator; these objects contain 4 signed blocks (if there was no block to sign, the operator just signs a dummy 0-value block), each signed block encrypted, residing at its own layer (1 to 4) of this onion-like object:

  • (as per the leader order 2->3->4->1) because leader 2 is the leader with the highest priority for this slot, every operator knows to try to reconstruct & broadcast this leader's Ethereum block proposal if he can (from all the objects he's got) - so the first layer of the object is not encrypted (only the 3 lower layers are), because we need to start somewhere
  • the operator checks whether he can reconstruct the V-signature from layer 1 (which contains the proposed block of leader 2); if he can - he does, and we just send the V-signed block to the Beacon node and we are done; if he can't, he needs to construct a proof that a quorum for this layer can never happen (the V-signature can never exist) - remember, everybody has received 0-4 objects, differing only in what signed blocks are available to them at what layers (if the receiving operator is missing the object from some operator M, he treats it as if he had an ALL-value object from operator M - that is, he assumes operator M signed everything!); so essentially we need to construct a cryptographic proof (assuming the math to do it exists) that shows a partial quorum (1/3) of operators who "didn't see a block for the leader on this level and signed off on not seeing it", accounting for absent signatures by treating them as if they exist (ALL-value) - hence it's not always a partial quorum of the cluster (4 in our example), but sometimes a partial quorum of cluster_size - 1 (or cluster_size - 2, etc.)
  • once we have that proof ^ we can use it to decrypt the next layer (for the next leader in line), and we repeat this until a layer actually yields a V-signature (and if it does, there is no way to produce a proof of existence of an "honest partial quorum signing off on not seeing" - meaning no way to decrypt the partial signatures of the next layer to reconstruct a 2nd V-signature for a different block), or until we run out of layers; the control flow could look like the sketch right after this list
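For illustration only, here is a rough Go sketch of that layer-peeling control flow. The cryptographic primitives (ReconstructSignature, ProveNoQuorum, DecryptNextLayer) are assumed to exist and are stubbed out - nothing here is a real implementation:

```go
package onion

import "errors"

type (
	Layer         struct{} // encrypted partial V-share signatures for one leader's block
	Signature     struct{} // reconstructed validator (V) signature
	NoQuorumProof struct{} // proof that a signing quorum can never form for a layer
)

// Hypothetical primitives, stubbed so the control flow compiles.
func ReconstructSignature(l Layer) (Signature, bool)     { return Signature{}, false }
func ProveNoQuorum(l Layer) (NoQuorumProof, error)       { return NoQuorumProof{}, nil }
func DecryptNextLayer(next Layer, p NoQuorumProof) Layer { return next }

// PeelLayers walks the layers in leader order: either a layer yields
// a reconstructable V-signature (done - and, with a signature present,
// no no-quorum proof can exist, so deeper layers stay sealed and a 2nd
// V-signature is impossible), or we prove a quorum can never form for
// it and use that proof as the key to the next layer.
func PeelLayers(layers []Layer) (Signature, error) {
	current := layers[0] // the highest-priority leader's layer is unencrypted
	for i := 0; ; i++ {
		if sig, ok := ReconstructSignature(current); ok {
			return sig, nil // broadcast this V-signed block; we are done
		}
		if i+1 >= len(layers) {
			return Signature{}, errors.New("ran out of layers: missed slot")
		}
		proof, err := ProveNoQuorum(current)
		if err != nil {
			return Signature{}, err
		}
		current = DecryptNextLayer(layers[i+1], proof)
	}
}
```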

additional notes:

  • the best part is, we will have this "honest" partial quorum every time we actually need it (that is, every time p2p networking conditions are good enough to let enough messages spread around to reconstruct the V-signature; if the p2p network is doing poorly, we are doomed anyway)
  • when doing those quorum/threshold checks mentioned above, we need to keep in mind that we must exclude the checking operator himself, because he might withhold his own signature(s)
  • when an operator checks whether there is a quorum for a block at a certain layer (by a certain leader), he must not only verify that each signature (obj1->layer1Block, obj2->layer1Block, ...) is valid and belongs to the operator the message claims it belongs to, but also that every block on a given layer is exactly the same block (meaning the layer's leader doesn't try to broadcast 2 or more different blocks - he can broadcast at most 1)
  • (optimization) because we know the order of leaders for the target slot ahead of time, we know it's unlikely that we will often fall back to leader 4, 5, 6, ... - hence we might cut down on the p2p messages we send around by simply acknowledging this fact and restricting this "next leader in line" rotation to 3 or 4 leaders (this especially makes sense for clusters of 13+ operators)
  • (optimization) another thing to verify is whether any operator can fetch the block body & broadcast it through any relay; if not every operator can, we probably want to handle this by ignoring block proposals that contain a block without the proposing leader's signature on it (meaning the leader was able to broadcast the block he intends to propose, but wasn't able to broadcast the signed layered object - which likely means the leader went offline and won't be able to finish the job; hence we'd better move on to the next leader/layer, since we can do that with almost 0 overhead)

This proposal would probably work best overall - in terms of treating the Beacon (and Relay) node as a "black box" (while still having high Ethereum block proposal availability), as well as predictable/guaranteed latency of the overall execution on the SSV node(s) side (it allows us to delay Ethereum block building/fetching as much as possible - best for MEV).

The downsides are:

  • complex(?) cryptography (if it even exists - it might be hard to implement), and complex code with edge cases that aren't easy to test
  • (depending on what we are comparing against) possibly higher network traffic - for blinded blocks (headers, signed headers) the p2p traffic might actually be even lower than for Proposals 1 and 2 (because those involve exchanging QBFT messages), but for full blocks it could be prohibitively expensive (or just too unreliable / high-latency) to send up to 3-4 large blocks around for just 1 target slot
