When syncing, replace misbehaving peers after a delay (minutes) instead of immediately #4941

nfrisby · 2024-08-26T15:05:33Z

On 2024 Aug 21, @crocodile-dentist reminded us of a request our two teams had previously discussed. This Issue makes that request concrete.

For the sake of tuning the Ouroboros Genesis parameters, the Tweag team worked out upper bounds on the most delay the adversary could induce at once. Those bounds assumed the adversary was not replenishing itself, i.e. not immediately replacing each adversarial peer that the syncing node disconnects from with another coordinated adversarial peer.

The worked-out bounds become significantly worse if the adversary does successfully replenish itself, and parameter tuning then becomes much more delicate. The probability of replenishment, especially of multiple occurrences, is quite low unless the adversary controls a lot of stake. Even so, we are interested in effectively eliminating the risk of replenishment by delaying the replacement of misbehaving peers by 15 min or so (edit: we still disconnect from them immediately, but delay before connecting to their replacement). Because the delay applies only to misbehaving peers, it should not slow down the sync: we assume the syncing node has at least one honest peer.
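To make the request concrete, here is a minimal sketch of the intended behaviour, not a proposal for the actual outbound-governor code. The `PeerSlots` class, its fields, and the 900-second constant are all hypothetical; the constant simply encodes the "15 min or so" mentioned above. A misbehaving peer is dropped immediately, but the slot it frees is held back until the delay has elapsed.

```python
from dataclasses import dataclass, field

# Assumed value: the "15 min or so" from the text, in seconds.
REPLACEMENT_DELAY_S = 15 * 60


@dataclass
class PeerSlots:
    """Hypothetical model of a syncing node's outbound peer slots."""

    target: int
    connected: set = field(default_factory=set)
    # Times at which slots freed by misbehaving peers may be refilled.
    held_until: list = field(default_factory=list)

    def disconnect(self, peer, now, misbehaving):
        """Drop a peer immediately; hold its slot only if it misbehaved."""
        if peer not in self.connected:
            return
        self.connected.remove(peer)
        if misbehaving:
            self.held_until.append(now + REPLACEMENT_DELAY_S)

    def free_slots(self, now):
        """Number of new connections that may be opened at time `now`."""
        # Forget holds whose delay has elapsed.
        self.held_until = [t for t in self.held_until if t > now]
        return self.target - len(self.connected) - len(self.held_until)
```

In this sketch, a peer dropped for any other reason frees its slot at once, so ordinary churn is unaffected by the delay.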

If this delay turns out to be particularly onerous to implement, then perhaps the Consensus Team can reconsider investing more effort in revisiting the parameter tuning under the more difficult assumptions.

crocodile-dentist · 2024-08-26T16:04:28Z

One concern I had with increasing the connection timeout is that, under unfavorable conditions, we may be continually dropping our big ledger peers. At the end of syncing, the network layer informs consensus of the OutboundConnectionState, i.e. whether we are in a trusted state or not. I envisioned that we may not want to signal a trusted state (and so proceed into deadline mode) if we are connected to fewer than some arbitrary number of big ledger peers; the number of connections may be below the genesis targets because of the timeout logic that this patch would introduce. Talking with Neil, we concluded that fewer than 5 is insufficient, but the threshold can be adjusted in the node configuration. Nick, is this a compatible or desirable operation from the consensus perspective? At the last network weekly update meeting, Australia was given as an example of a poorly connected region, where a node may constantly rotate the peers it downloads blocks from. If we end up with fewer than this configured number of big ledger peers at the end of syncing, the node will be stuck.
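A rough sketch of the gating rule described in this comment, for discussion only. The enum below is a hypothetical stand-in for the OutboundConnectionState signal, and `MIN_BIG_LEDGER_PEERS` is an assumed name for the configurable threshold; 5 is the value mentioned above.

```python
from enum import Enum


class OutboundConnectionState(Enum):
    """Hypothetical stand-in for the state reported to consensus."""

    UNTRUSTED = 0
    TRUSTED = 1


# Assumed default: "fewer than 5 is insufficient".
MIN_BIG_LEDGER_PEERS = 5


def outbound_connection_state(big_ledger_peers, minimum=MIN_BIG_LEDGER_PEERS):
    """Signal a trusted state only with enough big ledger peers connected."""
    if big_ledger_peers >= minimum:
        return OutboundConnectionState.TRUSTED
    return OutboundConnectionState.UNTRUSTED
```

With this rule, a node whose big ledger peer count stays below the threshold never reports a trusted state, which is the "stuck" case raised above.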

amesgen · 2024-08-27T08:50:34Z

Thanks for bringing that up.

In practice, losing almost all of your connections is probably due not to a powerful adversary but to some network problem. IIRC that motivated your suggested mechanism: it prevents us from continuing to sync in such a case and instead makes us wait until we can establish connections to some minimum number of peers again. Is that right?

A scenario where such a mechanism (signalling that we are in an untrusted state when we are connected to very few peers) would be undesirable is when exactly one of our peers is honest: all but one of our peers would intentionally misbehave and hence be disconnected, and they would then not be replaced by new peers for ~15 min. Note that it would actually be completely fine to conclude that we are caught up even though we are connected to just this one peer, as that peer is honest.

Under the assumption that there are enough honest nodes to eventually increase our peer count again, your suggested mechanism will still allow us to catch up: we will eventually find new peers, which are more likely to be honest, so that cycle can't continue forever.

nfrisby · 2024-08-28T16:29:31Z

In today's meeting, @karknu reminded us that we should avoid triggering this timeout for honest peers that are trying their best but failing to meet the expected performance (e.g. they suffer a huge GC pause).

This may be tricky. For example, if a peer offers an alternative chain but times out before filling that genesis window, then it's possible that it is an honest peer suffering a performance hiccup. But that behavior also matches a possible attack vector.

It seems possible to navigate this, e.g. by only detaining peers that time out if their timeout occurs during a disagreement (an honest peer won't often time out in ChainSync). But it certainly deserves some thought regarding robustness.
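One possible shape for that rule, as a sketch only: the `DisconnectReason` categories and the `in_disagreement` flag are hypothetical, and how a "disagreement" would actually be detected is left open.

```python
from enum import Enum, auto


class DisconnectReason(Enum):
    """Hypothetical categories for why a peer was disconnected."""

    MISBEHAVIOUR = auto()  # clearly adversarial behaviour
    TIMEOUT = auto()       # the peer was too slow to respond
    OTHER = auto()         # e.g. ordinary churn


def should_delay_replacement(reason, in_disagreement):
    """Decide whether the replacement of a disconnected peer is delayed.

    `in_disagreement` is assumed to be True when the peer was offering a
    chain that conflicted with another peer's at the time it was dropped.
    """
    if reason is DisconnectReason.MISBEHAVIOUR:
        return True
    if reason is DisconnectReason.TIMEOUT:
        # A timeout alone may just be an honest performance hiccup.
        return in_disagreement
    return False
```

The open question is the `TIMEOUT` branch: an honest peer that stalls during a disagreement is still detained under this rule, so it trades some false positives for covering the attack vector described above.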

nfrisby added the Genesis label Aug 26, 2024

github-project-automation bot added this to Ouroboros Network Aug 26, 2024

coot added the outbound-governor and high-priority labels Aug 28, 2024
