When syncing, replace misbehaving peers after a delay (minutes) instead of immediately #4941

Open
nfrisby opened this issue Aug 26, 2024 · 3 comments
Labels
Genesis, high-priority (high priority issues / PRs), outbound-governor (Issues / PRs related to outbound-governor)

Comments

@nfrisby
Contributor

nfrisby commented Aug 26, 2024

On 2024 Aug 21, @crocodile-dentist reminded us of a request our two teams had previously discussed. This Issue makes that request concrete.

For the sake of tuning the Ouroboros Genesis parameters, the Tweag team worked out upper bounds on the maximum delay the adversary could induce at once. Those bounds assumed the adversary was not replenishing itself, i.e. that an adversarial peer the syncing node disconnects from is not immediately replaced by another coordinated adversarial peer.

The worked-out bounds become significantly worse if the adversary does successfully replenish itself, and parameter tuning then becomes much more delicate. The probability of replenishment (especially of multiple occurrences) is quite low unless the adversary controls a lot of stake. Even so, we are interested in effectively eliminating the risk of replenishment by delaying the replacement of misbehaving peers by 15min or so (edit: we still disconnect from them immediately, but delay before connecting to their replacement). Because the delay applies only to misbehaving peers, it should not slow down the sync: we assume the syncing node has at least one honest peer.

If this delay turns out to be particularly onerous to implement, then perhaps the Consensus Team can reconsider investing more effort in revisiting the parameter tuning under the more difficult assumptions.
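
To make the request above concrete, here is a minimal Python sketch of "disconnect immediately, delay the replacement". It is only an illustration: `PeerSlots`, `REPLACEMENT_DELAY_S`, and the slot-counting model are hypothetical names and assumptions, not the outbound governor's actual API or data structures.

```python
from dataclasses import dataclass, field

# Assumption: the "15min or so" mentioned in the issue, in seconds.
REPLACEMENT_DELAY_S = 15 * 60


@dataclass
class PeerSlots:
    """Hypothetical model of outbound peer slots (not the real governor)."""

    target: int
    active: set = field(default_factory=set)
    # Times at which a slot vacated by a misbehaving peer may be refilled.
    detained_until: list = field(default_factory=list)

    def disconnect(self, peer, now, misbehaved):
        if peer not in self.active:
            return
        # The disconnect itself is always immediate.
        self.active.remove(peer)
        if misbehaved:
            # Only misbehaviour delays the connection to a replacement.
            self.detained_until.append(now + REPLACEMENT_DELAY_S)

    def free_slots(self, now):
        # Drop expired detentions, then count the slots that may be refilled.
        self.detained_until = [t for t in self.detained_until if t > now]
        return self.target - len(self.active) - len(self.detained_until)
```

In this sketch a slot vacated by a misbehaving peer stays unavailable for 15 minutes, while a slot vacated for any other reason can be refilled right away.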

@crocodile-dentist
Contributor

One concern I had with increasing the connection timeout is that, under unfavorable conditions, we may be continually dropping our big ledger peers. At the end of syncing, the network layer informs consensus of the OutboundConnectionState, i.e. whether we are in a trusted state or not. I envisioned that we may not want to signal a trusted state to proceed into deadline mode if we are connected to fewer than some arbitrary number of big ledger peers. Our peer count may fall below the genesis targets because of the timeout logic that this patch would introduce. Talking with Neil, we concluded that fewer than 5 is insufficient, but the threshold can be adjusted in the node configuration. Nick, is this a compatible or desirable operation from the consensus perspective? At the last network weekly update meeting, Australia was given as an example of a poorly connected region, where a node may constantly rotate the peers it downloads blocks from. If we end up with fewer than this configured number of big ledger peers at the end of syncing, the node will be stuck.
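
The gating rule described in this comment could look roughly like the following Python sketch. The enum members, the function name, and the `min_big_ledger_peers` parameter are hypothetical stand-ins for illustration; the real `OutboundConnectionState` constructors and the node configuration key may differ.

```python
from enum import Enum


class OutboundState(Enum):
    # Illustrative names only; not the actual constructors.
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


def outbound_state(active_big_ledger_peers, min_big_ledger_peers=5):
    """Signal a trusted state only with enough active big ledger peers.

    The default of 5 reflects "fewer than 5 is insufficient"; the comment
    says the threshold would be adjustable in the node configuration.
    """
    if active_big_ledger_peers >= min_big_ledger_peers:
        return OutboundState.TRUSTED
    return OutboundState.UNTRUSTED
```

With this rule, a node that ends syncing with only 4 active big ledger peers keeps reporting an untrusted state until a fifth connection is established, which is the "stuck" case raised above.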

@amesgen
Member

amesgen commented Aug 27, 2024

Thanks for bringing that up.

In practice, losing almost all of your connections is probably not due to a powerful adversary but rather to some network problem. IIRC that motivated your suggested mechanism: it prevents us from continuing to sync in such a case and instead makes us wait until we can establish connections to some minimum number of peers again. Is that right?


A scenario where such a mechanism (signalling that we are in an untrusted state when we are connected to very few peers) would be undesirable is one where exactly one of our peers is honest: all but one of our peers would intentionally misbehave, hence be disconnected, and then not be replaced by new peers for ~15min. Note that it would actually be completely fine to conclude that we are caught up even though we are connected to just this one peer, as that peer is honest.

Under the assumption that there are enough honest nodes to eventually increase our peer count again, your suggested mechanism will still allow us to catch up eventually as we will eventually find new peers which are more likely to be honest, so that cycle can't continue forever.

@coot added the outbound-governor and high-priority labels on Aug 28, 2024
@nfrisby
Contributor Author

nfrisby commented Aug 28, 2024

In today's meeting, @karknu reminded us that we should avoid triggering this timeout for honest peers that are trying their best but failing to meet the expected performance (e.g. they suffer a huge GC pause).

This may be tricky. For example, if a peer offers an alternative chain but times out before filling that genesis window, then it's possible they're the honest peer suffering a performance hiccup. But that behavior matches a possible attack vector.

It seems possible to navigate this, e.g. by detaining only those peers whose timeout occurs during a disagreement (an honest peer won't often time out in ChainSync). But it certainly deserves some thought regarding robustness.
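
As a rough illustration of that rule, the Python sketch below delays the replacement only when the timeout happened during a disagreement. The `Timeout` record, its `during_disagreement` flag, and the 15-minute default are assumptions made for the example; how a "disagreement" would actually be detected is left open here, as it is in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeout:
    """Hypothetical record of a peer that was disconnected for timing out."""

    peer: str
    # Assumed to be supplied by the caller: True if the peer was offering
    # an alternative chain when it timed out.
    during_disagreement: bool


def replacement_delay_s(event, delay_s=15 * 60):
    """Seconds to wait before connecting to a replacement for this peer."""
    # A timeout outside a disagreement is treated as an honest hiccup
    # (e.g. a long GC pause), so the slot can be refilled immediately.
    return delay_s if event.during_disagreement else 0
```

A peer that stalls while no competing chain is on offer is replaced at once, whereas one that stalls in the middle of a disagreement has its replacement held back for the full delay.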


4 participants