[EXPERIMENT] stub test for harmless=true #1555
base: main
pjfanning commented Nov 8, 2024 (edited)
- relates to Clustering issues leading to all nodes being downed #578
- basic test that now watches for the quarantine event, plus an experimental change that tries to suppress that quarantine event when harmless=true
Update OutboundIdleShutdownSpec.scala
Without an active SBR, no node will be shut down: it is the SBR that downs itself when receiving
"eliminate quarantined association when not used (harmless=true)" in withAssociation {
  (remoteSystem, remoteAddress, _, localArtery, localProbe) =>
    // event to watch out for, indicator of the issue
    remoteSystem.eventStream.subscribe(testActor, classOf[ThisActorSystemQuarantinedEvent])

    val remoteEcho = remoteSystem.actorSelection("/user/echo").resolveOne(remainingOrDefault).futureValue
    val localAddress = RARP(system).provider.getDefaultAddress
    val localEchoRef = remoteSystem
      .actorSelection(RootActorPath(localAddress) / localProbe.ref.path.elements)
      .resolveOne(remainingOrDefault)
      .futureValue

    remoteEcho.tell("ping", localEchoRef)
    localProbe.expectMsg("ping")

    val association = localArtery.association(remoteAddress)
    val remoteUid = futureUniqueRemoteAddress(association).futureValue.uid
    localArtery.quarantine(remoteAddress, Some(remoteUid), "HarmlessTest", harmless = true)
    association.associationState.isQuarantined(remoteUid) shouldBe true

    eventually {
      // trigger sending a message from remote to local, which will trigger
      // local to wrongfully notify remote that it is quarantined
      remoteEcho.tell("ping", localEchoRef)
      // this is what remote emits when it learns it is quarantined by local.
      // This is not correct and is what (with SBR enabled) triggers killing the node.
      expectMsgType[ThisActorSystemQuarantinedEvent]
    }
}
I added the new test case, but I am aware that it needs to be moved to the cluster or cluster-tests project and that the Split Brain Resolver needs to be added. I am busy with other tasks, so don't expect me to get back to this for a while.
What would it add to move the test to the cluster or cluster-tests projects? To me this is a bug of the
I've added a change to InboundQuarantineCheck based on #578 (comment). This may not be the best solution but it seems to help in this one test case.
It seems good to me like that, thank you!
@raboof @mdedetrich @jrudolph what do you think about the runtime change? We could add a config setting that lets users control whether the new runtime check is enabled.
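
To make the question concrete, here is a rough, self-contained model of the idea: skip the "you are quarantined" notification when the quarantine was recorded as harmless, and put that behaviour behind a switch. This is only a sketch and not the code in this PR or in Pekko. `QuarantineRecord`, `Settings`, `suppressHarmlessNotification` and `shouldNotifyRemote` are hypothetical names, and it assumes the local side can tell whether a quarantine was marked harmless.

```scala
// Toy model only: none of these types exist in Pekko. All names are hypothetical.
object HarmlessQuarantineSketch {

  // Assumption: the local side remembers whether a quarantine was marked harmless.
  final case class QuarantineRecord(remoteUid: Long, harmless: Boolean)

  // Hypothetical user-facing switch. When false, every quarantined sender is
  // notified, which models the behaviour before the runtime change.
  final case class Settings(suppressHarmlessNotification: Boolean)

  // Decides whether the local node should tell the sender that it is quarantined.
  def shouldNotifyRemote(
      quarantined: Map[Long, QuarantineRecord],
      originUid: Long,
      settings: Settings): Boolean =
    quarantined.get(originUid) match {
      case None => false // sender is not quarantined, nothing to report
      case Some(record) if record.harmless && settings.suppressHarmlessNotification =>
        false // harmless quarantine: stay silent so the remote does not down itself
      case Some(_) => true // real quarantine, or suppression disabled
    }

  def main(args: Array[String]): Unit = {
    val quarantined = Map(42L -> QuarantineRecord(42L, harmless = true))
    println(shouldNotifyRemote(quarantined, 42L, Settings(suppressHarmlessNotification = true)))  // false
    println(shouldNotifyRemote(quarantined, 42L, Settings(suppressHarmlessNotification = false))) // true
    println(shouldNotifyRemote(quarantined, 7L, Settings(suppressHarmlessNotification = true)))   // false
  }
}
```

With the switch off, the model reproduces what the test above observes: a message from a harmlessly quarantined sender still produces the notification. Whether such a switch should default to on or off is the open question for reviewers.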