[Feature] Handle permanent quorum loss scenario for ETCD multinode cluster #437

abdasgupta · 2022-09-21T08:06:20Z

Feature (What you would like to be added):

A multi-node ETCD cluster should be able to recover gracefully from permanent quorum loss after human intervention.

Motivation (Why is this needed?):
Permanent quorum loss happens when a majority (>= n/2+1) of the ETCD member nodes are down because their disks have failed completely. In such a quorum loss, the cluster does not recover automatically, because the ETCD data on disk is unavailable for most of the member nodes. The remaining nodes also can't admit new ETCD members, as the quorum is already lost, so the cluster can't process any new requests. A human operator must detect this state and recover the cluster gracefully, if possible without losing any data.
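To make the threshold concrete, here is a minimal sketch of the majority rule that etcd's Raft consensus relies on. The function names `quorum_size` and `has_quorum` are illustrative only and are not part of any etcd or etcd-druid API.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority (integer division)."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are still healthy to form a majority."""
    return healthy >= quorum_size(members)


# A 3-member cluster needs 2 healthy members; with only 1 left, quorum is lost.
for healthy in (3, 2, 1):
    print(f"3 members, {healthy} healthy -> quorum: {has_quorum(3, healthy)}")
```

With three members, losing two of them to disk failure leaves a single healthy member, which is below the quorum of two. That is the situation this issue describes.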

Approach/Hint to the implement solution (optional):
As this situation is highly unlikely to happen in a running production cluster, we have yet to decide on a concrete design that would serve the purpose. One thing we have decided so far is that in such a scenario a human operator has to intervene and decide what to do. We are already preparing a playbook for the human operator to follow in this scenario. The playbook mainly asks the operator to execute the following steps:

1. Scale down the STS replicas to 0.
2. Delete all 3 PVCs.
3. Scale up the STS to 1.
4. Wait for the single-member etcd to come up, download snapshots and start running normally.
5. Scale up the STS to the desired number of replicas.
6. Verify that the cluster is formed correctly.
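The steps above could be sketched as a dry-run command list, for example as below. This is only an illustration and not the actual playbook. The StatefulSet name `etcd-main`, the namespace `shoot--example` and the PVC names are hypothetical placeholders. `kubectl rollout status` is used as a rough readiness check and assumes the StatefulSet uses the RollingUpdate strategy. It does not confirm that snapshots were restored or that etcd membership is correct, so the final verification remains a manual step.

```python
import shlex
from typing import List


def recovery_commands(
    sts: str, namespace: str, pvcs: List[str], replicas: int
) -> List[List[str]]:
    """Build the kubectl commands for the playbook steps, in order.

    Nothing is executed here; the caller decides whether to run them.
    """
    scale = ["kubectl", "-n", namespace, "scale", "statefulset", sts]
    wait = ["kubectl", "-n", namespace, "rollout", "status", f"statefulset/{sts}"]
    return [
        scale + ["--replicas=0"],                        # step 1: scale down to 0
        ["kubectl", "-n", namespace, "delete", "pvc", *pvcs],  # step 2: delete PVCs
        scale + ["--replicas=1"],                        # step 3: single member
        wait,                                            # step 4: wait until ready
        scale + [f"--replicas={replicas}"],              # step 5: desired replicas
        wait,                                            # step 6: rough check only
    ]


# Hypothetical names, for illustration only.
example = recovery_commands(
    sts="etcd-main",
    namespace="shoot--example",
    pvcs=["etcd-main-0-pvc", "etcd-main-1-pvc", "etcd-main-2-pvc"],
    replicas=3,
)
for cmd in example:
    print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the operator in control of each step, which matches the intent that a human decides how to proceed.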

We may need to add some intermediate steps to the playbook above, based on the findings from our testing. We may also automate some of the steps to ease the work of the human operator. But even if we automate, we should not delete the PVs automatically; instead, we should preserve the old data of the remaining ETCD members in a separate folder (this will require some parts of #382).

Additionally, we are also considering the following:

Let the leader always take a final incremental snapshot after it loses leadership (independent, but relevant nonetheless: we should avoid double-application of the same revisions, e.g. 500-2000 should trump 500-1000). If we always have the latest snapshots in our backup bucket, even in the quorum loss case, we don't need to worry about the PVCs. We can safely discard all the PVCs and restore the cluster from the backup bucket.
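One way to picture the "500-2000 should trump 500-1000" rule is a selection pass over incremental snapshots before restoration. The sketch below is an assumption about how such a rule could look, not how the backup-restore component implements it. It represents each snapshot as a hypothetical `(first_revision, last_revision)` pair and only drops snapshots that are fully covered by an already selected one. Partially overlapping ranges would still need per-revision filtering during apply, which is not shown.

```python
from typing import List, Tuple

# (first_revision, last_revision), both inclusive; a hypothetical representation.
Snapshot = Tuple[int, int]


def select_delta_snapshots(snapshots: List[Snapshot]) -> List[Snapshot]:
    """Drop incremental snapshots whose revisions are already fully covered."""
    # Sort by first revision; for equal starts, put the widest range first.
    ordered = sorted(snapshots, key=lambda s: (s[0], -s[1]))
    selected: List[Snapshot] = []
    covered_up_to = -1
    for first, last in ordered:
        if last <= covered_up_to:
            continue  # every revision in this snapshot is already selected
        selected.append((first, last))
        covered_up_to = last
    return selected


print(select_delta_snapshots([(1, 499), (500, 1000), (500, 2000)]))
```

Here the snapshot covering revisions 500-1000 is skipped because 500-2000 already contains all of its revisions, so none of them would be applied twice.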

abdasgupta · 2022-11-14T06:01:55Z

We are not handling permanent quorum loss automatically as of now. If the ETCD cluster faces a permanent quorum loss, a human operator needs to intervene. We wrote a playbook to guide the operator in bringing the cluster back up after permanent quorum loss.

I am closing this issue for now. If we need to add any additional flow for handling permanent quorum loss in a more automated way, we will raise another issue and track our progress there.

abdasgupta added the kind/enhancement Enhancement, improvement, extension label Sep 21, 2022

abdasgupta mentioned this issue Oct 10, 2022

Stopped reconciliation if quorumloss annotation is set by operator. #446

Merged

ashwani2k added the release/ga Planned for GA(General Availability) release of the Feature label Oct 11, 2022

abdasgupta closed this as completed Nov 14, 2022

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 14, 2022
