[Feature] Handle permanent quorum loss scenario for ETCD multinode cluster #437
Feature (What you would like to be added):
ETCD multinode cluster should be able to recover from permanent quorum loss in a graceful way after human intervention.
Motivation (Why is this needed?):
Permanent quorum loss happens when so many ETCD member nodes are permanently down, for example due to complete disk failure, that fewer than a quorum (floor(n/2)+1 of n members) remain. In this state the cluster does not recover automatically, because the ETCD data on disk is unavailable for the failed members. The remaining nodes cannot admit new ETCD members either, since quorum is already lost, and the cluster cannot process any new requests. A human operator must detect this state and recover the cluster gracefully, ideally without losing any data.
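To make the quorum arithmetic concrete, here is a minimal sketch of the majority rule described above. The function names `quorum_size` and `quorum_lost` are illustrative only and are not part of ETCD or of any existing tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority of the cluster."""
    if members < 1:
        raise ValueError("cluster must have at least one member")
    return members // 2 + 1


def quorum_lost(members: int, failed: int) -> bool:
    """True when the surviving members can no longer form a majority."""
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    return members - failed < quorum_size(members)


if __name__ == "__main__":
    # Print how many failures each cluster size tolerates before quorum is lost.
    for n in (3, 4, 5):
        tolerated = n - quorum_size(n)
        print(f"{n} members: quorum={quorum_size(n)}, tolerates {tolerated} failure(s)")
```

Note that an even-sized cluster loses quorum as soon as half of its members fail: with 4 members the quorum is 3, so 2 failures are already enough.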
Approach/Hint to the implement solution (optional):
Because this situation is highly unlikely in a running production cluster, we have not yet decided on a concrete design. What we have decided so far is that, in such a scenario, a human operator has to intervene and decide what to do. We are already preparing a playbook for the operator to follow. The playbook mainly asks the operator to execute the following steps:
Based on the findings from our testing, we may need to add intermediate steps to the playbook above. We may also automate some of its steps to ease the operator's work. Even with automation, we should not delete the PVs automatically; instead, the old data of the remaining ETCD members should be preserved in a separate folder (this will require some parts of #382).
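The "preserve instead of delete" idea could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `preserve_data_dir`, the directory layout, and the timestamp suffix are hypothetical and do not reflect a decided design or the changes planned in #382.

```python
import shutil
import time
from pathlib import Path


def preserve_data_dir(data_dir: str, backup_root: str) -> Path:
    """Move an ETCD member's data directory aside instead of deleting it.

    Both paths are hypothetical; the old data ends up in
    <backup_root>/<original name>-<UTC timestamp>.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data directory not found: {src}")

    dest_root = Path(backup_root)
    dest_root.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest = dest_root / f"{src.name}-{stamp}"
    if dest.exists():
        # Refuse to overwrite data preserved by an earlier recovery attempt.
        raise FileExistsError(f"refusing to overwrite existing copy: {dest}")

    # The destination does not exist yet, so this renames (or copies and then
    # removes) the whole directory tree to the new location.
    shutil.move(str(src), str(dest))
    return dest
```

Moving rather than copying keeps the original path free for a fresh member data directory, while the old data stays available for inspection or manual recovery.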
Additionally, we are also considering the following implementations: