
[Feature] Handle quorum loss scenario by Druid in ETCD multinode #362

Closed
abdasgupta opened this issue Jun 28, 2022 · 5 comments

Comments

@abdasgupta
Contributor

abdasgupta commented Jun 28, 2022

Feature (What you would like to be added):
We need to reconcile the ETCD cluster in case of quorum loss.

Motivation (Why is this needed?):
When a majority of the ETCD members (n/2 + 1) are down in a cluster of size n, the cluster becomes unavailable; for example, a 3-member cluster loses quorum when 2 members are down. This situation is called quorum loss. To recover from a quorum loss scenario, ETCD Druid needs to delete the ETCD StatefulSet and redeploy it.

Approach/Hint to implement the solution (optional):
ETCD Druid will delete the StatefulSet and redeploy it with as many replicas as are specified in the ETCD CR.

Flow:

  1. There is already a provision for a health check of the cluster in the ETCD Druid code base. The health check is based on the member lease renewal time, and it also detects the quorum loss case. This health check will be used by Druid.
  2. Druid will monitor the cluster health at a regular interval (or should it monitor some other resource?).
    If quorum loss is detected, Druid will put an annotation on the ETCD CR to indicate that the cluster is being restarted (a rough sketch of this detection follows the list).
  3. Druid will delete the ETCD StatefulSet and delete the PVCs.
  4. Druid will deploy the StatefulSet with replicas = 1.
  5. Druid will wait for the single ETCD pod to restore properly (what’s the best way to detect if the first pod has restored properly?).
  6. Once the restoration of the first pod is complete, Druid will scale the StatefulSet up to the replicas mentioned in the ETCD CR. The rest will be handled by the ETCD scale-up mechanism.
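To make steps 1–2 a bit more concrete, here is a minimal sketch in Go (using a hypothetical `quorumLost` helper, not the actual Druid health-check code) of how member lease renewal times could be used to decide that quorum is lost:

```go
package health

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// quorumLost is a hypothetical helper: a member counts as healthy when its
// lease was renewed within `threshold`, and quorum is considered lost when
// fewer than a majority of the desired replicas are healthy.
func quorumLost(leases []coordinationv1.Lease, desiredReplicas int, threshold time.Duration, now time.Time) bool {
	healthy := 0
	for _, l := range leases {
		if l.Spec.RenewTime != nil && now.Sub(l.Spec.RenewTime.Time) <= threshold {
			healthy++
		}
	}
	quorum := desiredReplicas/2 + 1 // majority of the configured cluster size
	return healthy < quorum
}
```

If this returned true, the controller would then patch an annotation onto the ETCD CR (the exact annotation key is not specified here) so that the restart/recovery flow is triggered.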
@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Jun 28, 2022
@abdasgupta abdasgupta self-assigned this Jun 28, 2022
@abdasgupta abdasgupta modified the milestones: v0.11.0, v0.12.0 Jul 1, 2022
@ashwani2k ashwani2k added the release/beta Planned for Beta release of the Feature label Jul 6, 2022
@ashwani2k
Copy link
Collaborator

Implemented via #365 and #382.

@abdasgupta
Contributor Author

The scope of this issue is to restore a multi-node ETCD cluster after quorum is lost. The steps are as follows:

  1. There is already a provision for a health check of the cluster in the ETCD Druid code base. The health check is based on the member lease renewal time, and it also detects the quorum loss case. This health check is used by Druid.
  2. The Druid custodian controller monitors the cluster health at a regular interval.
    If quorum loss is detected, the custodian controller puts an annotation on the ETCD CR to indicate that the ETCD controller needs to fix the quorum loss.
  3. The ETCD controller scales the ETCD StatefulSet down to 0 and deletes the PVCs.
  4. The ETCD controller deploys the StatefulSet with replicas = 1.
  5. The ETCD controller scales the StatefulSet up to the replicas mentioned in the ETCD CR. The ETCD scale-up mechanism in ETCD BR makes sure that the first instance of the StatefulSet is up before the remaining members are added. (A rough sketch of steps 3–5 follows this list.)
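A rough controller-runtime sketch of steps 3–5, for illustration only; the function names, the PVC label selector, and the missing wait-for-readiness logic between steps are assumptions, not Druid's actual implementation:

```go
package recovery

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recoverFromQuorumLoss sketches steps 3-5: scale the StatefulSet down to 0,
// delete the member PVCs, bring up a single replica, and finally scale back
// to the desired size from the ETCD CR.
func recoverFromQuorumLoss(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, desiredReplicas int32, pvcSelector client.MatchingLabels) error {
	// Step 3: scale down to 0 and remove the (potentially corrupted) volumes.
	if err := scaleStatefulSet(ctx, c, sts, 0); err != nil {
		return err
	}
	if err := c.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
		client.InNamespace(sts.Namespace), pvcSelector); err != nil {
		return err
	}

	// Step 4: redeploy with a single replica; the first member restores from backup.
	if err := scaleStatefulSet(ctx, c, sts, 1); err != nil {
		return err
	}

	// Step 5: scale up to the replicas from the ETCD CR; etcd-backup-restore
	// adds the remaining members once the first one is up.
	return scaleStatefulSet(ctx, c, sts, desiredReplicas)
}

// scaleStatefulSet patches spec.replicas; a real implementation would wait
// for each step to complete instead of patching back-to-back.
func scaleStatefulSet(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, replicas int32) error {
	patch := client.MergeFrom(sts.DeepCopy())
	sts.Spec.Replicas = &replicas
	return c.Patch(ctx, sts, patch)
}
```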

A PR (#382) has been raised to take care of the above-mentioned steps. In step 5, we scale ETCD up from 1 to the number of replicas mentioned in the CR. This step needs help from the scale-up mechanism already implemented in ETCD BR (gardener/etcd-backup-restore#487). With that mechanism, a multi-node cluster should be able to scale up from 1 to 3 members, but that is not happening in coordination with the quorum loss handling we are implementing: the members newly added during scale-up are not joining the cluster and instead run into errors that need to be resolved on the ETCD BR side. I am currently investigating why these errors come from ETCD BR. The errors were captured while testing the quorum loss setup with our local ETCD cluster; this is a manual test so far. Once we run these tests successfully, we will test the quorum loss scenario with a local Gardener setup.

Steps of the manual test that we are doing with our local ETCD cluster:

  1. Run the Druid controller.
  2. Create the ETCD CR.
  3. 3 ETCD pods come up to form an ETCD cluster, as we set replicas to 3 in the ETCD CR.
  4. Put some data into the ETCD database so that incremental snapshots are taken (see the snippet after these steps).
  5. Apply a webhook that prevents new pods from being created.
  6. Delete two ETCD pods.
  7. Quorum loss happens. Druid detects the quorum loss after a 5-minute buffer time.
  8. Druid scales the StatefulSet down to 0, then scales it back up to 1.
  9. Remove the webhook that was blocking new pod creation.
  10. At first, the first pod comes up successfully.
  11. Then the remaining pods of the 3-node cluster try to join. Currently, they are not able to join the ETCD cluster due to some error.
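For step 4, a few writes with the Go etcd client are enough to generate new revisions for incremental snapshots; the endpoint and key names below are placeholders for a local setup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the locally reachable etcd client endpoint (placeholder address).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Write a handful of keys so that delta/incremental snapshots get taken.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err := cli.Put(ctx, fmt.Sprintf("/test/key-%d", i), "value")
		cancel()
		if err != nil {
			panic(err)
		}
	}
}
```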

@abdasgupta
Contributor Author

As a mitigation plan until the quorum loss handling is available, a DoD has to step in if quorum loss happens. They have to change the replicas in the ETCD CR to 1, scale the ETCD StatefulSet down to zero, delete all of the PVCs, and then scale the StatefulSet up to 1.

@timuthy
Member

timuthy commented Aug 30, 2022

They have to change the replicas in the ETCD CR to 1, scale the ETCD StatefulSet down to zero, delete all of the PVCs, and then scale the StatefulSet up to 1.

Do we really need to delete the volumes in all cases? Quorum loss can happen for different reasons, e.g. 2/3 etcd pods were evicted due to resource shortage. In that case deleting the volumes won't help and is even unnecessary.
Also, there are etcds without backup: can we recover manually from quorum loss without deleting the volumes at all? (see example)
Even if the operators may delete the volumes because backup-restore will take care of restoration, they should at least double-check the backup state beforehand.

@abdasgupta
Contributor Author

Closing this issue as we have split the problem into two parts and raised #436 and #437 to track them.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 21, 2022