
[Feature] Handle quorum loss scenario by Druid in ETCD multinode #362

Closed
abdasgupta opened this issue Jun 28, 2022 · 5 comments

Comments

@abdasgupta
Contributor

abdasgupta commented Jun 28, 2022

Feature (What you would like to be added):
We need to reconcile the ETCD cluster in case of quorum loss.

Motivation (Why is this needed?):
When a majority of the ETCD members (n/2 + 1) are down in a cluster of size n, the cluster becomes unavailable; for example, a 3-member cluster loses quorum when 2 members are down. This situation is called quorum loss. To recover from a quorum loss scenario, ETCD Druid needs to delete the ETCD StatefulSet and redeploy it.

Approach/Hint to implement the solution (optional):
ETCD Druid will delete the StatefulSet and redeploy it with as many replicas as are specified in the ETCD CR.

Flow:

  1. There is already a provision for a health check of the cluster in the ETCD Druid code base. The health check is based on the member lease renewal time, and it also detects the quorum loss case. This health check will be used by Druid.
  2. Druid will monitor the cluster health at a regular interval (or should it monitor some other resource?).
    If quorum loss is detected, Druid will put an annotation on the ETCD CR to indicate that the cluster is being restarted (a rough sketch of this detection follows the list).
  3. Druid will delete the ETCD StatefulSet and delete the PVCs.
  4. Druid will deploy the StatefulSet with replicas = 1.
  5. Druid will wait for the single ETCD pod to restore properly (what’s the best way to detect if the first pod has restored properly?).
  6. Once the restoration of the first pod is complete, Druid will scale the StatefulSet up to the replicas mentioned in the ETCD CR. The rest will be handled by the ETCD scale-up mechanism.
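To make steps 1–2 a bit more concrete, here is a minimal sketch in Go (using a hypothetical `quorumLost` helper, not the actual Druid health-check code) of how member lease renewal times could be used to decide that quorum is lost:

```go
package health

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// quorumLost is a hypothetical helper: a member counts as healthy when its
// lease was renewed within `threshold`, and quorum is considered lost when
// fewer than a majority of the desired replicas are healthy.
func quorumLost(leases []coordinationv1.Lease, desiredReplicas int, threshold time.Duration, now time.Time) bool {
	healthy := 0
	for _, l := range leases {
		if l.Spec.RenewTime != nil && now.Sub(l.Spec.RenewTime.Time) <= threshold {
			healthy++
		}
	}
	quorum := desiredReplicas/2 + 1 // majority of the configured cluster size
	return healthy < quorum
}
```

If this returned true, the controller would then patch an annotation onto the ETCD CR (the exact annotation key is not specified here) so that the restart/recovery flow is triggered.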
@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Jun 28, 2022
@abdasgupta abdasgupta self-assigned this Jun 28, 2022
@abdasgupta abdasgupta modified the milestones: v0.11.0, v0.12.0 Jul 1, 2022
@ashwani2k ashwani2k added the release/beta Planned for Beta release of the Feature label Jul 6, 2022
@ashwani2k
Copy link
Collaborator

Implemented via #365 and #382.

@abdasgupta
Contributor Author

The scope of this issue is to restore a multi-node ETCD cluster after quorum is lost. The steps are as follows:

  1. There is already a provision for a health check of the cluster in the ETCD Druid code base. The health check is based on the member lease renewal time, and it also detects the quorum loss case. This health check is used by Druid.
  2. The Druid custodian controller monitors the cluster health at a regular interval.
    If quorum loss is detected, the custodian controller puts an annotation on the ETCD CR to indicate that the ETCD controller needs to fix the quorum loss.
  3. The ETCD controller scales the ETCD StatefulSet down to 0 and deletes the PVCs.
  4. The ETCD controller deploys the StatefulSet with replicas = 1.
  5. The ETCD controller scales the StatefulSet up to the replicas mentioned in the ETCD CR. The ETCD scale-up mechanism in ETCD BR makes sure that the first instance of the StatefulSet is up before the remaining members are added. (A rough sketch of steps 3–5 follows this list.)
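A rough controller-runtime sketch of steps 3–5, for illustration only; the function names, the PVC label selector, and the missing wait-for-readiness logic between steps are assumptions, not Druid's actual implementation:

```go
package recovery

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recoverFromQuorumLoss sketches steps 3-5: scale the StatefulSet down to 0,
// delete the member PVCs, bring up a single replica, and finally scale back
// to the desired size from the ETCD CR.
func recoverFromQuorumLoss(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, desiredReplicas int32, pvcSelector client.MatchingLabels) error {
	// Step 3: scale down to 0 and remove the (potentially corrupted) volumes.
	if err := scaleStatefulSet(ctx, c, sts, 0); err != nil {
		return err
	}
	if err := c.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
		client.InNamespace(sts.Namespace), pvcSelector); err != nil {
		return err
	}

	// Step 4: redeploy with a single replica; the first member restores from backup.
	if err := scaleStatefulSet(ctx, c, sts, 1); err != nil {
		return err
	}

	// Step 5: scale up to the replicas from the ETCD CR; etcd-backup-restore
	// adds the remaining members once the first one is up.
	return scaleStatefulSet(ctx, c, sts, desiredReplicas)
}

// scaleStatefulSet patches spec.replicas; a real implementation would wait
// for each step to complete instead of patching back-to-back.
func scaleStatefulSet(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, replicas int32) error {
	patch := client.MergeFrom(sts.DeepCopy())
	sts.Spec.Replicas = &replicas
	return c.Patch(ctx, sts, patch)
}
```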

A PR (#382) has been raised to take care of the above-mentioned steps. In step 5, we scale ETCD up from 1 to the number of replicas mentioned in the CR. This step needs help from the scale-up mechanism already implemented in ETCD BR (gardener/etcd-backup-restore#487). With that mechanism, a multi-node cluster should be able to scale up from 1 to 3 members, but that is not happening in coordination with the quorum loss handling we are implementing: the members newly added during scale-up are not joining the cluster and instead run into errors that need to be resolved on the ETCD BR side. I am currently investigating why these errors come from ETCD BR. The errors were captured while testing the quorum loss setup with our local ETCD cluster; this is a manual test so far. Once we run these tests successfully, we will test the quorum loss scenario with a local Gardener setup.

Steps of the manual test that we are doing with our local ETCD cluster:

  1. Run the Druid controller.
  2. Create the ETCD CR.
  3. 3 ETCD pods come up to form an ETCD cluster, as we set replicas to 3 in the ETCD CR.
  4. Put some data into the ETCD database so that incremental snapshots are taken (see the snippet after these steps).
  5. Apply a webhook that prevents new pods from being created.
  6. Delete two ETCD pods.
  7. Quorum loss happens. Druid detects the quorum loss after a 5-minute buffer time.
  8. Druid scales the StatefulSet down to 0, then scales it back up to 1.
  9. Remove the webhook that was blocking new pod creation.
  10. At first, the first pod comes up successfully.
  11. Then the remaining pods of the 3-node cluster try to join. Currently, they are not able to join the ETCD cluster due to some error.
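For step 4, a few writes with the Go etcd client are enough to generate new revisions for incremental snapshots; the endpoint and key names below are placeholders for a local setup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the locally reachable etcd client endpoint (placeholder address).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Write a handful of keys so that delta/incremental snapshots get taken.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err := cli.Put(ctx, fmt.Sprintf("/test/key-%d", i), "value")
		cancel()
		if err != nil {
			panic(err)
		}
	}
}
```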

@abdasgupta
Contributor Author

As a mitigation plan until the quorum loss handling is available, a DoD has to step in if quorum loss happens. They have to change the replicas in the ETCD CR to 1, scale the ETCD StatefulSet down to zero, delete all of the PVCs, and then scale the StatefulSet up to 1.

@timuthy
Member

timuthy commented Aug 30, 2022

They have to change the replicas in the ETCD CR to 1, scale the ETCD StatefulSet down to zero, delete all of the PVCs, and then scale the StatefulSet up to 1.

Do we really need to delete the volumes in all cases? Quorum loss can happen for different reasons, e.g. 2/3 etcd pods were evicted due to resource shortage. In that case deleting the volumes won't help and is even unnecessary.
Also, there are etcds without backup: can we recover manually from quorum loss without deleting the volumes at all? (see example)
Even if the operators may delete the volumes because backup-restore will take care of restoration, they should at least double-check the backup state beforehand.

@abdasgupta
Contributor Author

Closing this issue as we have split the problem into two parts and raised #436 and #437 to track them.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 21, 2022