
Added the logic for quorum loss scenario #382

Closed
wants to merge 1 commit

Conversation

@abdasgupta (Contributor) commented Jul 25, 2022

How to categorize this PR?

/area control-plane
/kind enhancement

What this PR does / why we need it:
This PR restores a multi-node ETCD cluster after quorum is lost. The steps are as follows (see the sketch after this list):

  1. The ETCD Druid code base already provides a health check for the cluster, based on the member lease renewal time. It also detects the quorum loss case. Druid uses this health check.
  2. The Druid custodian controller monitors the cluster health at a regular interval.
  3. If quorum loss is detected, the custodian controller puts an annotation on the ETCD CR to indicate that the ETCD controller needs to fix the quorum loss.
  4. The ETCD controller scales the ETCD statefulset down to 0 and deletes the PVCs.
  5. The ETCD controller deploys the statefulset with replicas = 1.
  6. The ETCD controller scales the statefulset up to the replicas mentioned in the ETCD CR. The scale-up mechanism in ETCD BR makes sure that the first instance of the StatefulSet is up before the rest are added.
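
To make the sequence concrete, here is a minimal sketch of how such a recovery flow could look with the controller-runtime client. The helper names (recoverFromQuorumLoss, scaleStatefulSet) are hypothetical and only illustrate the steps above; they are not the actual code in this PR.

// Sketch only (assumed imports: context, fmt, appsv1, corev1, druidv1alpha1,
// sigs.k8s.io/controller-runtime/pkg/client).
func (cmc *ClusterMgmtController) recoverFromQuorumLoss(ctx context.Context, etcd *druidv1alpha1.Etcd, sts *appsv1.StatefulSet) error {
	// Step 4: scale the StatefulSet down to 0 so no stale member keeps serving.
	if err := cmc.scaleStatefulSet(ctx, sts, 0); err != nil {
		return fmt.Errorf("could not scale statefulset down to 0: %w", err)
	}
	// Step 4 (cont.): delete the member PVCs so the cluster re-bootstraps from the latest backup.
	if err := cmc.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
		client.InNamespace(sts.Namespace),
		client.MatchingLabels(getMatchingLabels(sts))); err != nil {
		return fmt.Errorf("could not delete PVCs: %w", err)
	}
	// Step 5: bring up a single-member cluster first.
	if err := cmc.scaleStatefulSet(ctx, sts, 1); err != nil {
		return fmt.Errorf("could not scale statefulset up to 1: %w", err)
	}
	// Step 6: scale back to the replicas from the Etcd CR; etcd-backup-restore adds the rest.
	return cmc.scaleStatefulSet(ctx, sts, int32(etcd.Spec.Replicas))
}

func (cmc *ClusterMgmtController) scaleStatefulSet(ctx context.Context, sts *appsv1.StatefulSet, replicas int32) error {
	patch := client.MergeFrom(sts.DeepCopy())
	sts.Spec.Replicas = &replicas
	return cmc.Patch(ctx, sts, patch)
}
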
Which issue(s) this PR fixes:
Fixes [Feature] Handle quorum loss scenario by Druid in ETCD multinode #362

Special notes for your reviewer:

Release note:

1. A new annotation `gardener.cloud/quorum-loss=true` is used on the ETCD CR to indicate that quorum loss happened in the ETCD multi-node cluster. If there is no quorum loss, either `gardener.cloud/quorum-loss=false` is set or the annotation is not set at all.
1. An ETCD multi-node cluster can now recover from quorum loss. If quorum is lost for a multi-node cluster and the lost nodes cannot be scheduled for some period, then during that period a temporary one-node ETCD cluster will serve requests, and when the new nodes can be scheduled again, the remaining nodes will join the cluster without any manual intervention.

@abdasgupta abdasgupta requested a review from a team as a code owner July 25, 2022 07:48
@gardener-robot
@abdasgupta Labels area/todo, kind/todo do not exist.

@gardener-robot gardener-robot added the needs/review Needs review label Jul 25, 2022
@abdasgupta abdasgupta marked this pull request as draft July 25, 2022 07:48
@gardener-robot gardener-robot added size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Jul 25, 2022
@unmarshall (Contributor) left a comment

I could only review one file. Will push further review comments tomorrow.

type ClusterMgmtController struct {
client.Client
logger logr.Logger
ImageVector imagevector.ImageVector
Contributor

Looks like ImageVector is not used anywhere. Do you plan to use it later?

}

// SetupWithManager sets up manager with a new controller and cmc as the reconcile.Reconciler
func (cmc *ClusterMgmtController) SetupWithManager(mgr ctrl.Manager, workers int) error {
Contributor

Method not used anywhere. Should ideally be called to register the controller with the manager

}
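
Regarding the unused SetupWithManager above, a minimal sketch of the wiring that would be needed in the manager's setup code (the setupLogger variable and the worker count are assumptions for illustration):

// Sketch only: register the controller with the manager so it actually runs.
cmc := controllers.NewClusterMgmtController(mgr)
if err := cmc.SetupWithManager(mgr, 1); err != nil {
	setupLogger.Error(err, "unable to set up ClusterMgmtController")
	os.Exit(1)
}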

// NewClusterMgmtController creates a new ClusterMgmtController object
func NewClusterMgmtController(mgr manager.Manager) *ClusterMgmtController {
Contributor

Function not used anywhere currently

type ClusterMgmtController struct {
client.Client
logger logr.Logger
ImageVector imagevector.ImageVector
Contributor

ImageVector is not used anywhere. Do you intend to use it later?

// SetupWithManager sets up manager with a new controller and cmc as the reconcile.Reconciler
func (cmc *ClusterMgmtController) SetupWithManager(mgr ctrl.Manager, workers int) error {

ctrl, err := controller.New(clusterMgmtControllerName, mgr, controller.Options{
Contributor

Minor:
There is a more concise way to implement it. Then you do not have to explicitly handle the error returned from controller.New

	return ctrl.NewControllerManagedBy(mgr).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: workers,
		}).Watches(
		&source.Kind{Type: &coordinationv1.Lease{}},
		&handler.EnqueueRequestForOwner{OwnerType: &druidv1alpha1.Etcd{}, IsController: true},
		builder.WithPredicates(druidpredicates.IsMemberLease()),
	).Complete(cmc)

}, fmt.Errorf("cound not scale up statefulset to replica number : %v", err)
}

continue
Contributor

what is the need of continue here?


logger := cmc.logger.WithValues("etcd", kutil.Key(etcd.Namespace, etcd.Name).String())

// run a loop every 5 minutes that will monitor the cluster health and take action if members in the etcd cluster are down
Contributor

The comment says the loop runs every 5 minutes, but I could not see the delay between two loop runs. If there is no error, this loop can run forever, and then there is no chance of processing the next reconcile request. Maybe I am missing something. Please check.
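
A hedged sketch of one way to address this: instead of looping inside Reconcile, do a single health check per reconcile and let controller-runtime requeue after the interval. The helper names checkClusterHealth and handleQuorumLoss are hypothetical.

// Sketch only: one check per reconcile, re-queued every 5 minutes, so the
// worker goroutine is never blocked indefinitely.
healthy, err := cmc.checkClusterHealth(ctx, etcd) // hypothetical helper
if err != nil {
	return ctrl.Result{}, err
}
if !healthy {
	if err := cmc.handleQuorumLoss(ctx, etcd); err != nil { // hypothetical helper
		return ctrl.Result{}, err
	}
}
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil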

}, fmt.Errorf("cound not fetch statefulset: %v", err)
}

if _, err := controllerutils.GetAndCreateOrStrategicMergePatch(ctx, cmc.Client, sts, func() error {
Contributor

Ideally these steps of scaling down, deleting the PVCs, and scaling up to 1 should be in separate functions, and one can use the flow package, which is also used in gardener, for this.

}

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
labels := make(map[string]string)
Contributor

Use make with an initial size; here you have a fixed size of 2.

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
labels := make(map[string]string)

labels["name"] = sts.Labels["name"]
@unmarshall (Contributor) commented Jul 26, 2022

Will the labels name and instance always be present on the sts? If not, you should check for the existence of the key before adding it to the new map.
Suggestion:

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
	const nameLabelKey = "name"
	const instanceLabelKey = "instance"
	labels := make(map[string]string, 2)
	if v, ok := sts.Labels[nameLabelKey]; ok {
		labels[nameLabelKey] = v
	}
	if v, ok := sts.Labels[instanceLabelKey]; ok {
		labels[instanceLabelKey] = v
	}
	return labels
}

if err := cmc.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
client.InNamespace(sts.GetNamespace()),
client.MatchingLabels(getMatchingLabels(sts))); err != nil {
return ctrl.Result{
Contributor

If there is an error during deletion of the PVCs, the request will be re-queued after 10s, and after those 10 seconds you repeat the above step of setting the replicas to 0. This can be avoided since you have already queried the KAPI to get the sts: if the replicas are already 0, the scale-down step can be skipped.
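
A small sketch of the suggested guard (scaleStatefulSet is a hypothetical helper, not the PR's code):

// Skip the scale-down patch if a previous, partially successful attempt
// already brought the StatefulSet to 0 replicas.
if sts.Spec.Replicas == nil || *sts.Spec.Replicas != 0 {
	if err := cmc.scaleStatefulSet(ctx, sts, 0); err != nil {
		return ctrl.Result{RequeueAfter: 10 * time.Second}, err
	}
}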

}

// scale up the statefulset to ETCD replicas
if _, err := controllerutils.GetAndCreateOrStrategicMergePatch(ctx, cmc.Client, sts, func() error {
Contributor

In the previous step the scale-up is done to 1 replica, and immediately after that a scale-up to the original etcd replica count is attempted. Assuming that the above call only changes the spec and returns, without waiting for the scale-up to complete: is this intended, or should you wait for the first replica to be healthy before scaling up from 1 to 3?
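
If waiting is the intended behaviour, a hedged sketch of what that could look like (scaleStatefulSet is a hypothetical helper; the polling interval is arbitrary):

// Wait until the single-member cluster reports a ready replica before
// scaling the StatefulSet up to the replica count from the Etcd CR.
for {
	if err := cmc.Get(ctx, client.ObjectKeyFromObject(sts), sts); err != nil {
		return err
	}
	if sts.Status.ReadyReplicas >= 1 {
		break
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Second):
	}
}
return cmc.scaleStatefulSet(ctx, sts, int32(etcd.Spec.Replicas))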

@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Aug 3, 2022
@gardener-robot
@abdasgupta You need to rebase this pull request with the latest master branch. Please check.

@gardener-robot gardener-robot added area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension labels Aug 3, 2022
@abdasgupta abdasgupta marked this pull request as ready for review August 3, 2022 09:42
@gardener-robot gardener-robot added the size/l Size of pull request is large (see gardener-robot robot/bots/size.py) label Aug 3, 2022
@abdasgupta (Contributor, Author)

Added a flag to ETCD Druid that by default does not allow Druid to apply the quorum-loss annotation in case of quorum loss. The flag needs to be set to true if somebody wants Druid to handle quorum loss automatically.
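
A minimal sketch of how such a flag could be wired up with the standard flag package (the flag name, default, and config field are assumptions based on the EnableAutomaticQuorumLossHandling field discussed below, not necessarily what the PR uses):

// Sketch only: expose the opt-in as a command-line flag on the custodian controller.
flag.BoolVar(&config.EnableAutomaticQuorumLossHandling,
	"enable-automatic-quorum-loss-handling", false,
	"If true, druid annotates the Etcd CR on quorum loss so the etcd controller recovers the cluster automatically.")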

controllers/etcd_controller.go (resolved)
controllers/etcd_controller.go (outdated, resolved)
controllers/etcd_controller.go (outdated, resolved)
@@ -108,6 +122,37 @@ func (ec *EtcdCustodian) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.
return ctrl.Result{}, err
}

conLength := len(etcd.Status.Conditions)
if conLength > 0 && etcd.Status.Conditions[conLength-1].Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
Contributor

Can you please iterate through the condition list and find the condition with type Ready before using it in the if condition?
Assuming it to be the last condition in the array might be okay now, but future changes might cause this to fail

Contributor (Author)

I don't think we should consider any earlier conditions from the list. It might be very disastrous

Contributor

I don't mean considering any other condition. What I meant is that the Ready condition that gives Reason == "QuorumLost" might not always be at conLength-1 and it might be good if possible to parse through all the conditions and make sure

Contributor (Author)

Suppose there is a quorum loss, action is taken on it, and then some other condition appears. If we parse the whole list every time, we may pick up an old quorum loss case that has already been taken care of, is it not?

Contributor

The list would have the same conditions. The reasons would change depending on the state of the cluster.
When we recover from quorum loss for instance, the condition would be updated and hence picking up an old quorum loss case will not happen imo
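
To illustrate the suggestion from this thread, a hedged sketch of looking the Ready condition up by type rather than by position (the ConditionTypeReady constant is assumed from the druid v1alpha1 API):

// Find the Ready condition by type instead of relying on it being last.
var ready *druidv1alpha1.Condition
for i := range etcd.Status.Conditions {
	if etcd.Status.Conditions[i].Type == druidv1alpha1.ConditionTypeReady {
		ready = &etcd.Status.Conditions[i]
		break
	}
}
if ready != nil && ready.Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
	// take the quorum-loss measures
}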

conLength := len(etcd.Status.Conditions)
if conLength > 0 && etcd.Status.Conditions[conLength-1].Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
logger.Info("Quorum loss detected. Taking measures to fix it.")
if !ec.config.EnableAutomaticQuorumLossHandling {
Contributor

Can we rather have the rest of this block inside an if ec.config.EnableAutomaticQuorumLossHandling {...}, so that in cases where quorum is lost and the flag is not set, we don't block the rest of the code, like updateEtcdStatus and/or other functions, from running?

Contributor (Author)

Why would we want to update the ETCD status if the quorum is lost? If quorum is lost and the ETCD status is updated, the ETCD controller may start working to fix the ETCD. This will give a totally unintended result.

Contributor

Ah, okay
Fair enough

@@ -73,7 +73,7 @@ func (r *readyCheck) Check(ctx context.Context, etcd druidv1alpha1.Etcd) []Resul
renew := lease.Spec.RenewTime
if renew == nil {
r.logger.Info("Member hasn't acquired lease yet, still in bootstrapping phase", "name", lease.Name)
continue
return []Result{}
Contributor

Why is this needed?
Won't returning an empty result set if any lease has an empty RenewTime result in an empty member list?

Contributor (Author)

I couldn't follow what you asked.

Contributor

What I meant is: let's say the etcd status already has 2 entries in its member list. Now we add a third member; at the start, the new lease will have a nil renewTime. This returns []Result{} and will result in the member list in the etcd status being completely removed.
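
A sketch of the behaviour being argued for here (the surrounding loop shape is assumed from the diff context above):

// Skip members whose lease has not been renewed yet, keeping the Results
// already collected for the other members, instead of returning early.
for _, lease := range leases {
	if lease.Spec.RenewTime == nil {
		r.logger.Info("Member hasn't acquired lease yet, still in bootstrapping phase", "name", lease.Name)
		continue
	}
	// ...evaluate the renew time and append a Result for this member...
}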

pkg/health/status/check.go (outdated, resolved)
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Sep 9, 2022
@aaronfern (Contributor) left a comment

Please add permissions for etcds in charts/etcd/templates/etcd-role.yaml as etcd-backup-restore is now intended to read the etcd resource (ref: here)

@abdasgupta (Contributor, Author) commented Sep 12, 2022

Please add permissions for etcds in charts/etcd/templates/etcd-role.yaml as etcd-backup-restore is now intended to read the etcd resource (ref: here)

Like this ? :

- apiGroups:
  - druid.gardener.cloud/v1alpha1
  resources:
  - etcds
  verbs:
  - get
  - list
  - patch
  - update
  - watch

@aaronfern (Contributor)
Yes, but afaik we don't need the v1alpha1:

- apiGroups:
  - druid.gardener.cloud
  resources:
  - etcds
  verbs:
  - get
  - list
  - patch
  - update
  - watch

@ashwani2k (Collaborator)

/hold pending test results

@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Sep 13, 2022
@ishan16696 (Member)
@abdasgupta I guess we can close this PR?

@abdasgupta (Contributor, Author)

We are closing this PR as we decided to handle quorum loss differently. We identified that two types of quorum loss can happen in ETCD clusters. One is transient quorum loss: our ETCD cluster in the control plane can recover automatically from it, where most of the ETCD pods are unavailable due to network problems, resource problems, etc. The other is permanent quorum loss, which involves permanent loss of the ETCD data directory. A human operator needs to intervene in that case and recover the ETCD cluster. We are writing a playbook for the human operator to follow. The playbook mainly contains the following steps:

  1. Scale down the ETCD statefulset to 0
  2. Delete all of the PVCs of the ETCD pods
  3. Scale up the ETCD statefulset to 1
  4. Verify that the single ETCD pod recovered correctly
  5. Scale up the ETCD statefulset to the replicas mentioned in the ETCD CR
  6. Verify the cluster

@abdasgupta abdasgupta closed this Oct 17, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 17, 2022
Labels
area/control-plane Control plane related
kind/enhancement Enhancement, improvement, extension
needs/changes Needs (more) changes
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)
needs/rebase Needs git rebase
needs/review Needs review
needs/second-opinion Needs second review by someone else
reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies
size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Handle quorum loss scenario by Druid in ETCD multinode
9 participants