
Added the logic for quorum loss scenario #382

Closed
wants to merge 1 commit

Conversation

@abdasgupta (Contributor) commented Jul 25, 2022

How to categorize this PR?

/area control-plane
/kind enhancement

What this PR does / why we need it:
This PR restores a multi-node ETCD cluster after quorum is lost. The steps are as follows (see the sketch after this list):

  1. The ETCD Druid code base already provides a health check for the cluster, based on the member lease renewal time. It also detects the quorum loss case. Druid uses this health check.
  2. The Druid custodian controller monitors the cluster health at a regular interval.
  3. If quorum loss is detected, the custodian controller puts an annotation on the ETCD CR to indicate that the ETCD controller needs to fix the quorum loss.
  4. The ETCD controller scales the ETCD statefulset down to 0 and deletes the PVCs.
  5. The ETCD controller deploys the statefulset with replicas = 1.
  6. The ETCD controller scales the statefulset up to the replicas mentioned in the ETCD CR. The scale-up mechanism in ETCD BR makes sure that the first instance of the StatefulSet is up before the rest are added.
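
To make the sequence concrete, here is a minimal sketch of how such a recovery flow could look with the controller-runtime client. The helper names (recoverFromQuorumLoss, scaleStatefulSet) are hypothetical and only illustrate the steps above; they are not the actual code in this PR.

// Sketch only (assumed imports: context, fmt, appsv1, corev1, druidv1alpha1,
// sigs.k8s.io/controller-runtime/pkg/client).
func (cmc *ClusterMgmtController) recoverFromQuorumLoss(ctx context.Context, etcd *druidv1alpha1.Etcd, sts *appsv1.StatefulSet) error {
	// Step 4: scale the StatefulSet down to 0 so no stale member keeps serving.
	if err := cmc.scaleStatefulSet(ctx, sts, 0); err != nil {
		return fmt.Errorf("could not scale statefulset down to 0: %w", err)
	}
	// Step 4 (cont.): delete the member PVCs so the cluster re-bootstraps from the latest backup.
	if err := cmc.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
		client.InNamespace(sts.Namespace),
		client.MatchingLabels(getMatchingLabels(sts))); err != nil {
		return fmt.Errorf("could not delete PVCs: %w", err)
	}
	// Step 5: bring up a single-member cluster first.
	if err := cmc.scaleStatefulSet(ctx, sts, 1); err != nil {
		return fmt.Errorf("could not scale statefulset up to 1: %w", err)
	}
	// Step 6: scale back to the replicas from the Etcd CR; etcd-backup-restore adds the rest.
	return cmc.scaleStatefulSet(ctx, sts, int32(etcd.Spec.Replicas))
}

func (cmc *ClusterMgmtController) scaleStatefulSet(ctx context.Context, sts *appsv1.StatefulSet, replicas int32) error {
	patch := client.MergeFrom(sts.DeepCopy())
	sts.Spec.Replicas = &replicas
	return cmc.Patch(ctx, sts, patch)
}
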
Which issue(s) this PR fixes:
Fixes [Feature] Handle quorum loss scenario by Druid in ETCD multinode #362

Special notes for your reviewer:

Release note:

1. A new annotation `gardener.cloud/quorum-loss=true` is used on the ETCD CR to indicate that quorum loss happened in the ETCD multi-node cluster. If there is no quorum loss, either `gardener.cloud/quorum-loss=false` is set or the annotation is not set at all.
1. An ETCD multi-node cluster can now recover from quorum loss. If quorum is lost for a multi-node cluster and the lost nodes cannot be scheduled for some period, then during that period a temporary one-node ETCD cluster will serve requests, and when the new nodes can be scheduled again, the remaining nodes will join the cluster without any manual intervention.

@abdasgupta abdasgupta requested a review from a team as a code owner July 25, 2022 07:48
@gardener-robot
@abdasgupta Labels area/todo, kind/todo do not exist.

@gardener-robot gardener-robot added the needs/review Needs review label Jul 25, 2022
@abdasgupta abdasgupta marked this pull request as draft July 25, 2022 07:48
@gardener-robot gardener-robot added size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Jul 25, 2022
@unmarshall (Contributor) left a comment

I could only review one file. Will push further review comments tomorrow.

type ClusterMgmtController struct {
client.Client
logger logr.Logger
ImageVector imagevector.ImageVector
Contributor

Looks like ImageVector is not used anywhere. Do you plan to use it later?

}

// SetupWithManager sets up manager with a new controller and cmc as the reconcile.Reconciler
func (cmc *ClusterMgmtController) SetupWithManager(mgr ctrl.Manager, workers int) error {
Contributor

Method not used anywhere. Should ideally be called to register the controller with the manager

}
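
Regarding the unused SetupWithManager above, a minimal sketch of the wiring that would be needed in the manager's setup code (the setupLogger variable and the worker count are assumptions for illustration):

// Sketch only: register the controller with the manager so it actually runs.
cmc := controllers.NewClusterMgmtController(mgr)
if err := cmc.SetupWithManager(mgr, 1); err != nil {
	setupLogger.Error(err, "unable to set up ClusterMgmtController")
	os.Exit(1)
}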

// NewClusterMgmtController creates a new ClusterMgmtController object
func NewClusterMgmtController(mgr manager.Manager) *ClusterMgmtController {
Contributor

Function not used anywhere currently

type ClusterMgmtController struct {
client.Client
logger logr.Logger
ImageVector imagevector.ImageVector
Contributor

ImageVector is not used anywhere. Do you intend to use it later?

// SetupWithManager sets up manager with a new controller and cmc as the reconcile.Reconciler
func (cmc *ClusterMgmtController) SetupWithManager(mgr ctrl.Manager, workers int) error {

ctrl, err := controller.New(clusterMgmtControllerName, mgr, controller.Options{
Contributor

Minor:
There is a more concise way to implement it. Then you do not have to explicitly handle the error returned from controller.New

	return ctrl.NewControllerManagedBy(mgr).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: workers,
		}).Watches(
		&source.Kind{Type: &coordinationv1.Lease{}},
		&handler.EnqueueRequestForOwner{OwnerType: &druidv1alpha1.Etcd{}, IsController: true},
		builder.WithPredicates(druidpredicates.IsMemberLease()),
	).Complete(cmc)

}, fmt.Errorf("cound not scale up statefulset to replica number : %v", err)
}

continue
Contributor

what is the need of continue here?


logger := cmc.logger.WithValues("etcd", kutil.Key(etcd.Namespace, etcd.Name).String())

// run a loop every 5 minutes that will monitor the cluster health and take action if members in the etcd cluster are down
Contributor

The comment says the loop runs every 5 minutes, but I could not see the delay between two loop runs. If there is no error, this loop can run forever, and then there is no chance of processing the next reconcile request. Maybe I am missing something. Please check.
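
A hedged sketch of one way to address this: instead of looping inside Reconcile, do a single health check per reconcile and let controller-runtime requeue after the interval. The helper names checkClusterHealth and handleQuorumLoss are hypothetical.

// Sketch only: one check per reconcile, re-queued every 5 minutes, so the
// worker goroutine is never blocked indefinitely.
healthy, err := cmc.checkClusterHealth(ctx, etcd) // hypothetical helper
if err != nil {
	return ctrl.Result{}, err
}
if !healthy {
	if err := cmc.handleQuorumLoss(ctx, etcd); err != nil { // hypothetical helper
		return ctrl.Result{}, err
	}
}
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil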

}, fmt.Errorf("cound not fetch statefulset: %v", err)
}

if _, err := controllerutils.GetAndCreateOrStrategicMergePatch(ctx, cmc.Client, sts, func() error {
Contributor

Ideally these steps of scaling down, deleting the PVCs, and scaling up to 1 should be in separate functions, and one can use the flow package, which is also used in gardener, for this.

}

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
labels := make(map[string]string)
Contributor

Use make with an initial size; here you have a fixed size of 2.

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
labels := make(map[string]string)

labels["name"] = sts.Labels["name"]
@unmarshall (Contributor) commented Jul 26, 2022

Will the labels name and instance always be present on the sts? If not, you should check for the existence of the key before adding it to the new map.
Suggestion:

func getMatchingLabels(sts *appsv1.StatefulSet) map[string]string {
	const nameLabelKey = "name"
	const instanceLabelKey = "instance"
	labels := make(map[string]string, 2)
	if v, ok := sts.Labels[nameLabelKey]; ok {
		labels[nameLabelKey] = v
	}
	if v, ok := sts.Labels[instanceLabelKey]; ok {
		labels[instanceLabelKey] = v
	}
	return labels
}

if err := cmc.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
client.InNamespace(sts.GetNamespace()),
client.MatchingLabels(getMatchingLabels(sts))); err != nil {
return ctrl.Result{
Contributor

If there is an error during deletion of the PVCs, the request will be re-queued after 10s, and after those 10 seconds you repeat the above step of setting the replicas to 0. This can be avoided since you have already queried the KAPI to get the sts: if the replicas are already 0, the scale-down step can be skipped.
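
A small sketch of the suggested guard (scaleStatefulSet is a hypothetical helper, not the PR's code):

// Skip the scale-down patch if a previous, partially successful attempt
// already brought the StatefulSet to 0 replicas.
if sts.Spec.Replicas == nil || *sts.Spec.Replicas != 0 {
	if err := cmc.scaleStatefulSet(ctx, sts, 0); err != nil {
		return ctrl.Result{RequeueAfter: 10 * time.Second}, err
	}
}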

}

// scale up the statefulset to ETCD replicas
if _, err := controllerutils.GetAndCreateOrStrategicMergePatch(ctx, cmc.Client, sts, func() error {
Contributor

In the previous step the scale-up is done to 1 replica, and immediately after that a scale-up to the original etcd replica count is attempted. Assuming that the above call only changes the spec and returns, without waiting for the scale-up to complete: is this intended, or should you wait for the first replica to be healthy before scaling up from 1 to 3?
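
If waiting is the intended behaviour, a hedged sketch of what that could look like (scaleStatefulSet is a hypothetical helper; the polling interval is arbitrary):

// Wait until the single-member cluster reports a ready replica before
// scaling the StatefulSet up to the replica count from the Etcd CR.
for {
	if err := cmc.Get(ctx, client.ObjectKeyFromObject(sts), sts); err != nil {
		return err
	}
	if sts.Status.ReadyReplicas >= 1 {
		break
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Second):
	}
}
return cmc.scaleStatefulSet(ctx, sts, int32(etcd.Spec.Replicas))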

@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Aug 3, 2022
@gardener-robot
@abdasgupta You need to rebase this pull request with the latest master branch. Please check.

@gardener-robot gardener-robot added area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension labels Aug 3, 2022
@abdasgupta abdasgupta marked this pull request as ready for review August 3, 2022 09:42
@gardener-robot gardener-robot added the size/l Size of pull request is large (see gardener-robot robot/bots/size.py) label Aug 3, 2022
@abdasgupta (Contributor, Author)

Added a flag to ETCD Druid that by default does not allow Druid to apply the quorum-loss annotation in case of quorum loss. The flag needs to be set to true if somebody wants Druid to handle quorum loss automatically.
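
A minimal sketch of how such a flag could be wired up with the standard flag package (the flag name, default, and config field are assumptions based on the EnableAutomaticQuorumLossHandling field discussed below, not necessarily what the PR uses):

// Sketch only: expose the opt-in as a command-line flag on the custodian controller.
flag.BoolVar(&config.EnableAutomaticQuorumLossHandling,
	"enable-automatic-quorum-loss-handling", false,
	"If true, druid annotates the Etcd CR on quorum loss so the etcd controller recovers the cluster automatically.")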

controllers/etcd_controller.go (resolved)
controllers/etcd_controller.go (outdated, resolved)
controllers/etcd_controller.go (outdated, resolved)
@@ -108,6 +122,37 @@ func (ec *EtcdCustodian) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.
return ctrl.Result{}, err
}

conLength := len(etcd.Status.Conditions)
if conLength > 0 && etcd.Status.Conditions[conLength-1].Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
Contributor

Can you please iterate through the condition list and find the condition with type Ready before using it in the if condition?
Assuming it to be the last condition in the array might be okay now, but future changes might cause this to fail

Contributor (Author)

I don't think we should consider any earlier conditions from the list. It might be very disastrous

Contributor

I don't mean considering any other condition. What I meant is that the Ready condition that gives Reason == "QuorumLost" might not always be at conLength-1 and it might be good if possible to parse through all the conditions and make sure

Contributor (Author)

Suppose there is a quorum loss, action is taken on it, and then some other condition appears. If we parse the whole list every time, we may pick up an old quorum loss case that has already been taken care of, is it not?

Contributor

The list would have the same conditions. The reasons would change depending on the state of the cluster.
When we recover from quorum loss for instance, the condition would be updated and hence picking up an old quorum loss case will not happen imo
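
To illustrate the suggestion from this thread, a hedged sketch of looking the Ready condition up by type rather than by position (the ConditionTypeReady constant is assumed from the druid v1alpha1 API):

// Find the Ready condition by type instead of relying on it being last.
var ready *druidv1alpha1.Condition
for i := range etcd.Status.Conditions {
	if etcd.Status.Conditions[i].Type == druidv1alpha1.ConditionTypeReady {
		ready = &etcd.Status.Conditions[i]
		break
	}
}
if ready != nil && ready.Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
	// take the quorum-loss measures
}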

conLength := len(etcd.Status.Conditions)
if conLength > 0 && etcd.Status.Conditions[conLength-1].Reason == "QuorumLost" && etcd.Spec.Replicas > 1 {
logger.Info("Quorum loss detected. Taking measures to fix it.")
if !ec.config.EnableAutomaticQuorumLossHandling {
Contributor

Can we rather have the rest of this block inside an if ec.config.EnableAutomaticQuorumLossHandling {...}, so that in cases where quorum is lost and the flag is not set, we don't block the rest of the code, like updateEtcdStatus and/or other functions, from running?

Contributor (Author)

Why would we want to update the ETCD status if the quorum is lost? If quorum is lost and the ETCD status is updated, the ETCD controller may start working to fix the ETCD. This will give a totally unintended result.

Contributor

Ah, okay
Fair enough

@@ -73,7 +73,7 @@ func (r *readyCheck) Check(ctx context.Context, etcd druidv1alpha1.Etcd) []Resul
renew := lease.Spec.RenewTime
if renew == nil {
r.logger.Info("Member hasn't acquired lease yet, still in bootstrapping phase", "name", lease.Name)
continue
return []Result{}
Contributor

Why is this needed?
Won't returning an empty result set if any lease has an empty RenewTime result in an empty member list?

Contributor (Author)

I couldn't follow what you asked.

Contributor

What I meant is: let's say the etcd status already has 2 entries in its member list. Now we add a third member; at the start, the new lease will have a nil renewTime. This returns []Result{} and will result in the member list in the etcd status being completely removed.
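
A sketch of the behaviour being argued for here (the surrounding loop shape is assumed from the diff context above):

// Skip members whose lease has not been renewed yet, keeping the Results
// already collected for the other members, instead of returning early.
for _, lease := range leases {
	if lease.Spec.RenewTime == nil {
		r.logger.Info("Member hasn't acquired lease yet, still in bootstrapping phase", "name", lease.Name)
		continue
	}
	// ...evaluate the renew time and append a Result for this member...
}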

pkg/health/status/check.go (outdated, resolved)
@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Sep 9, 2022
@aaronfern (Contributor) left a comment

Please add permissions for etcds in charts/etcd/templates/etcd-role.yaml as etcd-backup-restore is now intended to read the etcd resource (ref: here)

@abdasgupta (Contributor, Author) commented Sep 12, 2022

Please add permissions for etcds in charts/etcd/templates/etcd-role.yaml as etcd-backup-restore is now intended to read the etcd resource (ref: here)

Like this ? :

- apiGroups:
  - druid.gardener.cloud/v1alpha1
  resources:
  - etcds
  verbs:
  - get
  - list
  - patch
  - update
  - watch

@aaronfern (Contributor)
Yes, but afaik we don't need the v1alpha1:

- apiGroups:
  - druid.gardener.cloud
  resources:
  - etcds
  verbs:
  - get
  - list
  - patch
  - update
  - watch

@ashwani2k (Collaborator)

/hold pending test results

@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Sep 13, 2022
@ishan16696 (Member)
@abdasgupta I guess we can close this PR?

@abdasgupta (Contributor, Author)

We are closing this PR as we decided to handle quorum loss differently. We identified that two types of quorum loss can happen in ETCD clusters. One is transient quorum loss: our ETCD cluster in the control plane can recover automatically from it, where most of the ETCD pods are unavailable due to network problems, resource problems, etc. The other is permanent quorum loss, which involves permanent loss of the ETCD data directory. A human operator needs to intervene in that case and recover the ETCD cluster. We are writing a playbook for the human operator to follow. The playbook mainly contains the following steps:

  1. Scale down the ETCD statefulset to 0
  2. Delete all of the PVCs of the ETCD pods
  3. Scale up the ETCD statefulset to 1
  4. Verify that the single ETCD pod recovered correctly
  5. Scale up the ETCD statefulset to the replicas mentioned in the ETCD CR
  6. Verify the cluster

@abdasgupta abdasgupta closed this Oct 17, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 17, 2022
Labels
area/control-plane Control plane related
kind/enhancement Enhancement, improvement, extension
needs/changes Needs (more) changes
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)
needs/rebase Needs git rebase
needs/review Needs review
needs/second-opinion Needs second review by someone else
reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies
size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Handle quorum loss scenario by Druid in ETCD multinode
9 participants