K8SPXC-1152: restore stucks on operator restart #1610

pooknull · 2024-01-30T13:01:09Z

https://perconadev.atlassian.net/browse/K8SPXC-1152

CHANGE DESCRIPTION

Problem:
If the operator pod is restarted during a restore, the operator cannot continue the restore process.

Cause:
The restore process was not designed to resume after an operator restart.

Solution:
We should refactor the restore code so that the operator can catch up with the current state of the restore and continue.
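
To illustrate the idea of "catching up with the current state", here is a minimal sketch, not the code in this PR. It assumes the restore phase is persisted (for example in the restore object's status) and that each reconcile decides the next step only from that persisted phase plus what it observes in the cluster. The type `restoreState`, the struct `observed`, the function `nextState`, and the phase names are all hypothetical.

```go
package main

import "fmt"

// restoreState is a hypothetical phase marker persisted with the restore
// object; the names below are illustrative, not the operator's actual API.
type restoreState string

const (
	stateNew       restoreState = ""
	stateStopping  restoreState = "Stopping cluster"
	stateRestoring restoreState = "Restoring"
	stateStarting  restoreState = "Starting cluster"
	stateSucceeded restoreState = "Succeeded"
)

// observed is a hypothetical snapshot of what a reconcile can see in the
// cluster at the moment it runs.
type observed struct {
	clusterStopped bool
	restoreJobDone bool
	clusterReady   bool
}

// nextState derives the next phase from the persisted phase and the observed
// cluster only. No in-memory progress is needed, so a freshly restarted
// operator can continue from wherever the previous process stopped.
func nextState(cur restoreState, o observed) restoreState {
	switch cur {
	case stateNew:
		return stateStopping
	case stateStopping:
		if o.clusterStopped {
			return stateRestoring
		}
	case stateRestoring:
		if o.restoreJobDone {
			return stateStarting
		}
	case stateStarting:
		if o.clusterReady {
			return stateSucceeded
		}
	}
	// Nothing changed yet (or the restore is finished): stay in this phase.
	return cur
}

func main() {
	// Simulate an operator restart mid-restore: the only inputs are the
	// persisted phase and a fresh observation of the cluster.
	cur := stateRestoring
	cur = nextState(cur, observed{clusterStopped: true, restoreJobDone: true})
	fmt.Println(cur)
}
```

Because every transition is a pure function of persisted and observed state, running the same reconcile twice is harmless, which is the property a restart-safe restore needs.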

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?
Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are the manifests (crd/bundle) regenerated if needed?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PXC version?
Does the change support oldest and newest supported Kubernetes version?

inelpandzic · 2024-02-06T06:50:42Z

pkg/controller/pxcrestore/controller.go

 		if err != nil {
-			return rr, errors.Wrap(err, "run pitr")
+			switch err {
+			case errWaitingPods, errWaitingPVC:


Could we check these errors on the if line so we can avoid this inner switch?

egegunes · 2024-02-06T07:28:46Z

pkg/controller/pxcrestore/controller.go

+				}
+				return rr, nil
+			} else {
+				if cluster.Status.ObservedGeneration == cluster.Generation && cluster.Status.PXC.Status == api.AppStateReady {


not sure about this condition. why do we say waiting for cluster to start only if cluster.Status.PXC.Status is ready?

egegunes · 2024-02-06T07:29:50Z

pkg/controller/pxcrestore/controller.go

+	rr := reconcile.Result{
+		RequeueAfter: time.Second * 5,
+	}


honestly I'm not happy to depend on RequeueAfter but I guess there's no other way

There is a way to not depend on RequeueAfter, but it will take more time to implement. I would like to do it in a separate PR.

egegunes · 2024-02-06T07:32:43Z

pkg/controller/pxcrestore/restorer.go

+	if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {
+		if k8serrors.IsNotFound(err) {
+			initInProcess = false
+		}
 	}


Suggested change

-	if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); err != nil {
-		if k8serrors.IsNotFound(err) {
-			initInProcess = false
-		}
-	}
+	if err := s.k8sClient.Get(ctx, types.NamespacedName{Name: svc.Name, Namespace: svc.Namespace}, svc); k8serrors.IsNotFound(err) {
+		initInProcess = false
+	}

JNKPercona · 2024-02-07T15:24:22Z

Test name	Status
affinity-8-0	passed
auto-tuning-8-0	passed
cross-site-8-0	passed
demand-backup-cloud-8-0	passed
demand-backup-encrypted-with-tls-8-0	passed
demand-backup-8-0	passed
haproxy-5-7	passed
haproxy-8-0	passed
init-deploy-5-7	passed
init-deploy-8-0	passed
limits-8-0	passed
monitoring-2-0-8-0	passed
one-pod-5-7	passed
one-pod-8-0	passed
pitr-8-0	passed
pitr-gap-errors-8-0	failure
proxy-protocol-8-0	passed
proxysql-sidecar-res-limits-8-0	passed
pvc-resize-5-7	passed
pvc-resize-8-0	passed
recreate-8-0	passed
restore-to-encrypted-cluster-8-0	passed
scaling-proxysql-8-0	passed
scaling-8-0	passed
scheduled-backup-5-7	passed
scheduled-backup-8-0	passed
security-context-8-0	passed
smart-update1-8-0	passed
smart-update2-8-0	passed
storage-8-0	passed
tls-issue-cert-manager-ref-8-0	passed
tls-issue-cert-manager-8-0	passed
tls-issue-self-8-0	passed
upgrade-consistency-8-0	passed
upgrade-haproxy-8-0	passed
upgrade-proxysql-8-0	passed
users-5-7	passed
users-8-0	failure
validation-hook-8-0	passed
We run 39 out of 39

commit: f86ff81
image: perconalab/percona-xtradb-cluster-operator:PR-1610-f86ff811

K8SPXC-1152: restore stucks on operator restart

005b58d

https://perconadev.atlassian.net/browse/K8SPXC-1152

pull-request-size bot added the size/XXL 1000+ lines label Jan 30, 2024

pooknull added 7 commits January 30, 2024 15:03

fix unit test

d0609da

fix for pvc restores

d3d247b

refactor

05af382

fix tests

136bcc5

fix security-context test

8f512a4

add unit-test

34666c1

Merge branch 'main' into dev/K8SPXC-1152

87782c3

pooknull marked this pull request as ready for review February 5, 2024 13:56

pooknull requested review from tplavcic, nmarukovich, ptankov, hors, egegunes and inelpandzic as code owners February 5, 2024 13:56

Merge branch 'main' into dev/K8SPXC-1152

8573d5b

inelpandzic previously approved these changes Feb 6, 2024

egegunes reviewed Feb 6, 2024

improvements

7b5a420

pooknull dismissed inelpandzic’s stale review via 7b5a420 February 7, 2024 12:21

pooknull added 2 commits February 7, 2024 14:34

add TODO comment

7ea4009

Merge branch 'main' into dev/K8SPXC-1152

f86ff81

pooknull requested review from inelpandzic and egegunes February 7, 2024 12:36

inelpandzic approved these changes Feb 13, 2024
