Pod fails to start after modifying the Nexus resource #191

Closed
bszeti opened this issue Dec 1, 2020 · 7 comments · Fixed by #192
bszeti commented Dec 1, 2020

Describe the bug
Modifying the Nexus resource triggers a new Deployment rollout, but the new Pod can't start (CrashLoopBackOff) because the previous one is still holding /nexus-data/lock. The problem is probably caused by the Deployment using spec.strategy=RollingUpdate. Using "Recreate" may help, so the previous Nexus instance is shut down before the new one is created.
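For reference, the suggested change amounts to something like this on the generated Deployment (a minimal sketch using standard Kubernetes fields; the rest of the spec is omitted):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nexus3
spec:
  strategy:
    type: Recreate   # instead of RollingUpdate: the old Pod is stopped (releasing /nexus-data/lock) before the new one starts
  # ...rest of the Deployment spec unchanged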

To Reproduce
Steps to reproduce the behavior:

  1. Create a Nexus resource
  2. Modify the Nexus resource (e.g. mem requests; a patch sketch follows this list)
  3. The new Pod won't start, but keeps crashing.
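For step 2, one hypothetical way to trigger the change without re-applying the whole manifest (the resource name nexus, the field path, and the value are assumptions based on the CR shown later in this thread):

kubectl patch nexus nexus3 --type merge -p '{"spec":{"resources":{"requests":{"memory":"2560Mi"}}}}'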

Expected behavior
The Deployment is successfully rolled out.

Environment
OpenShift 4.6.5
Client Version: 4.4.30
Server Version: 4.6.5
Kubernetes Version: v1.19.0+9f84db3


bszeti added the bug 🐛 Something isn't working label Dec 1, 2020
ricardozanini (Member) commented

Hi @bszeti! Thanks for filing this bug. I'll take a look at it today. This one could be tricky, since we have some use cases where RollingUpdate would be a better fit. @LCaparelli any issues regarding the update scenario?

ricardozanini added this to the v0.5.0 milestone Dec 1, 2020
LCaparelli (Member) commented

Hey @bszeti, thanks for raising the issue. As @ricardozanini mentioned, the RollingUpdate option is a better fit for the way automatic updates are handled, so that your current deployment stays available while the update is still being rolled out.
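Roughly, the behavior being described maps to something like this on the generated Deployment (illustrative values; not necessarily what the operator actually sets):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep the old Pod serving until the new one is Ready
      maxSurge: 1         # allow one extra Pod during the rollout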

@bszeti can you please share how you installed the operator and which version you're running? Along with that, please also share the output from:

$ oc describe pod

Be sure to run it in the project where the failing pod is.

I'll try to replicate the issue, but so far here's what I have, installing v0.4.0 via OLM:

First, let's install OLM and the operator via OLM:

╰─ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
╰─ curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/install.sh | bash -s v0.17.0
customresourcedefinition.apiextensions.k8s.io/catalogsources.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/clusterserviceversions.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/installplans.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/operatorgroups.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/operators.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/subscriptions.operators.coreos.com created
customresourcedefinition.apiextensions.k8s.io/catalogsources.operators.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/clusterserviceversions.operators.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/installplans.operators.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/operatorgroups.operators.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/operators.operators.coreos.com condition met
customresourcedefinition.apiextensions.k8s.io/subscriptions.operators.coreos.com condition met
namespace/olm created
namespace/operators created
serviceaccount/olm-operator-serviceaccount created
clusterrole.rbac.authorization.k8s.io/system:controller:operator-lifecycle-manager created
clusterrolebinding.rbac.authorization.k8s.io/olm-operator-binding-olm created
deployment.apps/olm-operator created
deployment.apps/catalog-operator created
clusterrole.rbac.authorization.k8s.io/aggregate-olm-edit created
clusterrole.rbac.authorization.k8s.io/aggregate-olm-view created
operatorgroup.operators.coreos.com/global-operators created
operatorgroup.operators.coreos.com/olm-operators created
clusterserviceversion.operators.coreos.com/packageserver created
catalogsource.operators.coreos.com/operatorhubio-catalog created
Waiting for deployment "olm-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "olm-operator" successfully rolled out
Waiting for deployment "catalog-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "catalog-operator" successfully rolled out
Package server phase: Installing
Package server phase: Succeeded
deployment "packageserver" successfully rolled out
╰─ kubectl create -f https://operatorhub.io/install/nexus-operator-m88i.yaml
subscription.operators.coreos.com/my-nexus-operator-m88i created
╰─ kubectl get csv -n operators -w
NAME                    DISPLAY          VERSION   REPLACES                PHASE
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   Pending
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   InstallReady
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   InstallReady
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   Installing
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   Installing
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   Succeeded
nexus-operator.v0.4.0   Nexus Operator   0.4.0     nexus-operator.v0.3.0   Succeeded
^C%

So far so good. Let's instantiate a Nexus CR using the sample from OperatorHub:

╰─ echo "apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  networking:
    expose: false
  persistence:
    persistent: false
  replicas: 1
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
    requests:
      cpu: '1'
      memory: 2Gi
  useRedHatImage: false" | kubectl apply -f -
nexus.apps.m88i.io/nexus3 created
╰─ kubectl get pods -w
NAME                      READY   STATUS              RESTARTS   AGE
nexus3-558f8fcd68-dmh5c   0/1     ContainerCreating   0          6s
nexus3-558f8fcd68-dmh5c   0/1     Running             0          41s
nexus3-558f8fcd68-dmh5c   1/1     Running             0          4m46s
^C%

All good. Now let's change the memory limit to 2512 MiB:

╰─ echo "apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  networking:
    expose: false
  persistence:
    persistent: false
  replicas: 1
  resources:
    limits:
      cpu: '2'
      memory: 2512Mi # notice this changed 
    requests:
      cpu: '1'
      memory: 2Gi
  useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w
nexus.apps.m88i.io/nexus3 configured
NAME                      READY   STATUS    RESTARTS   AGE
nexus3-558f8fcd68-dmh5c   1/1     Running   0          6m19s
nexus3-6f7b5858b9-wv6wq   0/1     Pending   0          0s
nexus3-6f7b5858b9-wv6wq   0/1     Pending   0          0s
nexus3-6f7b5858b9-wv6wq   0/1     ContainerCreating   0          0s
nexus3-6f7b5858b9-wv6wq   0/1     Running             0          3s
nexus3-6f7b5858b9-wv6wq   1/1     Running             0          4m4s
nexus3-558f8fcd68-dmh5c   1/1     Terminating         0          10m
nexus3-558f8fcd68-dmh5c   0/1     Terminating         0          10m
nexus3-558f8fcd68-dmh5c   0/1     Terminating         0          10m
nexus3-558f8fcd68-dmh5c   0/1     Terminating         0          10m
^C%                                                                                                                                                                          
╰─ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
nexus3-6f7b5858b9-wv6wq   1/1     Running   0          4m37s

The new deployment faced no issues and the previous one was terminated as expected once the new one was available. Of course, this is not OpenShift. I can't test it on OpenShift myself, but we'll see if we can test it there as well. @ricardozanini @Kaitou786 if any of you could try replicating the issue on OCP that would be great :-)

ricardozanini (Member) commented

@LCaparelli I believe he also has a PV populated with some data. In that case, Nexus will lock the data directory, preventing the RollingUpdate.

If this is the case, we must change to "Recreate", since even for updates we won't be able to proceed with the data folder locked. Or, at the least, signal the server to unlock the data directory before performing the update.
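A quick way to confirm that the old Pod is the one holding the lock (pod names are placeholders):

kubectl exec <old-nexus-pod> -- ls -l /nexus-data/lock
kubectl logs <new-nexus-pod>   # the startup failure about the locked data directory should show up here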

LCaparelli (Member) commented

@ricardozanini So if I enable persistence I should run into this issue, right? Let me give that a swing

LCaparelli (Member) commented

Ah yes, indeed. I have reproduced the same issue. Simply using Recreate when persistence is enabled would do the trick, but it would bring availability issues for automatic updates with persistence. I'll give it some further thought; perhaps there's a way to deal with this without negative outcomes.

At the moment no action from you is requested @bszeti, thanks again for reporting it. :-)

Let's enable persistence and go back to 2 GiB limit:

╰─ echo "apiVersion: apps.m88i.io/v1alpha1
kind: Nexus     
metadata:
  name: nexus3
spec:
  networking:
    expose: false
  persistence:
    persistent: true 
  replicas: 1
  resources:
    limits:
      cpu: '2'
      memory: 2Gi                         
    requests:
      cpu: '1'
      memory: 2Gi
  useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w
nexus.apps.m88i.io/nexus3 configured
NAME                      READY   STATUS    RESTARTS   AGE
nexus3-6f7b5858b9-wv6wq   0/1     Running   1          30m
nexus3-7d7d54d566-5k96q   0/1     Pending   0          0s
nexus3-7d7d54d566-5k96q   0/1     Pending   0          1s
nexus3-7d7d54d566-5k96q   0/1     Pending   0          1s
nexus3-7d7d54d566-5k96q   0/1     ContainerCreating   0          2s
nexus3-7d7d54d566-5k96q   0/1     Running             0          19s
nexus3-6f7b5858b9-wv6wq   1/1     Running             1          33m
nexus3-7d7d54d566-5k96q   1/1     Running             0          4m19s
nexus3-6f7b5858b9-wv6wq   1/1     Terminating         1          35m
nexus3-6f7b5858b9-wv6wq   0/1     Terminating         1          35m
nexus3-6f7b5858b9-wv6wq   0/1     Terminating         1          35m
nexus3-6f7b5858b9-wv6wq   0/1     Terminating         1          35m
^C%
╰─ kubectl get pod
NAME                      READY   STATUS    RESTARTS   AGE
nexus3-7d7d54d566-5k96q   1/1     Running   0          8m                                                                               

And then change the memory limit:

╰─ echo "apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  networking:
    expose: false
  persistence:
    persistent: true
  replicas: 1
  resources:
    limits:
      cpu: '2'
      memory: 2512Mi # notice this changed
    requests:
      cpu: '1'
      memory: 2Gi
  useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w
nexus.apps.m88i.io/nexus3 configured
NAME                      READY   STATUS    RESTARTS   AGE
nexus3-7d7d54d566-5k96q   1/1     Running   0          8m28s
nexus3-54f857b97f-vqzrj   0/1     Pending   0          1s
nexus3-54f857b97f-vqzrj   0/1     Pending   0          1s
nexus3-54f857b97f-vqzrj   0/1     ContainerCreating   0          1s
nexus3-54f857b97f-vqzrj   0/1     Running             0          4s
nexus3-54f857b97f-vqzrj   0/1     Completed           0          6s
nexus3-54f857b97f-vqzrj   0/1     Running             1          7s
nexus3-54f857b97f-vqzrj   0/1     Completed           1          8s
nexus3-54f857b97f-vqzrj   0/1     CrashLoopBackOff    1          16s
^C%
╰─ kubectl describe pod nexus3-54f857b97f-vqzrj
Name:         nexus3-54f857b97f-vqzrj
Namespace:    default
Priority:     0
Node:         minikube/172.17.0.2
Start Time:   Tue, 01 Dec 2020 12:28:26 -0300
Labels:       app=nexus3
              pod-template-hash=54f857b97f
Annotations:  <none>
Status:       Running
IP:           172.18.0.8
IPs:
  IP:           172.18.0.8
Controlled By:  ReplicaSet/nexus3-54f857b97f
Containers:
  nexus-server:
    Container ID:   docker://5f230d740cf43a90c0f6d6c88d5b008f57d5c0c29230e127aa90304a9b6f2a5f
    Image:          docker.io/sonatype/nexus3:3.28.1
    Image ID:       docker-pullable://sonatype/nexus3@sha256:e788154207df95a86287fc8ecae1a5f789e74d0124f2dbda846fc4c769603bdb
    Port:           8081/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 01 Dec 2020 12:28:31 -0300
      Finished:     Tue, 01 Dec 2020 12:28:32 -0300
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     2
      memory:  2512Mi
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   http-get http://:8081/service/rest/v1/status delay=240s timeout=15s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8081/service/rest/v1/status delay=240s timeout=15s period=10s #success=1 #failure=3
    Environment:
      INSTALL4J_ADD_VM_PARAMS:  -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs -Dnexus.security.randompassword=false -XX:MaxDirectMemorySize=2635m -Xms2108m -Xmx2108m 
    Mounts:
      /nexus-data from nexus3-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nexus3-token-cktvh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nexus3-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nexus3
    ReadOnly:   false
  nexus3-token-cktvh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nexus3-token-cktvh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  <unknown>                             Successfully assigned default/nexus3-54f857b97f-vqzrj to minikube
  Normal   Pulled     21s (x2 over 24s)  kubelet, minikube  Container image "docker.io/sonatype/nexus3:3.28.1" already present on machine
  Normal   Created    21s (x2 over 24s)  kubelet, minikube  Created container nexus-server
  Normal   Started    21s (x2 over 24s)  kubelet, minikube  Started container nexus-server
  Warning  BackOff    11s (x2 over 19s)  kubelet, minikube  Back-off restarting failed container                 

bszeti (Author) commented Dec 1, 2020

Hi, Thanks for looking into this.

Yes, of course the issue only shows up if you use persistence. Nexus has a lock file, so the new Pod can't start while the old one is still running. (By the way, isn't this a problem if the number of Nexus replicas is greater than one?)

Install operator:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nexus
spec:
  targetNamespaces: 
  - nexus
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nexus-operator-m88i
spec:
  channel: alpha
  name: nexus-operator-m88i
  source: community-operators
  sourceNamespace: openshift-marketplace
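Both resources above omit metadata.namespace, so they were presumably applied into the target project, e.g. (filename is a placeholder):

oc apply -f operator-install.yaml -n nexus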

Install Nexus:

apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 2Gi
  useRedHatImage: true
  serverOperations:
    disableOperatorUserCreation: false
  imagePullPolicy: Always
  networking:
    expose: true
    exposeAs: Route
    tls:
      mandatory: true
  replicas: 1
  persistence:
    persistent: true
    volumeSize: 10Gi

ricardozanini (Member) commented

Hi @bszeti, yes, it's a problem. Only vertical scaling is supported at this time, until we implement #61.
I'll also take a look at the Nexus documentation to see if I can figure out another workaround for this problem.
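For context, the fix referenced in #192 below boils down to rendering the Deployment roughly like this (a sketch based on the commit message; the actual generated manifest may differ):

spec:
  replicas: 1        # capped at 1
  strategy:
    type: Recreate   # the old Pod is stopped before the new one starts, so the data-directory lock is released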

ricardozanini added a commit that referenced this issue Dec 9, 2020
Fix #191 - Changing deployment strategy to recreate and setting replicas to max 1

Signed-off-by: Ricardo Zanini <zanini@redhat.com>
LCaparelli added a commit that referenced this issue Dec 14, 2020
Fix #191 - Changing deployment strategy to recreate and setting replicas to max 1 (#192)

* Fix #191 - Changing deployment strategy to recreate and setting replicas to max 1

Signed-off-by: Ricardo Zanini <zanini@redhat.com>

* reverting back openapi gen files

Signed-off-by: Ricardo Zanini <zanini@redhat.com>

* Apply suggestions from code review

Co-authored-by: Lucas Caparelli <lucas.caparelli112@gmail.com>

* Move mutability to defaults

Signed-off-by: Ricardo Zanini <ricardozanini@gmail.com>

Co-authored-by: Lucas Caparelli <lucas.caparelli112@gmail.com>