increase default timeout in helm upgrade command #17
Comments
@jhamman I guess this is because the ...
@scottyhq - have you still been getting timeouts? I think things have actually stabilized...
We ran into this same error again with a new deployment (https://circleci.com/gh/pangeo-data/pangeo-cloud-federation/480). Our manual work-around is documented in pangeo-data/pangeo-cloud-federation#207.
Digging in a bit further this second time, it seems like the hub pod is stuck in Pending because of the resources available (we are running two hubs, staging and prod, on a single m5.large instance). Here are some relevant commands and output.
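A minimal sketch of the kind of kubectl diagnostics that show why a pod is stuck in Pending (the pod and node names here are hypothetical):

```bash
# Check the Events section for messages like
# "0/2 nodes are available: 2 Insufficient cpu".
kubectl describe pod hub-6d4f7b9c8-x2bqz -n nasa-prod

# Compare the node's allocatable resources against what is already requested
# by the pods scheduled on it.
kubectl describe node ip-192-168-1-100.us-west-2.compute.internal

# See which pods landed on which nodes.
kubectl get pods --all-namespaces -o wide
```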
Although it seems like our hub should fit on there based on the requested resources... https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L44
Also, the nasa-prod proxy pod (proxy-549fc7cb4d-qd9fm) went onto a user-notebook node...
OK, so the situation is that the hub tries to go onto the existing instance, which is in a different availability zone than its EBS volume, so it can't be scheduled there.
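A quick way to confirm a zone mismatch between nodes and EBS-backed volumes (using the pre-1.17 `failure-domain.beta.kubernetes.io/zone` label that clusters of this era carried):

```bash
# Zone label on each node.
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone

# Zone label on each PersistentVolume; an EBS volume is bound to one zone.
kubectl get pv -L failure-domain.beta.kubernetes.io/zone
```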
I'd recommend not putting the hub pod on the EFS drive, since SQLite and NFS don't mix very well. Not sure why EKS / kops is putting EBS volumes in a different zone than the k8s cluster :(
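One way to keep just the hub database off NFS while leaving user storage on EFS is to point the hub's database PVC at an EBS-backed storage class. A sketch, assuming a zero-to-jupyterhub-style chart with the standard `hub.db.pvc` settings and a `gp2` storage class (release and chart names are hypothetical):

```bash
# Keep the hub's SQLite database on an EBS-backed (gp2) volume even if
# the cluster's default storage class is EFS; user homes can stay on EFS.
helm upgrade nasa-prod jupyterhub/jupyterhub \
  --set hub.db.type=sqlite-pvc \
  --set hub.db.pvc.storageClassName=gp2
```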
Thanks for the feedback @yuvipanda! We are currently running staging.nasa.pangeo.io entirely on EFS, so we'll see if any issues with the hub database come up... I ended up setting up our 'default' provisioner to use https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs (instead of EBS/gp2). It seems like EBS volumes landing in different zones is a well-known issue. Since EKS recently added Kubernetes >1.12, we can now use "topology-aware volume provisioning" if need be (see the sketch below).
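Topology-aware provisioning boils down to a StorageClass with `volumeBindingMode: WaitForFirstConsumer`, which delays EBS volume creation until the consuming pod has been scheduled to a node, so the volume is provisioned in that node's zone. A minimal sketch (the storage class name is hypothetical):

```bash
# A StorageClass that defers EBS volume creation until the consuming pod
# is scheduled, so the volume ends up in the pod's availability zone.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
EOF
```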
@scottyhq cool! With SQLite on NFS, the thing to watch out for is times when hub response times spike, often to something like 1-5 seconds per request, cascading pretty badly pretty quickly. This hit us at Berkeley earlier, and the 'fix' was to move off shared storage. Otherwise, for the kind of workload we have, it works fine.
In one of our jupyterhub deployments (running on EKS), we've been getting semi-regular `helm upgrade` timeouts when using hubploy. Would it be possible to add the `--timeout` option to hubploy's upgrade call? I believe the default is 300s.
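For context, a sketch of what adding the flag to the underlying call could look like (release and chart names are hypothetical; Helm 2 took the timeout as integer seconds, whereas Helm 3 takes a duration string like `10m`):

```bash
# Helm 2 style: timeout in seconds (default 300).
helm upgrade --install --wait --timeout 600 nasa-prod jupyterhub/jupyterhub

# Helm 3 style, for comparison: timeout as a duration string.
# helm upgrade --install --wait --timeout 10m nasa-prod jupyterhub/jupyterhub
```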