Bug 1589015 - Fail to upgrade web-console due to old terminating pod prevent new pod created during upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Samuel Padgett
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-08 08:27 UTC by liujia
Modified: 2018-07-23 13:09 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-18 13:05:47 UTC
Target Upstream Version:


Attachments
webconsole-pod (4.08 KB, text/plain)
2018-06-12 07:15 UTC, liujia

Description liujia 2018-06-08 08:27:12 UTC
Description of problem:
Upgrade against OCP deployed on Atomic hosts (v3.9 to v3.10) failed at the task:
TASK [openshift_web_console : Verify that the console is running] **************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/start.yml:2
Friday 08 June 2018  05:13:57 +0000 (0:00:00.059)       0:40:38.233 *********** 
FAILED - RETRYING: Verify that the console is running (60 retries left).
...
FAILED - RETRYING: Verify that the console is running (1 retries left).
fatal: [x]: FAILED! => {"attempts": 60, "changed": false, "failed": true, "results": {"cmd": "/usr/local/bin/oc get deployment webconsole -o json -n openshift-web-console", "results": [{"apiVersion": "extensions/v1beta1", "kind": "Deployment", "metadata": {"annotations": {"deployment.kubernetes.io/revision": "1", "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps/v1beta1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"openshift-web-console\",\"webconsole\":\"true\"},\"name\":\"webconsole\",\"namespace\":\"openshift-web-console\"},\"spec\":{\"replicas\":3,\"strategy\":{\"type\":\"Recreate\"},\"template\":{\"metadata\":{\"labels\":{\"app\":\"openshift-web-console\",\"webconsole\":\"true\"},\"name\":\"webconsole\"},\"spec\":{\"containers\":[{\"command\":[\"/usr/bin/origin-web-console\",\"--audit-log-path=-\",\"-v=0\",\"--config=/var/webconsole-config/webconsole-config.yaml\"],\"image\":\"registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10\",\"imagePullPolicy\":\"IfNotPresent\",\"livenessProbe\":{\"exec\":{\"command\":[\"/bin/sh\",\"-c\",\"if [[ ! 
-f /tmp/webconsole-config.hash ]]; then \\\\\\n  md5sum /var/webconsole-config/webconsole-config.yaml \\u003e /tmp/webconsole-config.hash; \\\\\\nelif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \\\\\\n  echo 'webconsole-config.yaml has changed.'; \\\\\\n  exit 1; \\\\\\nfi \\u0026\\u0026 curl -k -f https://0.0.0.0:8443/console/\"]}},\"name\":\"webconsole\",\"ports\":[{\"containerPort\":8443}],\"readinessProbe\":{\"httpGet\":{\"path\":\"/healthz\",\"port\":8443,\"scheme\":\"HTTPS\"}},\"resources\":{\"requests\":{\"cpu\":\"100m\",\"memory\":\"100Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/var/serving-cert\",\"name\":\"serving-cert\"},{\"mountPath\":\"/var/webconsole-config\",\"name\":\"webconsole-config\"}]}],\"nodeSelector\":{\"node-role.kubernetes.io/master\":\"true\"},\"serviceAccountName\":\"webconsole\",\"volumes\":[{\"name\":\"serving-cert\",\"secret\":{\"defaultMode\":288,\"secretName\":\"webconsole-serving-cert\"}},{\"configMap\":{\"defaultMode\":288,\"name\":\"webconsole-config\"},\"name\":\"webconsole-config\"}]}}}}\n"}, "creationTimestamp": "2018-06-08T02:34:10Z", "generation": 2, "labels": {"app": "openshift-web-console", "webconsole": "true"}, "name": "webconsole", "namespace": "openshift-web-console", "resourceVersion": "41469", "selfLink": "/apis/extensions/v1beta1/namespaces/openshift-web-console/deployments/webconsole", "uid": "6c846bd2-6ac4-11e8-8cee-42010af00024"}, "spec": {"progressDeadlineSeconds": 600, "replicas": 3, "revisionHistoryLimit": 2, "selector": {"matchLabels": {"app": "openshift-web-console", "webconsole": "true"}}, "strategy": {"type": "Recreate"}, "template": {"metadata": {"creationTimestamp": null, "labels": {"app": "openshift-web-console", "webconsole": "true"}, "name": "webconsole"}, "spec": {"containers": [{"command": ["/usr/bin/origin-web-console", "--audit-log-path=-", "-v=0", "--config=/var/webconsole-config/webconsole-config.yaml"], "image": 
"registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10", "imagePullPolicy": "IfNotPresent", "livenessProbe": {"exec": {"command": ["/bin/sh", "-c", "if [[ ! -f /tmp/webconsole-config.hash ]]; then \\\n  md5sum /var/webconsole-config/webconsole-config.yaml > /tmp/webconsole-config.hash; \\\nelif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \\\n  echo 'webconsole-config.yaml has changed.'; \\\n  exit 1; \\\nfi && curl -k -f https://0.0.0.0:8443/console/"]}, "failureThreshold": 3, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 1}, "name": "webconsole", "ports": [{"containerPort": 8443, "protocol": "TCP"}], "readinessProbe": {"failureThreshold": 3, "httpGet": {"path": "/healthz", "port": 8443, "scheme": "HTTPS"}, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 1}, "resources": {"requests": {"cpu": "100m", "memory": "100Mi"}}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/var/serving-cert", "name": "serving-cert"}, {"mountPath": "/var/webconsole-config", "name": "webconsole-config"}]}], "dnsPolicy": "ClusterFirst", "nodeSelector": {"node-role.kubernetes.io/master": "true"}, "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "webconsole", "serviceAccountName": "webconsole", "terminationGracePeriodSeconds": 30, "volumes": [{"name": "serving-cert", "secret": {"defaultMode": 288, "secretName": "webconsole-serving-cert"}}, {"configMap": {"defaultMode": 288, "name": "webconsole-config"}, "name": "webconsole-config"}]}}}, "status": {"conditions": [{"lastTransitionTime": "2018-06-08T02:34:10Z", "lastUpdateTime": "2018-06-08T02:34:28Z", "message": "ReplicaSet \"webconsole-675478f597\" has successfully progressed.", "reason": "NewReplicaSetAvailable", "status": "True", "type": "Progressing"}, {"lastTransitionTime": "2018-06-08T05:16:22Z", 
"lastUpdateTime": "2018-06-08T05:16:22Z", "message": "Deployment does not have minimum availability.", "reason": "MinimumReplicasUnavailable", "status": "False", "type": "Available"}], "observedGeneration": 2}}], "returncode": 0}, "state": "list"}
...ignoring

============
Although the deployment was updated, the new v3.10 webconsole pod was not deployed successfully because the old v3.9 webconsole pod remained stuck in Terminating status.

# oc get pod
NAME                          READY     STATUS        RESTARTS   AGE
webconsole-675478f597-zq6cn   0/1       Terminating   1          5h

# oc describe pod webconsole-675478f597-zq6cn|grep Image
    Image:         registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.30
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-web-console@sha256:2a8f2fb074ef7517a9230f3dd7603f217168d055ff3709859817c66f59afb7a2

# oc describe deployment webconsole|grep Image
    Image:      registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10

# docker images|grep web
registry.reg-aws.openshift.com:443/openshift3/ose-web-console       v3.9.30             873584bc0826        8 days ago          466 MB

Version-Release number of the following components:
openshift-ansible-3.10.0-0.64.0.git.20.48df973.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. HA install OCP v3.9 on Atomic hosts.
2. Run the upgrade against the above cluster.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Upgrade log in https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/AtomicOpenshiftUpdate/524/consoleFull

Comment 1 Samuel Padgett 2018-06-08 12:54:51 UTC
It looks like the underlying cause is this error:

```
        "2h          2h           3         webconsole-675478f597-9xftl.1536110203a6e4e6   Pod                                     Warning   FailedKillPod            kubelet, qe-zitang-39hamaster-etcd-zone2-1   error killing pod: failed to \"KillPodSandbox\" for \"6d2b9352-6ac4-11e8-9c35-42010af00025\" with KillPodSandboxError: \"rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \\\"webconsole-675478f597-9xftl_openshift-web-console\\\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused\"", 
```

The drawback of the Recreate deployment strategy is that any problem terminating a pod completely blocks the rollout.

We might want to allow more time after the masters are upgraded before trying to roll out the new console deployment. We've seen several network-related failures because the masters have just been upgraded and are still coming back up.
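For reference, the strategy in question is visible in the deployment JSON above; a minimal sketch of the relevant spec fields (API version as used in this release):

```yaml
# With the Recreate strategy, all old pods must fully terminate before
# any new pod is created, so a single pod stuck in Terminating blocks
# the rollout indefinitely.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: webconsole
  namespace: openshift-web-console
spec:
  replicas: 3
  strategy:
    type: Recreate
```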

Seth, Tomas -- Have you guys seen issues like this? Any ideas on how to work around it?

Comment 3 openshift-github-bot 2018-06-08 21:20:44 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/3cbb898fd97e8a5bfcefa822c6182d2da3c5d5d4
Bug 1589015 - Switch to rolling deployment for web console

Switch to a RollingUpdate deployment strategy for the web console. This
avoids the rollout getting blocked when a pod doesn't terminate.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1589015

https://github.com/openshift/openshift-ansible/commit/1698cb3204b54d01ef9ccd70eb6e0ba37bc5df04
Merge pull request #8694 from spadgett/console-rolling-deployment

Bug 1589015 - Switch to rolling deployment for web console

Comment 4 Tomáš Nožička 2018-06-11 10:43:29 UTC
> Seth, Tomas -- Have you guys seen issues like this? Any ideas on how to work around it?

What's the Status.Phase of the pod in question? If it's still Running or Unknown, the Deployment can't launch new Pods without breaking its availability guarantee. With a DC we time out, but I am not sure whether that's for better or worse.

Is this really a reason to switch strategy? (Although that's probably fine.) The API object has been updated, so the installation went fine. If we wait for the new version to come up and it doesn't because of the container runtime, I don't see much difference between failing to launch a new pod and failing to delete the old one, so I wouldn't consider this related to a particular strategy.

Comment 5 Samuel Padgett 2018-06-11 13:10:00 UTC
Tomas, the pod was stuck in Terminating with the error from comment #1.

> Is this really a reason to switch strategy?

In my testing, RollingUpdate with maxUnavailable: 100% seems to work better for us even ignoring this problem. It behaves essentially like Recreate without waiting for the pods to terminate, which is exactly what we want. The downtime blip for the deployment is much shorter, and there's no real cleanup we do on termination since the console is essentially just a file server. (Admittedly, RollingUpdate and maxUnavailable: 100% is a bit of a hack.)
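A sketch of the configuration described here (only the value of maxUnavailable is stated in this comment; the exact fields used by the fix are in the commit linked in comment 3):

```yaml
# RollingUpdate with maxUnavailable: 100% behaves much like Recreate,
# except new pods are created without waiting for old pods to finish
# terminating, so a stuck Terminating pod no longer blocks the rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 100%
```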

What we really want is a blue/green deployment, though.

If we switch to routes instead of a proxy, we could just RollingUpdate without the maxUnavailable hack since we could have sticky sessions. Right now the master proxy that the console uses doesn't give us session affinity.

The worry I have with Recreate is that if any pod is stuck in Terminating, the rollout *never* progresses. You're stuck forever. We've seen this a couple of times now, first with evicted pods and now with this error.

Comment 6 Tomáš Nožička 2018-06-11 13:23:36 UTC
> Tomas, the pod was stuck in Terminating with the error from comment #1.

Terminating is not a phase, it just means that the pod has a deletion timestamp set. I am looking for Status.Phase.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase

Comment 7 Samuel Padgett 2018-06-11 13:32:25 UTC
Sorry, you're right. The only info is what I have in the log.

https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/AtomicOpenshiftUpdate/524/consoleFull

jiajliu@redhat.com - Any chance the pod still exists so we can get the YAML?

Comment 8 Tomáš Nožička 2018-06-12 06:55:37 UTC
jiajliu@redhat.com - please dump the yaml for that pod if you get to encounter it again.


> In my testing, RollingUpdate with maxUnavailable: 100% seems to work better for us even ignoring this problem. It behaves essentially like Recreate without waiting for the pods to terminate, which is exactly what we want. The downtime blip for the deployment is much shorter, and there's no real cleanup we do on termination since the console is essentially just a file server. (Admittedly, RollingUpdate and maxUnavailable: 100% is a bit of a hack.)

Isn't the issue with this approach that it doesn't wait for the old pod to terminate? If two of them are running at the same time, the console can load parts of its source code from different versions. It likely manages to terminate the old one before the new one comes up, but relying on that seems racy.

> What we really want is a blue/green deployment, though.

Ansible can create a second deployment and switch the Service's selector afterwards. Or an operator can, once we get one.
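A minimal sketch of this blue/green idea, assuming a hypothetical `deployment` label distinguishing the two web console deployments (the label key and values are illustrative, not from the actual manifests):

```yaml
# Hypothetical blue/green cutover: two deployments carry distinct
# "deployment" labels, and switching traffic is just repointing the
# Service selector once the new (green) pods are Ready.
apiVersion: v1
kind: Service
metadata:
  name: webconsole
  namespace: openshift-web-console
spec:
  selector:
    app: openshift-web-console
    deployment: green   # was "blue"; flip after green pods are Ready
  ports:
  - port: 443
    targetPort: 8443
```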

> If we switch to routes instead of a proxy, we could just RollingUpdate without the maxUnavailable hack since we could have sticky sessions. Right now the master proxy that the console uses doesn't give us session affinity.

+1; until then, I think stickiness could also be hacked by making, e.g., the git commit part of the requested URL, but you'd probably need to retry.

> The worry I have with Recreate is that if any  pod is stuck in Terminating, the rollout *never* progresses. You're forever stuck.

Well, it means the container runtime is broken. My point was that optimizing for container runtime breakage seems pointless; we should fix the runtime. A broken runtime will break StatefulSets and tons of other things, or it can break in a way where it won't even run pods. There's no way to optimize for that.

> We've seen this a couple times now with the evicted pods and now this error.

Evicted pods were a Deployments bug, fixed now.

Comment 9 liujia 2018-06-12 07:15:22 UTC
Created attachment 1450346 [details]
webconsole-pod

Comment 10 Tomáš Nožička 2018-06-12 07:37:30 UTC
Sorry, I should have been clearer - we were talking about the 3.9 pod stuck in terminating (webconsole-675478f597-zq6cn) - this one was 3.10 (interestingly in Pending state)

Comment 11 liujia 2018-06-12 07:45:05 UTC
(In reply to Tomáš Nožička from comment #10)
> Sorry, I should have been clearer - we were talking about the 3.9 pod stuck
> in terminating (webconsole-675478f597-zq6cn) - this one was 3.10
> (interestingly in Pending state)

Ok. I will dump the YAML file when I hit the issue again. This YAML file was dumped from another web console upgrade failure. I thought they were the same issue.

Comment 12 Samuel Padgett 2018-06-12 12:52:05 UTC
> Isn't the issue with this approach the fact that it doesn't wait for the old pod to terminate?

My understanding is that the pod is removed from the endpoints list for its services as soon as the deletion timestamp is set. The worst case is that the console doesn't load due to conflicting JS, which is much the same as getting a 503 with a Recreate deployment, but I don't think it would happen.

This is closer to the behavior we want regardless, even ignoring this particular failure. The new pods start almost immediately when they don't have to wait for the old pods to terminate, so instead of 30 seconds of downtime, there's just a small blip.

> Ansible can create a second deployment, and switch the Sevice's selector afterwards. Or an operator when we get it.

When this was discussed during the group 2 architecture call, we were asked not to do that and instead push for the features we need in the platform.

> Well, it means the container runtime is broken.

Does it always? Even if that's true on one node, couldn't the pod start on another?

> Evicted pods were a Deployments bug, fixed now.

Understood, and I agree we should try to fix the root problems we find.

Comment 14 liujia 2018-06-15 09:08:07 UTC
Verified on openshift-ansible-3.10.0-0.69.0.git.127.3ca07e5.el7.noarch

