Description of problem:
Upgrade against OCP deployed on atomic hosts (v3.9 -> v3.10) failed at the task below:

```
TASK [openshift_web_console : Verify that the console is running] **************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/start.yml:2
Friday 08 June 2018  05:13:57 +0000 (0:00:00.059)       0:40:38.233 ***********
FAILED - RETRYING: Verify that the console is running (60 retries left).
...
FAILED - RETRYING: Verify that the console is running (1 retries left).
fatal: [x]: FAILED! => {"attempts": 60, "changed": false, "failed": true, "results": {"cmd": "/usr/local/bin/oc get deployment webconsole -o json -n openshift-web-console", "results": [{"apiVersion": "extensions/v1beta1", "kind": "Deployment", "metadata": {"annotations": {"deployment.kubernetes.io/revision": "1", "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps/v1beta1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"openshift-web-console\",\"webconsole\":\"true\"},\"name\":\"webconsole\",\"namespace\":\"openshift-web-console\"},\"spec\":{\"replicas\":3,\"strategy\":{\"type\":\"Recreate\"},\"template\":{\"metadata\":{\"labels\":{\"app\":\"openshift-web-console\",\"webconsole\":\"true\"},\"name\":\"webconsole\"},\"spec\":{\"containers\":[{\"command\":[\"/usr/bin/origin-web-console\",\"--audit-log-path=-\",\"-v=0\",\"--config=/var/webconsole-config/webconsole-config.yaml\"],\"image\":\"registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10\",\"imagePullPolicy\":\"IfNotPresent\",\"livenessProbe\":{\"exec\":{\"command\":[\"/bin/sh\",\"-c\",\"if [[ ! -f /tmp/webconsole-config.hash ]]; then \\\\\\n md5sum /var/webconsole-config/webconsole-config.yaml \\u003e /tmp/webconsole-config.hash; \\\\\\nelif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \\\\\\n echo 'webconsole-config.yaml has changed.'; \\\\\\n exit 1; \\\\\\nfi \\u0026\\u0026 curl -k -f https://0.0.0.0:8443/console/\"]}},\"name\":\"webconsole\",\"ports\":[{\"containerPort\":8443}],\"readinessProbe\":{\"httpGet\":{\"path\":\"/healthz\",\"port\":8443,\"scheme\":\"HTTPS\"}},\"resources\":{\"requests\":{\"cpu\":\"100m\",\"memory\":\"100Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/var/serving-cert\",\"name\":\"serving-cert\"},{\"mountPath\":\"/var/webconsole-config\",\"name\":\"webconsole-config\"}]}],\"nodeSelector\":{\"node-role.kubernetes.io/master\":\"true\"},\"serviceAccountName\":\"webconsole\",\"volumes\":[{\"name\":\"serving-cert\",\"secret\":{\"defaultMode\":288,\"secretName\":\"webconsole-serving-cert\"}},{\"configMap\":{\"defaultMode\":288,\"name\":\"webconsole-config\"},\"name\":\"webconsole-config\"}]}}}}\n"}, "creationTimestamp": "2018-06-08T02:34:10Z", "generation": 2, "labels": {"app": "openshift-web-console", "webconsole": "true"}, "name": "webconsole", "namespace": "openshift-web-console", "resourceVersion": "41469", "selfLink": "/apis/extensions/v1beta1/namespaces/openshift-web-console/deployments/webconsole", "uid": "6c846bd2-6ac4-11e8-8cee-42010af00024"}, "spec": {"progressDeadlineSeconds": 600, "replicas": 3, "revisionHistoryLimit": 2, "selector": {"matchLabels": {"app": "openshift-web-console", "webconsole": "true"}}, "strategy": {"type": "Recreate"}, "template": {"metadata": {"creationTimestamp": null, "labels": {"app": "openshift-web-console", "webconsole": "true"}, "name": "webconsole"}, "spec": {"containers": [{"command": ["/usr/bin/origin-web-console", "--audit-log-path=-", "-v=0", "--config=/var/webconsole-config/webconsole-config.yaml"], "image": "registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10", "imagePullPolicy": "IfNotPresent", "livenessProbe": {"exec": {"command": ["/bin/sh", "-c", "if [[ ! -f /tmp/webconsole-config.hash ]]; then \\\n md5sum /var/webconsole-config/webconsole-config.yaml > /tmp/webconsole-config.hash; \\\nelif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \\\n echo 'webconsole-config.yaml has changed.'; \\\n exit 1; \\\nfi && curl -k -f https://0.0.0.0:8443/console/"]}, "failureThreshold": 3, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 1}, "name": "webconsole", "ports": [{"containerPort": 8443, "protocol": "TCP"}], "readinessProbe": {"failureThreshold": 3, "httpGet": {"path": "/healthz", "port": 8443, "scheme": "HTTPS"}, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 1}, "resources": {"requests": {"cpu": "100m", "memory": "100Mi"}}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/var/serving-cert", "name": "serving-cert"}, {"mountPath": "/var/webconsole-config", "name": "webconsole-config"}]}], "dnsPolicy": "ClusterFirst", "nodeSelector": {"node-role.kubernetes.io/master": "true"}, "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "webconsole", "serviceAccountName": "webconsole", "terminationGracePeriodSeconds": 30, "volumes": [{"name": "serving-cert", "secret": {"defaultMode": 288, "secretName": "webconsole-serving-cert"}}, {"configMap": {"defaultMode": 288, "name": "webconsole-config"}, "name": "webconsole-config"}]}}}, "status": {"conditions": [{"lastTransitionTime": "2018-06-08T02:34:10Z", "lastUpdateTime": "2018-06-08T02:34:28Z", "message": "ReplicaSet \"webconsole-675478f597\" has successfully progressed.", "reason": "NewReplicaSetAvailable", "status": "True", "type": "Progressing"}, {"lastTransitionTime": "2018-06-08T05:16:22Z", "lastUpdateTime": "2018-06-08T05:16:22Z", "message": "Deployment does not have minimum availability.", "reason": "MinimumReplicasUnavailable", "status": "False", "type": "Available"}], "observedGeneration": 2}}], "returncode": 0}, "state": "list"}
...ignoring
```

Although the deployment object was updated, the new v3.10 webconsole pod was not deployed successfully because the old v3.9 webconsole pod stayed in Terminating status:

```
# oc get pod
NAME                          READY     STATUS        RESTARTS   AGE
webconsole-675478f597-zq6cn   0/1       Terminating   1          5h

# oc describe pod webconsole-675478f597-zq6cn | grep Image
Image:         registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.30
Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-web-console@sha256:2a8f2fb074ef7517a9230f3dd7603f217168d055ff3709859817c66f59afb7a2

# oc describe deployment webconsole | grep Image
Image:         registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.10

# docker images | grep web
registry.reg-aws.openshift.com:443/openshift3/ose-web-console   v3.9.30   873584bc0826   8 days ago   466 MB
```

Version-Release number of the following components:
openshift-ansible-3.10.0-0.64.0.git.20.48df973.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. HA install OCP v3.9 on atomic hosts.
2. Run an upgrade against the above OCP.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Upgrade log in https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/AtomicOpenshiftUpdate/524/consoleFull
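For readability, the liveness-probe script embedded (escaped) in the deployment JSON above amounts to the following sketch. The config path, hash path, and health check are parameterized here so it can run outside a pod; in the pod they are /var/webconsole-config/webconsole-config.yaml, /tmp/webconsole-config.hash, and `curl -k -f https://0.0.0.0:8443/console/`.

```shell
#!/bin/sh
# Sketch of the webconsole liveness probe: record an md5 of the config on the
# first probe run, then fail the probe whenever the config file changes, so the
# kubelet restarts the container with the new config.
probe() {
    if [ ! -f "$HASH" ]; then
        # First run: record the config checksum.
        md5sum "$CONFIG" > "$HASH"
    elif [ "$(md5sum "$CONFIG")" != "$(cat "$HASH")" ]; then
        # Config changed under a running pod: fail the probe.
        echo 'webconsole-config.yaml has changed.'
        return 1   # the real probe uses `exit 1`
    fi && $HEALTHCHECK
}

# Self-contained demo with a scratch config and a stubbed health check.
tmp=$(mktemp -d)
CONFIG="$tmp/webconsole-config.yaml" HASH="$tmp/config.hash" HEALTHCHECK=true

echo 'logoutURL: ""' > "$CONFIG"
probe && echo "probe 1: ok"                        # records the hash, passes

echo 'logoutURL: "/bye"' > "$CONFIG"
probe || echo "probe 2: failed (config changed)"   # hash mismatch, fails
```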
It looks like the underlying cause is this error:

```
"2h 2h 3 webconsole-675478f597-9xftl.1536110203a6e4e6 Pod Warning FailedKillPod kubelet, qe-zitang-39hamaster-etcd-zone2-1 error killing pod: failed to \"KillPodSandbox\" for \"6d2b9352-6ac4-11e8-9c35-42010af00025\" with KillPodSandboxError: \"rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \\\"webconsole-675478f597-9xftl_openshift-web-console\\\" network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: connection refused\"",
```

The drawback of Recreate deployments is that any problem terminating a pod completely blocks the rollout. We might want to allow more time after the masters are upgraded before trying to roll out the new console deployment; we've seen several network-related failures while the just-upgraded masters are coming back up.

Seth, Tomas -- have you seen issues like this? Any ideas on how to work around it?
https://github.com/openshift/openshift-ansible/pull/8694
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/3cbb898fd97e8a5bfcefa822c6182d2da3c5d5d4
Bug 1589015 - Switch to rolling deployment for web console

Switch to a RollingUpdate deployment strategy for the web console. This avoids the rollout getting blocked when a pod doesn't terminate.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1589015

https://github.com/openshift/openshift-ansible/commit/1698cb3204b54d01ef9ccd70eb6e0ba37bc5df04
Merge pull request #8694 from spadgett/console-rolling-deployment

Bug 1589015 - Switch to rolling deployment for web console
> Seth, Tomas -- Have you guys seen issues like this? Any ideas on how to work around it?

What's the Status.Phase of the pod in question? If it's still Running or Unknown, the Deployment can't launch new Pods, so as not to break its guarantees. With a DC we time out, but I'm not sure whether that's for better or worse.

Is this really a reason to switch strategy? (Although that's probably fine.) The API object has been updated, so the installation went fine. And if we wait for the new version to come up and it doesn't because of the container runtime, I don't quite see the difference between failing to launch a new Pod and failing to delete the old one, so I wouldn't consider this related to a particular strategy.
Tomas, the pod was stuck in Terminating with the error from comment #1.

> Is this really a reason to switch strategy?

In my testing, RollingUpdate with maxUnavailable: 100% seems to work better for us even ignoring this problem. It behaves essentially like Recreate without waiting for the pods to terminate, which is exactly what we want. The downtime blip for the deployment is much shorter, and there's no real cleanup we do on termination since the console is essentially just a file server. (Admittedly, RollingUpdate with maxUnavailable: 100% is a bit of a hack.)

What we really want is a blue/green deployment, though. If we switch to routes instead of a proxy, we could just use RollingUpdate without the maxUnavailable hack since we could have sticky sessions. Right now the master proxy that the console uses doesn't give us session affinity.

The worry I have with Recreate is that if any pod is stuck in Terminating, the rollout *never* progresses. You're forever stuck. We've seen this a couple of times now, first with the evicted pods and now with this error.
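For concreteness, the difference under discussion is confined to the strategy stanza of the Deployment spec. A sketch (the field names are real Deployment API fields; whether the eventual fix sets any other rolling-update fields is not shown here):

```yaml
# Current strategy (from the deployment JSON in the bug description):
# delete all old pods, wait for them to terminate, then create new ones.
spec:
  strategy:
    type: Recreate

# Discussed alternative: start new pods immediately instead of waiting
# for the old ones to finish terminating.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 100%   # allow all old pods to go away at once
```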
> Tomas, the pod was stuck in Terminating with the error from comment #1.

Terminating is not a phase; it just means that the pod has a deletion timestamp set. I am looking for Status.Phase:

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
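To illustrate the distinction: against a live cluster, the phase could be read directly with something like `oc get pod webconsole-675478f597-zq6cn -n openshift-web-console -o jsonpath='{.status.phase} {.metadata.deletionTimestamp}'`. A minimal offline sketch on a trimmed pod manifest (values hypothetical) showing that a pod can render as "Terminating" while its phase is still Running:

```shell
#!/bin/sh
# A trimmed pod manifest as `oc get pod -o json` would return it.
# deletionTimestamp set + phase Running is what oc displays as "Terminating".
cat > /tmp/pod.json <<'EOF'
{
  "metadata": {
    "name": "webconsole-675478f597-zq6cn",
    "deletionTimestamp": "2018-06-08T05:10:00Z"
  },
  "status": { "phase": "Running" }
}
EOF

# The authoritative phase lives in status.phase:
grep -o '"phase": "[A-Za-z]*"' /tmp/pod.json
# ...while "Terminating" only reflects the presence of deletionTimestamp:
grep -c '"deletionTimestamp"' /tmp/pod.json
```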
Sorry, you're right. The only info is what I have in the log:

https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/AtomicOpenshiftUpdate/524/consoleFull

jiajliu - Any chance the pod still exists so we can get the YAML?
jiajliu - please dump the YAML for that pod if you encounter it again.

> In my testing, RollingUpdate with maxUnavailable: 100% seems to work better for us even ignoring this problem. It behaves essentially like Recreate without waiting for the pods to terminate, which is exactly what we want. The downtime blip for the deployment is much shorter, and there's no real cleanup we do on termination since the console is essentially just a file server. (Admittedly, RollingUpdate and maxUnavailable: 100% is a bit of a hack.)

Isn't the issue with this approach the fact that it doesn't wait for the old pod to terminate? If there are two of them running at the same time, the console can load parts of its source code from different versions. It likely manages to terminate the old one before the new one comes up, but relying on that seems racy.

> What we really want is a blue/green deployment, though.

Ansible can create a second deployment and switch the Service's selector afterwards. Or an operator can, once we get one.

> If we switch to routes instead of a proxy, we could just RollingUpdate without the maxUnavailable hack since we could have sticky sessions. Right now the master proxy that the console uses doesn't give us session affinity.

+1; until then, I think stickiness can also be hacked by making e.g. the git commit part of the requested URL, but you'd probably need to retry.

> The worry I have with Recreate is that if any pod is stuck in Terminating, the rollout *never* progresses. You're forever stuck.

Well, it means the container runtime is broken. My point was that optimizing for container runtime breakage seems pointless. We should fix the runtime. A broken runtime will break StatefulSets and tons of other stuff, or it can break in a way that it won't even run pods. There's no way to optimize for that.

> We've seen this a couple times now with the evicted pods and now this error.

The evicted pods were a Deployments bug, fixed now.
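The selector-switch idea can be sketched like this (the version label and its values are hypothetical; this is not what the installer does today):

```yaml
# The installer would keep two deployments, "blue" (old) and "green" (new),
# distinguished by a version label on their pod templates. The Service points
# at one of them at a time:
apiVersion: v1
kind: Service
metadata:
  name: webconsole
  namespace: openshift-web-console
spec:
  selector:
    app: openshift-web-console
    webconsole: "true"
    deployment-color: blue   # hypothetical label; flip to "green" once the new
                             # deployment's pods are ready, then delete "blue"
  ports:
  - port: 443
    targetPort: 8443
```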
Created attachment 1450346 [details]
webconsole-pod
Sorry, I should have been clearer - we were talking about the 3.9 pod stuck in terminating (webconsole-675478f597-zq6cn) - this one was 3.10 (interestingly in Pending state)
(In reply to Tomáš Nožička from comment #10)
> Sorry, I should have been clearer - we were talking about the 3.9 pod stuck
> in terminating (webconsole-675478f597-zq6cn) - this one was 3.10
> (interestingly in Pending state)

OK, I will dump the YAML file when I hit the issue again. This YAML file was dumped from another failure of the webconsole upgrade; I thought they were the same.
> Isn't the issue with this approach the fact that it doesn't wait for the old pod to terminate?

My understanding is that the pod is removed from the endpoints list for services right away when the deletion timestamp is set. The worst case is that the console doesn't load due to conflicting JS, which is pretty much the same as getting a 503 Unavailable with a Recreate deployment, but I don't think it would happen. This is closer to the behavior we want regardless, even ignoring this particular failure: the new pods start almost immediately when they don't have to wait for the old pods to terminate, so instead of 30s of downtime there's just a small blip.

> Ansible can create a second deployment and switch the Service's selector afterwards. Or an operator can, once we get one.

When this was discussed during the group 2 architecture call, we were asked not to do that and instead push for the features we need in the platform.

> Well, it means the container runtime is broken.

Does it always? Even if that's true on one node, couldn't the pod start on another?

> The evicted pods were a Deployments bug, fixed now.

Understood, and I agree we should try to fix the root problems we find.
Verified on openshift-ansible-3.10.0-0.69.0.git.127.3ca07e5.el7.noarch