Description of problem:
upgrade failed at TASK [etcd : Verify cluster is healthy pre-upgrade] likely caused by ovs pods in terminating status

Version-Release number of the following components:
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch

How reproducible:
50%

Steps to Reproduce:
1. upgrade OCP v3.10 HA cluster to v3.11

Actual results:
TASK [etcd : Verify cluster is healthy pre-upgrade] ****************************
task path: /home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/etcd/tasks/upgrade_static.yml:6
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> ESTABLISH SSH CONNECTION FOR USER: root
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave2/.ansible/cp/%C upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> (1, '\n{"changed": true, "end": "2018-08-28 06:36:08.744274", "stdout": "cluster may be unhealthy: failed to list members", "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://upgrade3-wmeng-master-etcd-1:2379", "cluster-health"], "failed": true, "delta": "0:00:02.217725", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout\\n\\nerror #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout", "rc": 4, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://upgrade3-wmeng-master-etcd-1:2379 cluster-health", "removes": null, "argv": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2018-08-28 06:36:06.526549", "msg": "non-zero return code"}\n', '')
fatal: [upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "/usr/local/bin/master-exec",
        "etcd",
        "etcd",
        "etcdctl",
        "--cert-file",
        "/etc/etcd/peer.crt",
        "--key-file",
        "/etc/etcd/peer.key",
        "--ca-file",
        "/etc/etcd/ca.crt",
        "-C",
        "https://upgrade3-wmeng-master-etcd-1:2379",
        "cluster-health"
    ],
    "delta": "0:00:02.217725",
    "end": "2018-08-28 06:36:08.744274",
    "invocation": {
        "module_args": {
            "_raw_params": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://upgrade3-wmeng-master-etcd-1:2379 cluster-health",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 4,
    "start": "2018-08-28 06:36:06.526549",
    "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout\n\nerror #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout",
    "stderr_lines": [
        "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout",
        "",
        "error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout"
    ],
    "stdout": "cluster may be unhealthy: failed to list members",
    "stdout_lines": [
        "cluster may be unhealthy: failed to list members"
    ]
}
        to retry, use: --limit @/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

Expected results:
upgrade succeeds

Additional info:
[root@upgrade3-wmeng-master-etcd-2 ~]# oc get pod --all-namespaces -o wide
NAMESPACE                          NAME                                             READY  STATUS       RESTARTS  AGE  IP            NODE
default                            docker-registry-1-xqmdl                          1/1    Running      2         6h   10.2.6.5      upgrade3-wmeng-node-infra-1
default                            registry-console-1-7xb8b                         1/1    Running      0         6h   10.2.2.2      upgrade3-wmeng-master-etcd-1
default                            router-1-642b9                                   1/1    Running      0         6h   10.240.0.151  upgrade3-wmeng-node-infra-2
default                            router-1-c2tgt                                   1/1    Running      0         6h   10.240.0.150  upgrade3-wmeng-node-infra-1
install-test                       mongodb-1-np62b                                  1/1    Running      3         6h   10.2.12.9     upgrade3-wmeng-node-1
install-test                       nodejs-mongodb-example-1-build                   0/1    Completed    0         6h   10.2.12.3     upgrade3-wmeng-node-1
install-test                       nodejs-mongodb-example-1-rbqdk                   1/1    Running      2         6h   10.2.12.8     upgrade3-wmeng-node-1
kube-service-catalog               apiserver-2fb5b                                  1/1    Running      0         6h   10.2.4.5      upgrade3-wmeng-master-etcd-2
kube-service-catalog               apiserver-vk5vb                                  1/1    Running      0         6h   10.2.2.4      upgrade3-wmeng-master-etcd-1
kube-service-catalog               apiserver-vtfdg                                  1/1    Running      1         6h   10.2.0.6      upgrade3-wmeng-master-etcd-3
kube-service-catalog               controller-manager-8x75f                         1/1    Running      0         6h   10.2.2.5      upgrade3-wmeng-master-etcd-1
kube-service-catalog               controller-manager-gb9jh                         1/1    Running      0         6h   10.2.4.6      upgrade3-wmeng-master-etcd-2
kube-service-catalog               controller-manager-xlbgs                         1/1    Running      2         6h   10.2.0.7      upgrade3-wmeng-master-etcd-3
kube-system                        master-api-upgrade3-wmeng-master-etcd-1          0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-1
kube-system                        master-api-upgrade3-wmeng-master-etcd-2          0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-2
kube-system                        master-api-upgrade3-wmeng-master-etcd-3          0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-3
kube-system                        master-controllers-upgrade3-wmeng-master-etcd-1  0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-1
kube-system                        master-controllers-upgrade3-wmeng-master-etcd-2  0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-2
kube-system                        master-controllers-upgrade3-wmeng-master-etcd-3  0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-3
kube-system                        master-etcd-upgrade3-wmeng-master-etcd-1         0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-1
kube-system                        master-etcd-upgrade3-wmeng-master-etcd-2         0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-2
kube-system                        master-etcd-upgrade3-wmeng-master-etcd-3         0/1    Pending      0         4h   <none>        upgrade3-wmeng-master-etcd-3
openshift-ansible-service-broker   asb-1-z68vx                                      1/1    Running      3         6h   10.2.10.8     upgrade3-wmeng-node-infra-2
openshift-infra                    hawkular-cassandra-1-wn674                       1/1    Running      2         6h   10.2.10.9     upgrade3-wmeng-node-infra-2
openshift-infra                    hawkular-metrics-7dhsm                           1/1    Running      1         6h   10.2.8.8      upgrade3-wmeng-node-2
openshift-infra                    hawkular-metrics-schema-kg2br                    0/1    Completed    0         6h   10.2.8.2      upgrade3-wmeng-node-2
openshift-infra                    heapster-gw9tr                                   1/1    Running      0         6h   10.2.4.4      upgrade3-wmeng-master-etcd-2
openshift-metrics-server           metrics-server-84b6c5c786-8c5c8                  1/1    Running      1         4h   10.2.8.7      upgrade3-wmeng-node-2
openshift-node                     sync-j582j                                       1/1    Running      0         4h   10.240.0.153  upgrade3-wmeng-node-1
openshift-node                     sync-jrzph                                       1/1    Running      0         4h   10.240.0.151  upgrade3-wmeng-node-infra-2
openshift-node                     sync-kvdr9                                       1/1    Running      0         4h   10.240.0.154  upgrade3-wmeng-node-2
openshift-node                     sync-rgc49                                       1/1    Running      0         4h   10.240.0.150  upgrade3-wmeng-node-infra-1
openshift-node                     sync-swzd5                                       1/1    Running      0         4h   10.240.0.143  upgrade3-wmeng-master-etcd-1
openshift-node                     sync-w2d4f                                       1/1    Running      0         4h   10.240.0.144  upgrade3-wmeng-master-etcd-2
openshift-node                     sync-wdx5k                                       1/1    Running      0         4h   10.240.0.145  upgrade3-wmeng-master-etcd-3
openshift-sdn                      ovs-4wfzp                                        1/1    Running      0         4h   10.240.0.150  upgrade3-wmeng-node-infra-1
openshift-sdn                      ovs-6np82                                        1/1    Running      0         4h   10.240.0.151  upgrade3-wmeng-node-infra-2
openshift-sdn                      ovs-99dr7                                        1/1    Running      0         6h   10.240.0.144  upgrade3-wmeng-master-etcd-2
openshift-sdn                      ovs-99zzs                                        1/1    Running      0         4h   10.240.0.154  upgrade3-wmeng-node-2
openshift-sdn                      ovs-fgb2w                                        1/1    Terminating  0         6h   10.240.0.145  upgrade3-wmeng-master-etcd-3
openshift-sdn                      ovs-t9n82                                        1/1    Terminating  0         6h   10.240.0.143  upgrade3-wmeng-master-etcd-1
openshift-sdn                      ovs-tb6hd                                        1/1    Running      0         4h   10.240.0.153  upgrade3-wmeng-node-1
openshift-sdn                      sdn-2r4sn                                        1/1    Running      1         6h   10.240.0.154  upgrade3-wmeng-node-2
openshift-sdn                      sdn-74gms                                        1/1    Running      1         4h   10.240.0.151  upgrade3-wmeng-node-infra-2
openshift-sdn                      sdn-7gpt5                                        1/1    Running      0         4h   10.240.0.153  upgrade3-wmeng-node-1
openshift-sdn                      sdn-9smsx                                        1/1    Running      0         4h   10.240.0.150  upgrade3-wmeng-node-infra-1
openshift-sdn                      sdn-fwm68                                        1/1    Running      0         4h   10.240.0.145  upgrade3-wmeng-master-etcd-3
openshift-sdn                      sdn-nx4m4                                        1/1    Running      0         6h   10.240.0.144  upgrade3-wmeng-master-etcd-2
openshift-sdn                      sdn-zrzp4                                        1/1    Terminating  0         6h   10.240.0.143  upgrade3-wmeng-master-etcd-1
openshift-template-service-broker  apiserver-f78lc                                  1/1    Running      0         6h   10.2.4.7      upgrade3-wmeng-master-etcd-2
openshift-template-service-broker  apiserver-t6q84                                  1/1    Running      1         6h   10.2.0.8      upgrade3-wmeng-master-etcd-3
openshift-template-service-broker  apiserver-v2569                                  1/1    Running      0         6h   10.2.2.6      upgrade3-wmeng-master-etcd-1
openshift-web-console              webconsole-785689b664-nxwdm                      1/1    Running      2         6h   10.2.0.9      upgrade3-wmeng-master-etcd-3
openshift-web-console              webconsole-785689b664-pxht7                      1/1    Running      1         6h   10.2.2.3      upgrade3-wmeng-master-etcd-1
openshift-web-console              webconsole-785689b664-sprwl                      1/1    Running      1         6h   10.2.4.3      upgrade3-wmeng-master-etcd-2
This is the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1616840

I've connected to the masters, downgraded the node, and run the failing command; the etcd service has been restored. We need to get the fix from that bug in and test again. It should be in the next build, or you can clone the master branch from GitHub.
Fixed.
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
Still hit this on openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch. Reopening to keep tracking it. A rough workaround is to restart the dnsmasq service while the etcd cluster verification task is running.
https://github.com/openshift/openshift-ansible/pull/9922
Hit this on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch, and the reproduction rate is very high.
After re-running the upgrade job against the last failed job, [etcd : Verify cluster is healthy pre-upgrade] passed, but the upgrade failed at the following task:

TASK [openshift_node : Approve node certificates when bootstrapping] ***********
<--snip-->
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com -> qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com]: FAILED! => {"attempts": 30, "changed": false, "msg": "The connection to the server qe-jialiu310-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "state": "unknown"}

The master API static pod restarts again and again; its log shows the following:

I0906 09:26:12.355155 1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu310-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0906 09:26:22.356339 1 start_api.go:68] context deadline exceeded

That makes my whole upgrade harder.
CI tests are failing on #9922 for some reason. I'll look into it today.
I've added a retry loop around our etcd health check so that it retries every 6 seconds for 180 seconds. https://github.com/openshift/openshift-ansible/pull/10026 If the problem still persists after that and we see signs that it's tied to DNS resolution we should track that as part of https://bugzilla.redhat.com/show_bug.cgi?id=1624448
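For readers who want to reproduce the retry behaviour by hand (check every 6 seconds, give up after roughly 180 seconds), here is a minimal Python sketch. It is not the content of PR 10026, which changes the Ansible role; the function names `run_health_check` and `wait_for_etcd` are hypothetical, and the command is simply copied from the failing task output in this report.

```python
import subprocess
import time

# Command copied from the failing task output in this bug; adjust the
# endpoint and certificate paths for your own cluster.
HEALTH_CMD = [
    "/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl",
    "--cert-file", "/etc/etcd/peer.crt",
    "--key-file", "/etc/etcd/peer.key",
    "--ca-file", "/etc/etcd/ca.crt",
    "-C", "https://upgrade3-wmeng-master-etcd-1:2379",
    "cluster-health",
]


def run_health_check(cmd=HEALTH_CMD):
    """Hypothetical helper: True if the health command exits with status 0."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The binary is missing or not executable; treat as unhealthy.
        return False
    return result.returncode == 0


def wait_for_etcd(check=run_health_check, delay=6, timeout=180, sleep=time.sleep):
    """Retry check() every `delay` seconds for at most timeout // delay attempts.

    Returns True as soon as one check succeeds, False if all attempts fail.
    """
    attempts = timeout // delay
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            # Do not sleep after the final failed attempt.
            sleep(delay)
    return False


if __name__ == "__main__":
    print("etcd healthy" if wait_for_etcd() else "etcd still unhealthy")
```

The `check` and `sleep` parameters are injectable only so the loop can be exercised without a real cluster; with the defaults it makes up to 30 attempts.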
PR 10026 has been merged into openshift-ansible-3.11.3-1; please check the bug.
blocked by bug 1628730
Fixed.
openshift-ansible-3.11.5-1.git.0.5a01a3c.el7_5.noarch
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
Reopening, as I hit this issue again.
openshift-ansible-3.11.9-1.git.0.63f7970.el7_5.noarch
After restarting the dnsmasq service, the cluster works.
Weihua, I think that's https://bugzilla.redhat.com/show_bug.cgi?id=1624448 which has a fix merged but it hasn't been built yet.
Fixed. openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch