Bug 1623145

Summary:	upgrade failed at TASK [etcd : Verify cluster is healthy pre-upgrade]
Product:	OpenShift Container Platform	Reporter:	Weihua Meng <wmeng>
Component:	Cluster Version Operator	Assignee:	Scott Dodson <sdodson>
Status:	CLOSED ERRATA	QA Contact:	Weihua Meng <wmeng>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.11.0	CC:	aos-bugs, dmace, jiajliu, jialiu, jokerman, mmccomas, wmeng, wsun
Target Milestone:	---
Target Release:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-11-26 15:51:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1624448
Bug Blocks:

Description Weihua Meng 2018-08-28 14:53:54 UTC

Description of problem:
upgrade failed at TASK [etcd : Verify cluster is healthy pre-upgrade]

Likely caused by ovs pods stuck in Terminating status.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch

How reproducible:
50%

Steps to Reproduce:
1. upgrade OCP v3.10 HA cluster to v3.11

Actual results:
TASK [etcd : Verify cluster is healthy pre-upgrade] ****************************
task path: /home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/etcd/tasks/upgrade_static.yml:6
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> ESTABLISH SSH CONNECTION FOR USER: root
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave2/.ansible/cp/%C upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com> (1, '\n{"changed": true, "end": "2018-08-28 06:36:08.744274", "stdout": "cluster may be unhealthy: failed to list members", "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "-C", "https://upgrade3-wmeng-master-etcd-1:2379", "cluster-health"], "failed": true, "delta": "0:00:02.217725", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout\\n\\nerror #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout", "rc": 4, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://upgrade3-wmeng-master-etcd-1:2379 cluster-health", "removes": null, "argv": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2018-08-28 06:36:06.526549", "msg": "non-zero return code"}\n', '')
fatal: [upgrade3-wmeng-master-etcd-1.0828-wz0.qe.rhcloud.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "/usr/local/bin/master-exec", 
        "etcd", 
        "etcd", 
        "etcdctl", 
        "--cert-file", 
        "/etc/etcd/peer.crt", 
        "--key-file", 
        "/etc/etcd/peer.key", 
        "--ca-file", 
        "/etc/etcd/ca.crt", 
        "-C", 
        "https://upgrade3-wmeng-master-etcd-1:2379", 
        "cluster-health"
    ], 
    "delta": "0:00:02.217725", 
    "end": "2018-08-28 06:36:08.744274", 
    "invocation": {
        "module_args": {
            "_raw_params": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://upgrade3-wmeng-master-etcd-1:2379 cluster-health", 
            "_uses_shell": false, 
            "argv": null, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 4, 
    "start": "2018-08-28 06:36:06.526549", 
    "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout\n\nerror #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout", 
    "stderr_lines": [
        "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout", 
        "", 
        "error #0: client: endpoint https://upgrade3-wmeng-master-etcd-1:2379 exceeded header timeout"
    ], 
    "stdout": "cluster may be unhealthy: failed to list members", 
    "stdout_lines": [
        "cluster may be unhealthy: failed to list members"
    ]
}
	to retry, use: --limit @/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

Expected results:
upgrade succeeds

Additional info:
[root@upgrade3-wmeng-master-etcd-2 ~]# oc get pod --all-namespaces -o wide
NAMESPACE                           NAME                                              READY     STATUS        RESTARTS   AGE       IP             NODE
default                             docker-registry-1-xqmdl                           1/1       Running       2          6h        10.2.6.5       upgrade3-wmeng-node-infra-1
default                             registry-console-1-7xb8b                          1/1       Running       0          6h        10.2.2.2       upgrade3-wmeng-master-etcd-1
default                             router-1-642b9                                    1/1       Running       0          6h        10.240.0.151   upgrade3-wmeng-node-infra-2
default                             router-1-c2tgt                                    1/1       Running       0          6h        10.240.0.150   upgrade3-wmeng-node-infra-1
install-test                        mongodb-1-np62b                                   1/1       Running       3          6h        10.2.12.9      upgrade3-wmeng-node-1
install-test                        nodejs-mongodb-example-1-build                    0/1       Completed     0          6h        10.2.12.3      upgrade3-wmeng-node-1
install-test                        nodejs-mongodb-example-1-rbqdk                    1/1       Running       2          6h        10.2.12.8      upgrade3-wmeng-node-1
kube-service-catalog                apiserver-2fb5b                                   1/1       Running       0          6h        10.2.4.5       upgrade3-wmeng-master-etcd-2
kube-service-catalog                apiserver-vk5vb                                   1/1       Running       0          6h        10.2.2.4       upgrade3-wmeng-master-etcd-1
kube-service-catalog                apiserver-vtfdg                                   1/1       Running       1          6h        10.2.0.6       upgrade3-wmeng-master-etcd-3
kube-service-catalog                controller-manager-8x75f                          1/1       Running       0          6h        10.2.2.5       upgrade3-wmeng-master-etcd-1
kube-service-catalog                controller-manager-gb9jh                          1/1       Running       0          6h        10.2.4.6       upgrade3-wmeng-master-etcd-2
kube-service-catalog                controller-manager-xlbgs                          1/1       Running       2          6h        10.2.0.7       upgrade3-wmeng-master-etcd-3
kube-system                         master-api-upgrade3-wmeng-master-etcd-1           0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-1
kube-system                         master-api-upgrade3-wmeng-master-etcd-2           0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-2
kube-system                         master-api-upgrade3-wmeng-master-etcd-3           0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-3
kube-system                         master-controllers-upgrade3-wmeng-master-etcd-1   0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-1
kube-system                         master-controllers-upgrade3-wmeng-master-etcd-2   0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-2
kube-system                         master-controllers-upgrade3-wmeng-master-etcd-3   0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-3
kube-system                         master-etcd-upgrade3-wmeng-master-etcd-1          0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-1
kube-system                         master-etcd-upgrade3-wmeng-master-etcd-2          0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-2
kube-system                         master-etcd-upgrade3-wmeng-master-etcd-3          0/1       Pending       0          4h        <none>         upgrade3-wmeng-master-etcd-3
openshift-ansible-service-broker    asb-1-z68vx                                       1/1       Running       3          6h        10.2.10.8      upgrade3-wmeng-node-infra-2
openshift-infra                     hawkular-cassandra-1-wn674                        1/1       Running       2          6h        10.2.10.9      upgrade3-wmeng-node-infra-2
openshift-infra                     hawkular-metrics-7dhsm                            1/1       Running       1          6h        10.2.8.8       upgrade3-wmeng-node-2
openshift-infra                     hawkular-metrics-schema-kg2br                     0/1       Completed     0          6h        10.2.8.2       upgrade3-wmeng-node-2
openshift-infra                     heapster-gw9tr                                    1/1       Running       0          6h        10.2.4.4       upgrade3-wmeng-master-etcd-2
openshift-metrics-server            metrics-server-84b6c5c786-8c5c8                   1/1       Running       1          4h        10.2.8.7       upgrade3-wmeng-node-2
openshift-node                      sync-j582j                                        1/1       Running       0          4h        10.240.0.153   upgrade3-wmeng-node-1
openshift-node                      sync-jrzph                                        1/1       Running       0          4h        10.240.0.151   upgrade3-wmeng-node-infra-2
openshift-node                      sync-kvdr9                                        1/1       Running       0          4h        10.240.0.154   upgrade3-wmeng-node-2
openshift-node                      sync-rgc49                                        1/1       Running       0          4h        10.240.0.150   upgrade3-wmeng-node-infra-1
openshift-node                      sync-swzd5                                        1/1       Running       0          4h        10.240.0.143   upgrade3-wmeng-master-etcd-1
openshift-node                      sync-w2d4f                                        1/1       Running       0          4h        10.240.0.144   upgrade3-wmeng-master-etcd-2
openshift-node                      sync-wdx5k                                        1/1       Running       0          4h        10.240.0.145   upgrade3-wmeng-master-etcd-3
openshift-sdn                       ovs-4wfzp                                         1/1       Running       0          4h        10.240.0.150   upgrade3-wmeng-node-infra-1
openshift-sdn                       ovs-6np82                                         1/1       Running       0          4h        10.240.0.151   upgrade3-wmeng-node-infra-2
openshift-sdn                       ovs-99dr7                                         1/1       Running       0          6h        10.240.0.144   upgrade3-wmeng-master-etcd-2
openshift-sdn                       ovs-99zzs                                         1/1       Running       0          4h        10.240.0.154   upgrade3-wmeng-node-2
openshift-sdn                       ovs-fgb2w                                         1/1       Terminating   0          6h        10.240.0.145   upgrade3-wmeng-master-etcd-3
openshift-sdn                       ovs-t9n82                                         1/1       Terminating   0          6h        10.240.0.143   upgrade3-wmeng-master-etcd-1
openshift-sdn                       ovs-tb6hd                                         1/1       Running       0          4h        10.240.0.153   upgrade3-wmeng-node-1
openshift-sdn                       sdn-2r4sn                                         1/1       Running       1          6h        10.240.0.154   upgrade3-wmeng-node-2
openshift-sdn                       sdn-74gms                                         1/1       Running       1          4h        10.240.0.151   upgrade3-wmeng-node-infra-2
openshift-sdn                       sdn-7gpt5                                         1/1       Running       0          4h        10.240.0.153   upgrade3-wmeng-node-1
openshift-sdn                       sdn-9smsx                                         1/1       Running       0          4h        10.240.0.150   upgrade3-wmeng-node-infra-1
openshift-sdn                       sdn-fwm68                                         1/1       Running       0          4h        10.240.0.145   upgrade3-wmeng-master-etcd-3
openshift-sdn                       sdn-nx4m4                                         1/1       Running       0          6h        10.240.0.144   upgrade3-wmeng-master-etcd-2
openshift-sdn                       sdn-zrzp4                                         1/1       Terminating   0          6h        10.240.0.143   upgrade3-wmeng-master-etcd-1
openshift-template-service-broker   apiserver-f78lc                                   1/1       Running       0          6h        10.2.4.7       upgrade3-wmeng-master-etcd-2
openshift-template-service-broker   apiserver-t6q84                                   1/1       Running       1          6h        10.2.0.8       upgrade3-wmeng-master-etcd-3
openshift-template-service-broker   apiserver-v2569                                   1/1       Running       0          6h        10.2.2.6       upgrade3-wmeng-master-etcd-1
openshift-web-console               webconsole-785689b664-nxwdm                       1/1       Running       2          6h        10.2.0.9       upgrade3-wmeng-master-etcd-3
openshift-web-console               webconsole-785689b664-pxht7                       1/1       Running       1          6h        10.2.2.3       upgrade3-wmeng-master-etcd-1
openshift-web-console               webconsole-785689b664-sprwl                       1/1       Running       1          6h        10.2.4.3       upgrade3-wmeng-master-etcd-2
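
The pods stuck in Terminating can be picked out of a listing like the one above with a short script. This is a minimal sketch, assuming the eight-column layout of `oc get pod --all-namespaces -o wide` shown here; the function name `terminating_pods` and the shortened sample are illustrative only, not part of openshift-ansible.

```python
from typing import List, Tuple


def terminating_pods(listing: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, node) for every pod whose STATUS is Terminating."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
        if len(fields) == 8 and fields[3] == "Terminating":
            found.append((fields[0], fields[1], fields[7]))
    return found


# A few rows copied from the listing in this report.
SAMPLE = """\
NAMESPACE       NAME        READY     STATUS        RESTARTS   AGE       IP             NODE
openshift-sdn   ovs-99dr7   1/1       Running       0          6h        10.240.0.144   upgrade3-wmeng-master-etcd-2
openshift-sdn   ovs-fgb2w   1/1       Terminating   0          6h        10.240.0.145   upgrade3-wmeng-master-etcd-3
openshift-sdn   ovs-t9n82   1/1       Terminating   0          6h        10.240.0.143   upgrade3-wmeng-master-etcd-1
openshift-sdn   sdn-zrzp4   1/1       Terminating   0          6h        10.240.0.143   upgrade3-wmeng-master-etcd-1
"""

for namespace, name, node in terminating_pods(SAMPLE):
    print(f"{namespace}/{name} on {node}")
```

In this report the Terminating pods sit on master-etcd-1 and master-etcd-3, and master-etcd-1 is the host where the etcd health check failed.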

Comment 2 Scott Dodson 2018-08-28 15:18:42 UTC

This is the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1616840

I've connected to the masters, downgraded the node, and run the failing command, and the etcd service has been restored. We need to get the fix from that bug in and test again. It should be in the next build, or you can clone the master branch from GitHub.

Comment 3 Weihua Meng 2018-08-29 12:15:24 UTC

fixed.
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 4 liujia 2018-09-05 03:42:30 UTC

Still hitting it on openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch.

Re-opening to keep tracking it. A rough workaround is to restart the dnsmasq service while the etcd cluster verification task is running.

Comment 5 Scott Dodson 2018-09-05 20:50:02 UTC

https://github.com/openshift/openshift-ansible/pull/9922

Comment 6 Johnny Liu 2018-09-06 08:38:09 UTC

Hit this on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch, and the reproduction rate is very high.

Comment 7 Johnny Liu 2018-09-06 09:34:09 UTC

After re-running the upgrade job against the last failed job, [etcd : Verify cluster is healthy pre-upgrade] passed, but the upgrade failed at the following task:

TASK [openshift_node : Approve node certificates when bootstrapping] ***********
<--snip-->
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).

fatal: [qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com -> qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com]: FAILED! => {"attempts": 30, "changed": false, "msg": "The connection to the server qe-jialiu310-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "state": "unknown"}

The master API static pod restarts again and again; its log shows the following:
I0906 09:26:12.355155       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu310-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0906 09:26:22.356339       1 start_api.go:68] context deadline exceeded


That makes my whole upgrade harder.

Comment 8 Scott Dodson 2018-09-10 12:08:49 UTC

CI tests are failing on #9922 for some reason. I'll look into it today.

Comment 14 Scott Dodson 2018-09-12 13:26:32 UTC

I've added a retry loop around our etcd health check so that it retries every 6 seconds for 180 seconds.

https://github.com/openshift/openshift-ansible/pull/10026

If the problem still persists after that and we see signs that it's tied to DNS resolution we should track that as part of https://bugzilla.redhat.com/show_bug.cgi?id=1624448
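
The retry behaviour described above (one attempt every 6 seconds for 180 seconds, so 30 attempts) can be sketched roughly as follows. This is not the code from PR 10026, which changes an Ansible task; it is only an illustration in Python. The names `retry_until` and `etcd_is_healthy` are hypothetical, and the command line is copied from the failing task output in this report.

```python
import subprocess
import time
from typing import Callable

# Command copied from the failing task output in this report.
ETCD_HEALTH_CMD = [
    "/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl",
    "--cert-file", "/etc/etcd/peer.crt",
    "--key-file", "/etc/etcd/peer.key",
    "--ca-file", "/etc/etcd/ca.crt",
    "-C", "https://upgrade3-wmeng-master-etcd-1:2379",
    "cluster-health",
]


def etcd_is_healthy() -> bool:
    """Run the health check once; a zero exit code is treated as healthy."""
    return subprocess.run(ETCD_HEALTH_CMD, capture_output=True).returncode == 0


def retry_until(
    check: Callable[[], bool],
    delay: float = 6.0,
    timeout: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True or the attempts run out."""
    attempts = int(timeout // delay)  # 180 / 6 = 30 attempts
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

On a master host this would be called as retry_until(etcd_is_healthy). The `sleep` parameter exists only so the loop can be exercised without real waiting.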

Comment 15 Wei Sun 2018-09-13 05:24:57 UTC

PR 10026 has been merged into openshift-ansible-3.11.3-1; please check the bug.

Comment 17 Weihua Meng 2018-09-14 09:16:22 UTC

blocked by bug 1628730

Comment 18 Weihua Meng 2018-09-14 09:20:44 UTC

Fixed.

openshift-ansible-3.11.5-1.git.0.5a01a3c.el7_5.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 19 Weihua Meng 2018-09-19 09:11:40 UTC

Re-opening, as I hit this issue again.
openshift-ansible-3.11.9-1.git.0.63f7970.el7_5.noarch

After restarting the dnsmasq service, the cluster works.

Comment 21 Scott Dodson 2018-09-19 11:55:41 UTC

Weihua,

I think that's https://bugzilla.redhat.com/show_bug.cgi?id=1624448 which has a fix merged but it hasn't been built yet.

Comment 22 Weihua Meng 2018-09-21 05:59:08 UTC

Fixed.

openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch