Bug 1616840
Summary: upgrade failed at TASK [openshift_node : Wait for node to be ready] due to etcd connection is broken
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.11.0
Target Release: 3.11.0
Reporter: Weihua Meng <wmeng>
Assignee: Scott Dodson <sdodson>
QA Contact: Weihua Meng <wmeng>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aos-bugs, jiajliu, jialiu, jokerman, mmccomas, wmeng, xtian
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-10-11 07:24:57 UTC
Bug Blocks: 1599428
Description
Weihua Meng
2018-08-16 02:00:51 UTC
Can we please get a complete journal log from the host?

```
journalctl --no-pager > journal.log
```

Can we also get the complete contents of /etc/origin and /etc/etcd?

```
tar czvf debug-data.tar.gz /etc/origin /etc/etcd
```

We need to isolate when dnsmasq is restarted and what state it's in based on the journal. We need to review the etcd configuration and the API server configuration as well.

This can also be reproduced in a fresh install (my cluster is running behind a proxy). The installation completes successfully and everything goes well, but after a docker restart the api static pod cannot come back because the etcd connection is broken. Exactly the same error log as the master api log in comment 0.

```
[root@qe-jialiu311-auto-hygp-men-1 ~]# ETCDCTL_API=3 etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jialiu311-auto-hygp-men-1:2379
https://qe-jialiu311-auto-hygp-men-1:2379 is unhealthy: failed to connect: context deadline exceeded
Error: unhealthy cluster
[root@qe-jialiu311-auto-hygp-men-1 ~]# systemctl restart dnsmasq
[root@qe-jialiu311-auto-hygp-men-1 ~]# ETCDCTL_API=3 etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jialiu311-auto-hygp-men-1:2379
https://qe-jialiu311-auto-hygp-men-1:2379 is healthy: successfully committed proposal: took = 1.072853ms
[root@qe-jialiu311-auto-hygp-men-1 ~]# rpm -q dnsmasq
dnsmasq-2.76-5.el7.x86_64
[root@qe-jialiu311-auto-hygp-men-1 ~]# uname -r
3.10.0-862.9.1.el7.x86_64
# oc version
oc v3.11.0-0.22.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
openshift-ansible-3.11.0-0.22.0.git.0.053546a.noarch
```

(In reply to Johnny Liu from comment #5)
> This is also could be reproduced in fresh install, (my cluster is running
> behind proxy), installation is completed successfully, everything is going
> well, but after a docker reboot, api static pod can not come back due to
> etcd connection broken. Totally the same error log as master api log in
> comment 0.

I'm looking into the hosts from comment #4; these hosts are atomic hosts. https://bugzilla.redhat.com/show_bug.cgi?id=1617976 may be related to the problem on atomic host, but I'm not sure. These hosts have very

Johnny,

Is your host an atomic host too?

Specific to atomic hosts, this bug is also relevant: the version of container-selinux on the hosts from comment #4 is definitely affected by it, and it is known to cause serious problems with dnsmasq in 3.10 and later. https://bugzilla.redhat.com/show_bug.cgi?id=1591281

To rule this out we should ensure that, prior to installation of 3.10, we're running Atomic Host 7.5.2, which should have the fix for that bug.

In both of these clusters dnsmasq was configured to route queries to 127.0.0.1:53, and the sdn pod, which binds to 127.0.0.1:53, was in a failed state. Sending the SIGUSR1 signal to dnsmasq causes it to dump stats, and they show 14089 failed queries to 127.0.0.1.

```
Aug 24 06:01:22 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Aug 24 06:01:22 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: using nameserver 127.0.0.1#53 for domain cluster.local
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: time 1535131517
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: cache size 10000, 0/3101 cache insertions re-used unexpired cache entries.
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: queries forwarded 43172, queries answered locally 16293
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: queries for authoritative zones 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.3#53: queries sent 13579, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.2#53: queries sent 6237, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.11#53: queries sent 5664, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 127.0.0.1#53: queries sent 28200, retried or failed 14089
```

When I look at the sdn pod for the master I see that it has been restarted 102 times over 8 hours. Looking at the logs from a failed sdn pod, we see that it's looping because the API is not up. The API is not up because dnsmasq has gotten wedged, so we've got a deadlock.

```
# docker logs 2fa3f347e55c
2018/08/24 17:27:55 socat[3603] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/qe-jialiu311-auto-hygp-men-1:8443/system:admin" modified.
I0824 17:27:56.011156    3586 start_network.go:189] Reading node configuration from /etc/origin/node/node-config.yaml
I0824 17:27:56.013544    3586 start_network.go:196] Starting node networking qe-jialiu311-auto-hygp-men-1 (v3.11.0-0.22.0)
W0824 17:27:56.013829    3586 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0824 17:27:56.013948    3586 feature_gate.go:230] feature gates: &{map[]}
I0824 17:27:56.015499    3586 transport.go:160] Refreshing client certificate from store
I0824 17:27:56.015588    3586 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0824 17:27:56.030529    3586 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "qe-jialiu311-auto-hygp-men-1" (IP ""), iptables sync period "30s"
I0824 17:28:06.044128    3586 node.go:289] Starting openshift-sdn network plugin
F0824 17:28:16.052148    3586 network.go:46] SDN node startup failed: failed to validate network configuration: cannot fetch "default" cluster network: Get https://qe-jialiu311-auto-hygp-men-1:8443/apis/network.openshift.io/v1/clusternetworks/default: dial tcp [fe80::f816:3eff:fec7:d933%eth0]:8443: connect: connection refused
```

I wonder if, when the sdn pod terminates abnormally, it should send a dbus message to dnsmasq to clear out the 127.0.0.1 entry. I was able to restore dnsmasq to normal operation this way:

```
/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:
```

(In reply to Scott Dodson from comment #6)
> I'm looking into the hosts from comment #4; these hosts are atomic hosts.
>
> Johnny,
>
> Is your host an atomic host too?

I tried both an atomic host and a RHEL host; both reproduce this issue. Glad to see you got into my environment, as the following comments show. If you need QE to preserve an environment for more debugging, let us know.

I was trying to add the dbus send to the DS and I've discovered that the node cannot patch the pod object.
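As an aside on triaging dumps like the stats output above: the per-server "retried or failed" counters make a wedged upstream easy to spot mechanically. A minimal sketch, assuming the dnsmasq 2.76 journal format shown in this thread; the helper name `flag_failed_upstreams` is illustrative, not an existing tool:

```shell
# Read dnsmasq journal lines on stdin and print any upstream server
# whose "retried or failed" counter is non-zero, e.g. the wedged
# 127.0.0.1#53 forwarder left behind by a dead sdn pod.
flag_failed_upstreams() {
  awk '
    /server .*#53: queries sent/ {
      # e.g. "... dnsmasq[17634]: server 127.0.0.1#53: queries sent 28200, retried or failed 14089"
      for (i = 1; i <= NF; i++) if ($i == "server") srv = $(i + 1)
      sub(/:$/, "", srv)   # drop the trailing colon after the address
      failed = $NF + 0     # last field is the failed count
      if (failed > 0) print srv, failed
    }'
}
```

Fed the stats block above (e.g. via `journalctl --no-pager | flag_failed_upstreams`), only the 127.0.0.1#53 line would be reported, since the three 172.16.120.x upstreams all show zero failures.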
This seems to be related to the fact that we've upgraded the kubelet ahead of the control plane on control plane hosts. We should resolve that first, then reconsider whether we need to add the dbus-send to clear dnsmasq's dbus config when the pod exits. I still think that's a good idea; if for some reason the sdn pod were removed, it would leave dnsmasq in a sane working state.

Mike's PR here removes the node upgrade; working on testing that.

https://github.com/openshift/openshift-ansible/pull/9758

Removing testblocker because we did not hit this issue in v3.11.0-0.24.0.

Fixed.

openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
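The manual recovery used in this thread (reset dnsmasq's dbus-pushed servers, then re-check etcd) could be collected into one sketch. This is not the shipped fix: the dbus-send call and the etcdctl flags are the exact ones quoted in the comments above, while the function name and its host argument are illustrative.

```shell
# Sketch of the manual recovery from this thread: clear the stale
# 127.0.0.1#53 entries that a dead sdn pod left in dnsmasq via dbus,
# then confirm the etcd endpoint is reachable again.
recover_dnsmasq_then_check_etcd() {
  local etcd_host=${1:?usage: recover_dnsmasq_then_check_etcd <etcd-host>}

  # An empty array:string: resets dnsmasq's dbus-configured domain servers.
  dbus-send --system --dest=uk.org.thekelleys.dnsmasq \
    /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers \
    array:string: || return 1

  # Same health probe used earlier in the thread.
  ETCDCTL_API=3 etcdctl \
    --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key \
    --cacert /etc/etcd/ca.crt \
    endpoint health --endpoints "https://${etcd_host}:2379"
}
```

Note this only restores name resolution for already-pushed stale entries; the actual fix landed in openshift-ansible, per the PR and errata above.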