Bug 1616840

Summary:	upgrade failed at TASK [openshift_node : Wait for node to be ready] due to etcd connection is broken
Product:	OpenShift Container Platform	Reporter:	Weihua Meng <wmeng>
Component:	Cluster Version Operator	Assignee:	Scott Dodson <sdodson>
Status:	CLOSED ERRATA	QA Contact:	Weihua Meng <wmeng>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.11.0	CC:	aos-bugs, jiajliu, jialiu, jokerman, mmccomas, wmeng, xtian
Target Milestone:	---
Target Release:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-11 07:24:57 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1599428

Description Weihua Meng 2018-08-16 02:00:51 UTC

Description of problem:
Open this bug according to https://bugzilla.redhat.com/show_bug.cgi?id=1612144#c7

"Wait for node to be ready", actually node was ready and running well, but master api service can not start.

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
b7f61961eb34        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   3 minutes ago       Exited (255) 2 minutes ago                        k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_6
996b3943ebef        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                                "/usr/bin/pod"           11 minutes ago      Up 11 minutes                                     k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Checked master api log:

I0815 07:44:09.377332       1 master_config.go:539] Using the lease endpoint reconciler with TTL=15s and interval=10s
I0815 07:44:09.377438       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jliu-a-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0815 07:44:19.379486       1 start_api.go:68] context deadline exceeded

It seems that etcd client can not connect etcd server with hostname:2379 correctly. And with ip:2379, it works well.

[root@qe-jliu-a-master-etcd-1 ~]# cat /etc/origin/master/master-config.yaml |grep -A 2 urls  urls:
  - https://qe-jliu-a-master-etcd-1:2379
etcdStorageConfig:

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is unhealthy: failed to connect: context deadline exceeded
Error:  unhealthy cluster

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://10.240.0.123:2379
https://10.240.0.123:2379 is healthy: successfully committed proposal: took = 1.279727ms


But change master-config to use etcd ip, then master api service start.

[root@qe-jliu-a-master-etcd-1 ~]# docker ps -a|grep master-api
2c6181ba9f66        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   44 seconds ago       Up 43 seconds                                         k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_9
b52f02c92012        7f9f698aa5c8                                                                                                                                        "/bin/bash -c '#!/..."   6 minutes ago        Exited (255) 5 minutes ago                            k8s_api_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_8
996b3943ebef        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.11.0-0.14.0                                                                                "/usr/bin/pod"           24 minutes ago       Up 24 minutes                                         k8s_POD_master-api-qe-jliu-a-master-etcd-1_kube-system_78f241e998dc71d63daee76102c5ef51_0

Or, restart dnsmasq, master api can start correctly.

Moreover, before upgrade(v3.10), with etcd hostname, client can connect server correctly.

[root@qe-jliu-a-master-etcd-1 ~]# ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jliu-a-master-etcd-1:2379
https://qe-jliu-a-master-etcd-1:2379 is healthy: successfully committed proposal: took = 1.301758ms

Version-Release number of the following components:
openshift-ansible-3.11.0-0.14.0.git.0.7bd4429None.noarch

How reproducible:
75%

Steps to Reproduce:
1. upgrade OCP v3.10 to v3.11

Actual results:
upgrade failed at TASK [openshift_node : Wait for node to be ready]

Expected results:
Upgrade succeeds

Comment 1 Scott Dodson 2018-08-16 13:26:17 UTC

Can we please get a complete journal log from the host?
journalctl --no-pager > journal.log

Can we also get the complete contents of /etc/origin and /etc/etcd

tar czvf debug-data.tar.gz /etc/origin /etc/etcd

We need to isolated when dnsmasq is restarted and what state it's in based on the journal. We need to review etcd configuration and api server configuration as well.

Comment 5 Johnny Liu 2018-08-24 09:57:56 UTC

This is also could be reproduced in fresh install, (my cluster is running behind proxy), installation is completed successfully, everything is going well, but after a docker reboot, api static pod can not come back due to etcd connection broken. Totally the same error log as master api log in comment 0.

[root@qe-jialiu311-auto-hygp-men-1 ~]# ETCDCTL_API=3 etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jialiu311-auto-hygp-men-1:2379
https://qe-jialiu311-auto-hygp-men-1:2379 is unhealthy: failed to connect: context deadline exceeded
Error:  unhealthy cluster
[root@qe-jialiu311-auto-hygp-men-1 ~]# systemctl restart dnsmasq
[root@qe-jialiu311-auto-hygp-men-1 ~]# ETCDCTL_API=3 etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt endpoint health --endpoints https://qe-jialiu311-auto-hygp-men-1:2379
https://qe-jialiu311-auto-hygp-men-1:2379 is healthy: successfully committed proposal: took = 1.072853ms

[root@qe-jialiu311-auto-hygp-men-1 ~]# rpm -q dnsmasq
dnsmasq-2.76-5.el7.x86_64

[root@qe-jialiu311-auto-hygp-men-1 ~]# uname -r
3.10.0-862.9.1.el7.x86_64

# oc version
oc v3.11.0-0.22.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift-ansible-3.11.0-0.22.0.git.0.053546aNone.noarch

Comment 6 Scott Dodson 2018-08-24 14:27:15 UTC

(In reply to Johnny Liu from comment #5)
> This is also could be reproduced in fresh install, (my cluster is running
> behind proxy), installation is completed successfully, everything is going
> well, but after a docker reboot, api static pod can not come back due to
> etcd connection broken. Totally the same error log as master api log in
> comment 0.
> 

I'm looking into the hosts from comment #4, these hosts are atomic host.
https://bugzilla.redhat.com/show_bug.cgi?id=1617976 may be related to the problem on atomic host but I'm not sure. These hosts have very 

Johnny,

Is your host also atomic host too?

Comment 7 Scott Dodson 2018-08-24 15:22:07 UTC

Specific to atomic host, this bug is also relevant and the version of container-selinux on the hosts from comment #4 are definitely affected by this bug which is known to cause serious problems with dnsmasq in 3.10 and later.

https://bugzilla.redhat.com/show_bug.cgi?id=1591281

In order to rule this out we should ensure that prior to installation of 3.10 we're running Atomic Host 7.5.2 which should have the fix for that bug.

Comment 9 Scott Dodson 2018-08-24 17:38:37 UTC

In both of these clusters dnsmasq seemed to be configured to route queries to 127.0.0.1:53 and the sdn pod which binds to 127.0.0.1:53 was in a failed state. Sending SIGUSR1 signal to dnsmasq causes it to dump stats and it shows 14089 failed queries to 127.0.0.1.

Aug 24 06:01:22 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Aug 24 06:01:22 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: using nameserver 127.0.0.1#53 for domain cluster.local
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: time 1535131517
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: cache size 10000, 0/3101 cache insertions re-used unexpired cache entries.
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: queries forwarded 43172, queries answered locally 16293
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: queries for authoritative zones 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.3#53: queries sent 13579, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.2#53: queries sent 6237, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 172.16.120.11#53: queries sent 5664, retried or failed 0
Aug 24 13:25:17 qe-jialiu311-auto-hygp-men-1 dnsmasq[17634]: server 127.0.0.1#53: queries sent 28200, retried or failed 14089

When I look at the sdn pod for the master I see that it's been restarted 102 times over 8h.

Looking at the logs from a failed sdn pod we see that it's looping because the api is not up. The API is not up because dnsmasq has gotten wedged, so we've got a deadlock.

# docker logs 2fa3f347e55c
2018/08/24 17:27:55 socat[3603] E connect(5, AF=1 "/var/run/openshift-sdn/cni-server.sock", 40): Connection refused
User "sa" set.
Context "default/qe-jialiu311-auto-hygp-men-1:8443/system:admin" modified.
I0824 17:27:56.011156    3586 start_network.go:189] Reading node configuration from /etc/origin/node/node-config.yaml
I0824 17:27:56.013544    3586 start_network.go:196] Starting node networking qe-jialiu311-auto-hygp-men-1 (v3.11.0-0.22.0)
W0824 17:27:56.013829    3586 server.go:195] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0824 17:27:56.013948    3586 feature_gate.go:230] feature gates: &{map[]}
I0824 17:27:56.015499    3586 transport.go:160] Refreshing client certificate from store
I0824 17:27:56.015588    3586 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
I0824 17:27:56.030529    3586 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "qe-jialiu311-auto-hygp-men-1" (IP ""), iptables sync period "30s"
I0824 17:28:06.044128    3586 node.go:289] Starting openshift-sdn network plugin
F0824 17:28:16.052148    3586 network.go:46] SDN node startup failed: failed to validate network configuration: cannot fetch "default" cluster network: Get https://qe-jialiu311-auto-hygp-men-1:8443/apis/network.openshift.io/v1/clusternetworks/default: dial tcp [fe80::f816:3eff:fec7:d933%eth0]:8443: connect: connection refused

I wonder if when sdn pod is terminating abnormally it should send a dbus message to dnsmasq to clear out the 127.0.0.1 entry. I was able to restore dnsmasq to normal operation this way.

/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:

Comment 10 Johnny Liu 2018-08-27 02:41:11 UTC

(In reply to Scott Dodson from comment #6)
> (In reply to Johnny Liu from comment #5)
> > This is also could be reproduced in fresh install, (my cluster is running
> > behind proxy), installation is completed successfully, everything is going
> > well, but after a docker reboot, api static pod can not come back due to
> > etcd connection broken. Totally the same error log as master api log in
> > comment 0.
> > 
> 
> I'm looking into the hosts from comment #4, these hosts are atomic host.
> https://bugzilla.redhat.com/show_bug.cgi?id=1617976 may be related to the
> problem on atomic host but I'm not sure. These hosts have very 
> 
> Johnny,
> 
> Is your host also atomic host too?

I tried both atomic host and rhel host, both can reproduce such issue. Glad to see you get my environment seen from the following comments. If need QE preserver some env for your more debug, let us know.

Comment 11 Scott Dodson 2018-08-27 15:57:47 UTC

I was trying to add the dbus send to the DS and I've discovered that the node cannot patch the pod object. This seems to be related to the fact that we've upgraded the kubelet ahead of the control plane on control plane hosts.

We should resolve that then reconsider whether we need to add the dbus-send to clear dnsmasq dbus config when the pod exits. I still think that's a good idea, if for some reason the sdn pod were to be removed it would leave dnsmasq in a sane working state.

Comment 12 Scott Dodson 2018-08-27 17:23:24 UTC

Mike's PR here removes the node upgrade, working on testing that.

https://github.com/openshift/openshift-ansible/pull/9758

Comment 13 liujia 2018-08-28 07:44:04 UTC

Remove testblocker because we did not hit this issue in v3.11.0-0.24.0

Comment 15 Weihua Meng 2018-08-29 12:14:20 UTC

fixed.
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 17 errata-xmlrpc 2018-10-11 07:24:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652