Description of problem:
This issue was already mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1616840#c5, but the fix for BZ#1616840 does not seem to help my case much; the issue still reproduces. Maybe https://bugzilla.redhat.com/show_bug.cgi?id=1616840#c11 is still needed.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
# rpm -q docker
docker-1.13.1-74.git6e3bb8e.el7.x86_64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)
# uname -r
3.10.0-862.11.6.el7.x86_64
# rpm -q container-selinux
container-selinux-2.68-1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run a fresh install; it completes successfully
2. Restart the docker service

Actual results:
The api static pod restarts again and again because its connection to etcd is broken.

master api log:
<--snip-->
I0831 15:23:05.588586       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu311-auto1-vuko-men-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0831 15:23:15.589398       1 start_api.go:68] context deadline exceeded

Expected results:
Everything should keep working after a docker restart.

Additional info:
Once the above failure happens, a dnsmasq restart fixes it.
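For anyone trying to reproduce this, the failure can be observed on the master roughly as follows (illustrative commands, not taken from this report; master-logs is the 3.11 static-pod log helper, and the getent check assumes the node's own resolv.conf also points at the local dnsmasq, as the origin dns setup does):

  # systemctl restart docker
  # master-logs api api 2>&1 | tail       <-- ends with the "context deadline exceeded" line quoted above
  # time getent hosts $(hostname)         <-- resolution of the node hostname hangs for many seconds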
Originally I had suggested that the sdn pod should nuke the dnsmasq config during restart. That would likely work if the `openshift start networking` process exited and the dbus-send were then fired off. But after a docker restart that leaves the API offline, I'm not sure it would work, since the sdn pod may not even be created when there's no API available. Moving to the edge networking team.
https://github.com/openshift/openshift-ansible/pull/9922 is a possible fix, though I couldn't reproduce via the method described in this bug; this is from previous testing I did as part of Bug 1616840.
Quick update here. The problem is name resolution of the node hostname is taking too long, causing the apiserver to fail to connect to etcd within the client timeout window. Resolution is taking too long because:

1. The apiserver container /etc/resolv.conf specifies the node IP for a nameserver (dnsmasq)
2. The apiserver container /etc/resolv.conf appends a cluster.local suffix to the unqualified hostname query because of ndots=5
3. dnsmasq delegates the cluster.local lookup to 127.0.0.1:53 (SkyDNS via the sdn pod, which isn't running yet)
4. The query to SkyDNS times out and dnsmasq ultimately services the query via the myhostname plugin specified in the nsswitch conf on the node, but by then (10-15s later) the apiserver etcd client timeout window has expired

Re-ordering 'myhostname' in nsswitch.conf before 'dns' and restarting dnsmasq unwedges everything, and while that's instructive for diagnosis, it's not yet clear changing nsswitch.conf is the right solution. Another solution Scott played around with was removing the SkyDNS upstream when we "know" the SDN pod isn't actually running. Still investigating.
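To make the resolution path above concrete, the relevant pieces of configuration look roughly like this; the nameserver IP and search domains are placeholder values, and the second nsswitch.conf line shows the re-ordering experiment described above:

  /etc/resolv.conf inside the apiserver container (illustrative):

    search default.svc.cluster.local svc.cluster.local cluster.local example.com
    nameserver 192.0.2.10        <-- node IP, where dnsmasq listens
    options ndots:5              <-- unqualified names get the search suffixes appended first

  hosts line in /etc/nsswitch.conf on the node:

    hosts: files dns myhostname      <-- default: the slow DNS path is consulted before myhostname
    hosts: files myhostname dns      <-- re-ordered: myhostname answers the node hostname immediately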
Another point which I think Scott has already articulated elsewhere:

1. During a fresh bootstrap (e.g. kill all containers, restart dnsmasq), the apiserver can resolve the hostname in time (dnsmasq doesn't contain an SDN upstream); the apiserver starts, sdn then starts and registers its upstream in dnsmasq, and everything works
2. Kill all containers, do not restart dnsmasq: the SDN upstream remains in dnsmasq
3. The apiserver times out trying to resolve the hostname through the now-dead SDN upstream via dnsmasq
4. sdn crashes, timing out resolving the hostname while looking up the node config

Restarting dnsmasq and then restarting the containers fixes it.
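For completeness, the recovery sequence described in that last line amounts to something like this on the affected node (illustrative; assumes systemd-managed dnsmasq and docker, as in this report):

  # systemctl restart dnsmasq     <-- drops the stale cluster.local upstream left behind by the old sdn pod
  # systemctl restart docker      <-- apiserver and sdn come back up with a working resolution path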
More updates:

1. This problem happens in the QE openstack setup only; I cannot reproduce it in AWS and OCP.
2. It is not just "systemctl restart docker" that causes this problem; "oc delete ds sdn" and "oc delete ds ovs" cause the same DNS resolution problem too.
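For reference, the daemonset deletions mentioned in point 2 would be something like the following (illustrative; in a default 3.11 install the sdn and ovs daemonsets live in the openshift-sdn namespace):

  # oc delete ds sdn -n openshift-sdn
  # oc delete ds ovs -n openshift-sdn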
Yeah, this issue is currently easy to reproduce in QE's openstack setup. I also agree that "systemctl restart docker" may not be the only way to reproduce this problem. In any case, this bug is effectively a test blocker for QE. QE has several test cases that require a "restart docker service" operation and then continue with other testing, and this bug causes a lot of trouble for them. QE also has test cases for master/node scale-up of a cluster running behind a proxy (which can only be run on QE's fully-controlled openstack), and I hit this issue during master/node scale-up as well. "Restarting dnsmasq and then restarting the containers fixes it" is not an acceptable workaround for me, because once this happens while a playbook is running, the whole playbook run is broken. I saw this was moved to 3.11.z; since this bug is a test blocker, I am setting the target release back to 3.11.0.
Taking a step back.

Does anything in this bug indicate that, in the event of a cold-restart scenario (cluster reboot / datacenter power loss / etc), the control plane wouldn't be able to access etcd?

Remember, when the SDN is down, service IPs don't work.

This is relevant because the SDN is installed via daemonsets. If, in recovery scenarios, SDN pods can't be scheduled until the SDN is up, then we're in trouble.
(In reply to Casey Callendrello from comment #10)
> Taking a step back.
> 
> Does anything in this bug indicate that, in the event of a cold-restart
> scenario (cluster reboot / datacenter power loss / etc), the control plane
> wouldn't be able to access etcd?
> 
> Remember, when the SDN is down, service IPs don't work.
> 
> This is relevant because the SDN is installed via daemonsets. If, in
> recovery scenarios, SDN pods can't be scheduled until the SDN is up, then
> we're in trouble.

Just to avoid leaving this hanging out there... in a cold boot scenario, dnsmasq contains no lingering dead SkyDNS upstream for cluster.local domains, so everything works. The bug manifests when bouncing containers on a long-running node without clearing the upstreams from dnsmasq prior to apiserver/sdn startup.

Ben had the good and simple idea to clear dnsmasq via dbus as part of the sdn container command script in the pod spec... I'll look into that.
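A minimal sketch of what that clearing step could look like in the sdn container command script, assuming dnsmasq is running with enable-dbus (as the origin node dnsmasq configuration does) and exposes the standard uk.org.thekelleys.dnsmasq DBus interface; calling SetDomainServers with an empty array should drop any per-domain upstreams (such as the dead SkyDNS entry) that were previously pushed over DBus:

  # Hypothetical cleanup call, not the merged fix: clear stale per-domain upstreams
  # before openshift-sdn starts and re-registers itself with dnsmasq.
  dbus-send --system --dest=uk.org.thekelleys.dnsmasq \
    /uk/org/thekelleys/dnsmasq uk.org.thekelleys.dnsmasq.SetDomainServers \
    array:string: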
Thanks to Clayton, we've discovered a rebase regression which removed dnsmasq cleanup code from the network startup:

https://github.com/openshift/origin/commit/564ee038cff669480ccecd7e999d3f41464b30ff#diff-429b1bdbc91c6c76ec749af2ea71fd04

Current theory is that if the interrupt handling is reimplemented, the deadlock can self-resolve as the sdn pod crashloops, **assuming the kubelet remains alive**. That is, given a docker restart, the issue should self-resolve, **provided the kubelet is not also restarted**.
(In reply to Dan Mace from comment #12)
> Thanks to Clayton, we've discovered a rebase regression which removed
> dnsmasq cleanup code from the network startup:
> 
> https://github.com/openshift/origin/commit/
> 564ee038cff669480ccecd7e999d3f41464b30ff#diff-
> 429b1bdbc91c6c76ec749af2ea71fd04
> 
> Current theory is that if the interrupt handling is reimplemented, the
> deadlock can self-resolve as the sdn pod crashloops, **assuming the kubelet
> remains alive**. That is, given a docker restart, the issue should self
> resolve, **provided the kubelet is not also restarted**.

Quick clarification here: it should also resolve the deadlock if the kubelet restarts, because docker stop will cleanly shut down the sdn pod, clearing dnsmasq and allowing the apiserver to restart absent the sdn pod.
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/2229ac581fa0587ade04c52f6ba587eae28f179b

Restore graceful shutdown of DNS server

Restore signal handling and graceful shutdown of DNS that was lost during rebase in commit 564ee038cff669480ccecd7e999d3f41464b30ff. Graceful shutdown is necessary to ensure that when OpenShift's DNS server shuts down, we remove it from the dnsmasq configuration.

This commit fixes bug 1624448.
https://bugzilla.redhat.com/show_bug.cgi?id=1624448
PR 21021 is merged into v3.11.10-1.
Verified this bug with atomic-openshift-3.11.9-1.git.0.2acf2da.el7_5 + openshift-ansible-3.11.9-1.git.0.63f7970.el7_5, and PASS.
No doctext needed because the defect both was introduced (commit 564ee038cff669480ccecd7e999d3f41464b30ff) and fixed (commit 4c6c0ebd65e1f6ded381b045cfcd2498e033a815) during the 3.11 cycle:

% git tag --contains=4c6c0ebd65e1f6ded381b045cfcd2498e033a815 \
    --sort=version:refname | head -n 1
v3.11.0

% git tag --contains=564ee038cff669480ccecd7e999d3f41464b30ff \
    --sort=version:refname | head -n 1
v3.11.0
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.