Bug 1624448 - master static pod restarts again and again because the etcd connection is broken after a docker restart
Summary: master static pod restarts again and again because the etcd connection is broken...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1623145
 
Reported: 2018-08-31 15:40 UTC by Johnny Liu
Modified: 2022-08-04 22:20 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-21 15:23:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 21021 0 'None' closed [release-3.11] Restore graceful shutdown of DNS server 2021-01-14 01:13:27 UTC
Origin (Github) 21009 0 None None None 2018-09-18 17:02:50 UTC

Description Johnny Liu 2018-08-31 15:40:44 UTC
Description of problem:
This issue was already mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1616840#c5, but the fix for BZ#1616840 does not seem to help in my case; the issue still reproduces.
Maybe https://bugzilla.redhat.com/show_bug.cgi?id=1616840#c11 is still needed.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
# rpm -q docker
docker-1.13.1-74.git6e3bb8e.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)
# uname -r
3.10.0-862.11.6.el7.x86_64
# rpm -q container-selinux
container-selinux-2.68-1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Run a fresh install; it completes successfully
2. Restart the docker service (see the sketch below)
3.
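
For reference, a hedged sketch of the reproduction and observation commands (the k8s_api container-name filter and the master-logs helper installed by openshift-ansible on 3.10+ masters are assumptions):

    # on a master node of the freshly installed cluster
    systemctl restart docker

    # watch the api static-pod container crashloop
    docker ps -a --filter name=k8s_api | head

    # inspect its logs for the etcd "context deadline exceeded" failure
    /usr/local/bin/master-logs api api 2>&1 | tail -n 20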

Actual results:
The api static pod restarts again and again because the connection to etcd is broken.
master api log:
<--snip-->
I0831 15:23:05.588586       1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu311-auto1-vuko-men-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0831 15:23:15.589398       1 start_api.go:68] context deadline exceeded


Expected results:
Everything should be working fine after a docker restart

Additional info:
Once the above failure has happened, a dnsmasq restart fixes it.
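
A minimal sketch of confirming the symptom and applying that workaround, assuming <node-ip> is the affected master's own IP (where dnsmasq listens; 127.0.0.1:53 is SkyDNS):

    # the cluster.local lookup hangs because dnsmasq delegates it to the dead SkyDNS
    dig +time=2 $(hostname).cluster.local @<node-ip>

    # workaround: bounce dnsmasq so the stale cluster.local upstream is dropped
    systemctl restart dnsmasq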

Comment 1 Scott Dodson 2018-08-31 16:14:16 UTC
Originally I had suggested that the sdn pod should nuke the dnsmasq config during restart. That would likely work if the `openshift start networking` process exited and the dbus-send were then fired off. But after a docker restart that leaves the API offline, I'm not sure it would work in this situation, where the sdn pod may not even be created because there's no API available.
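
For illustration, dnsmasq's DBus interface (uk.org.thekelleys.dnsmasq) is what the node/sdn side uses to register the cluster.local upstream; a hedged sketch of what the register and "nuke" calls could look like (if your dbus-send rejects an empty array, re-send the list without the cluster.local entry instead):

    # register: forward cluster domains to the local SkyDNS (roughly what the sdn does at startup)
    dbus-send --system --type=method_call \
      --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq \
      uk.org.thekelleys.dnsmasq.SetDomainServers \
      array:string:"/in-addr.arpa/127.0.0.1","/cluster.local/127.0.0.1"

    # nuke: send an empty list so dnsmasq stops delegating cluster.local to a dead SkyDNS
    dbus-send --system --type=method_call \
      --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq \
      uk.org.thekelleys.dnsmasq.SetDomainServers \
      array:string: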

Moving to edge networking team.

Comment 2 Scott Dodson 2018-09-05 12:41:00 UTC
https://github.com/openshift/openshift-ansible/pull/9922 is a possible fix, though I couldn't reproduce via the method described in this bug; this is from previous testing I did as part of Bug 1616840.

Comment 6 Dan Mace 2018-09-11 19:13:29 UTC
Quick update here.

The problem is that name resolution of the node hostname is taking too long, causing the apiserver to fail to connect to etcd within the client timeout window. Resolution is taking too long because:

1. The apiserver container /etc/resolv.conf specifies the node IP for a nameserver (dnsmasq)
2. The apiserver container /etc/resolv.conf appends a cluster.local suffix to the unqualified hostname query because of ndots=5
3. dnsmasq delegates the cluster.local lookup to 127.0.0.1:53 (skyDNS via sdn pod, which isn't running yet)
4. The query to skydns times out and dnsmasq ultimately services the query via the myhostname plugin specified in the nsswitch conf on the node, but by then (10-15s later) the apiserver etcd client timeout window has expired

Re-ordering 'myhostname' in nsswitch.conf before 'dns' and restarting dnsmasq unwedges everything, and while that's instructive for diagnosis, it's not yet clear changing nsswitch.conf is the right solution.
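
For context, a hedged illustration of the configuration involved (exact contents vary per cluster) and the diagnostic-only nsswitch.conf reordering described above:

    # inside the apiserver container: resolv.conf points at the node's dnsmasq and uses ndots:5,
    # so the unqualified node hostname is first tried with the cluster.local search suffix
    cat /etc/resolv.conf
    #   nameserver <node-ip>        (dnsmasq on the node)
    #   search cluster.local ...
    #   options ndots:5

    # on the node, for diagnosis only: let myhostname answer before dns in the hosts line
    sed -i 's/^hosts:.*/hosts:      files myhostname dns/' /etc/nsswitch.conf
    systemctl restart dnsmasq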

Another solution Scott played around with was removing the SkyDNS upstream when we "know" the SDN pod isn't actually running.

Still investigating.

Comment 7 Dan Mace 2018-09-11 20:14:40 UTC
Another point which I think Scott has already articulated elsewhere:

1. During a fresh bootstrap (e.g. kill all containers, restart dnsmasq), the apiserver can resolve the hostname in time (dnsmasq doesn't contain an SDN upstream); the apiserver starts, the sdn then starts and registers its upstream in dnsmasq, and everything works
2. Kill all containers but do not restart dnsmasq: the SDN upstream remains in dnsmasq
3. The apiserver times out trying to resolve the hostname through the now-dead SDN upstream via dnsmasq
4. The sdn crashes after timing out while resolving the hostname during node config lookup

Restarting dnsmasq and then restarting the containers fixes it.
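
A short sketch of that recovery ordering on an affected master (clear dnsmasq first, then bring the containers back):

    systemctl restart dnsmasq && systemctl restart docker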

Comment 8 Weibin Liang 2018-09-12 17:08:38 UTC
More updates:

1. This problem happens only in the QE openstack setup; it cannot be reproduced in AWS and OCP.

2. "systemctl restart docker" is not the only trigger; "oc delete ds sdn" and "oc delete ds ovs" cause the same DNS resolution problem too (see the sketch below).
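
For reference, a sketch of that alternative reproducer (the openshift-sdn namespace and the sdn/ovs daemonset names are assumptions about a default 3.11 install):

    # deleting the daemonsets tears down the sdn pods but leaves the stale
    # cluster.local upstream registered in dnsmasq, triggering the same hang
    oc -n openshift-sdn delete ds sdn ovs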

Comment 9 Johnny Liu 2018-09-13 01:50:05 UTC
Yeah, this issue is currently easy to reproduce in QE's openstack setup. I also agree that "systemctl restart docker" may not be the only way to reproduce this problem. In any case, this bug is effectively a test blocker for QE's testing.

QE has several test cases that need the "restart docker service" operation and then continue with other testing. This bug causes a lot of trouble for those tests.

QE also has some test cases for master/node scale-up of a cluster running behind a proxy (which can only be run on QE's fully-controlled openstack); I hit the same issue when doing master/node scale-up.

"Restarting dnsmasq and then restarting the containers fixes it" is not an acceptable workaround for me, because once this happens while the playbook is running, the whole playbook run breaks.

I saw this was moved to 3.11.z; because this bug is a test blocker, I am setting the target release back to 3.11.0.

Comment 10 Casey Callendrello 2018-09-13 12:25:38 UTC
Taking a step back.

Does anything in this bug indicate that, in the event of a cold-restart scenario (cluster reboot / datacenter power loss / etc), the control plane wouldn't be able to access etcd?

Remember, when the SDN is down, service IPs don't work.

This is relevant because the SDN is installed via daemonsets. If, in recovery scenarios, SDN pods can't be scheduled until the SDN is up, then we're in trouble.

Comment 11 Dan Mace 2018-09-13 14:14:33 UTC
(In reply to Casey Callendrello from comment #10)
> Taking a step back.
> 
> Does anything in this bug indicate that, in the event of a cold-restart
> scenario (cluster reboot / datacenter power loss / etc), the control plane
> wouldn't be able to access etcd?
> 
> Remember, when the SDN is down, service IPs don't work.
> 
> This is relevant because the SDN is installed via daemonsets. If, in
> recovery scenarios, SDN pods can't be scheduled until the SDN is up, then
> we're in trouble.

Just to avoid leaving this hanging out there... in a cold boot scenario, dnsmasq contains no lingering dead skydns upstream for cluster.local domains, so everything works. The bug manifests when bouncing containers on a long running node without clearing the upstreams from dnsmasq prior to apiserver/sdn startup.

Ben had the good and simple idea to clear dnsmasq via dbus as part of the sdn container command script in the pod spec... I'll look into that.
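
A minimal sketch of what that could look like in the sdn daemonset's container command (illustrative only; the command name and flags are taken loosely from the comments above, and the actual fix landed in the origin code instead, see comment 14):

    #!/bin/bash
    # on termination, clear dnsmasq's cluster upstreams over DBus so the apiserver
    # isn't left resolving through a dead SkyDNS
    clear_dnsmasq() {
      dbus-send --system --dest=uk.org.thekelleys.dnsmasq \
        /uk/org/thekelleys/dnsmasq uk.org.thekelleys.dnsmasq.SetDomainServers \
        array:string:
    }
    trap 'kill $pid; clear_dnsmasq' TERM INT

    # run the existing network process in the background so the trap can fire
    openshift start network --config=/etc/origin/node/node-config.yaml &
    pid=$!
    wait $pid
    clear_dnsmasq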

Comment 12 Dan Mace 2018-09-13 15:40:13 UTC
Thanks to Clayton, we've discovered a rebase regression which removed dnsmasq cleanup code from the network startup:

https://github.com/openshift/origin/commit/564ee038cff669480ccecd7e999d3f41464b30ff#diff-429b1bdbc91c6c76ec749af2ea71fd04

Current theory is that if the interrupt handling is reimplemented, the deadlock can self-resolve as the sdn pod crashloops, **assuming the kubelet remains alive**. That is, given a docker restart, the issue should self-resolve, **provided the kubelet is not also restarted**.

Comment 13 Dan Mace 2018-09-13 15:47:07 UTC
(In reply to Dan Mace from comment #12)
> Thanks to Clayton, we've discovered a rebase regression which removed
> dnsmasq cleanup code from the network startup:
> 
> https://github.com/openshift/origin/commit/
> 564ee038cff669480ccecd7e999d3f41464b30ff#diff-
> 429b1bdbc91c6c76ec749af2ea71fd04
> 
> Current theory is that if the interrupt handling is reimplemented, the
> deadlock can self-resolve as the sdn pod crashloops, **assuming the kubelet
> remains alive**. That is, given a docker restart, the issue should self
> resolve, **provided the kubelet is not also restarted**.

Quick clarification here, it should also resolve the deadlock if the kubelet restarts, because docker stop will cleanly shut down the sdn pod, clearing dnsmasq and allowing the apiserver to restart absent the sdn pod.

Comment 14 openshift-github-bot 2018-09-18 12:28:34 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/2229ac581fa0587ade04c52f6ba587eae28f179b
Restore graceful shutdown of DNS server

Restore signal handling and graceful shutdown of DNS that had been added in
commit 564ee038cff669480ccecd7e999d3f41464b30ff but was subsequently lost
during rebase in commit 564ee038cff669480ccecd7e999d3f41464b30ff.

Graceful shutdown is necessary to ensure that when OpenShift's DNS server
shuts down, we remove it from the dnsmasq configuration.

This commit fixes bug 1624448.

https://bugzilla.redhat.com/show_bug.cgi?id=1624448

Comment 15 Wei Sun 2018-09-20 01:50:43 UTC
PR 21021 was merged into v3.11.10-1.

Comment 16 Johnny Liu 2018-09-20 06:06:25 UTC
Verified this bug with atomic-openshift-3.11.9-1.git.0.2acf2da.el7_5 + openshift-ansible-3.11.9-1.git.0.63f7970.el7_5, and PASS.

Comment 17 Miciah Dashiel Butler Masters 2018-11-23 18:01:24 UTC
No doctext is needed because the defect was both introduced (commit 564ee038cff669480ccecd7e999d3f41464b30ff) and fixed (commit 4c6c0ebd65e1f6ded381b045cfcd2498e033a815) during the 3.11 cycle:

    % git tag --contains=4c6c0ebd65e1f6ded381b045cfcd2498e033a815 \
    --sort=version:refname | head -n 1
    v3.11.0
    % git tag --contains=564ee038cff669480ccecd7e999d3f41464b30ff \
    --sort=version:refname | head -n 1
    v3.11.0

Comment 18 Luke Meyer 2018-12-21 15:23:08 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

