Bug 1919737 - hostname lookup delays when master node down
Summary: hostname lookup delays when master node down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1903451
Depends On:
Blocks: 1928304 1930913
Reported: 2021-01-25 01:10 UTC by Brendan Shirren
Modified: 2022-08-04 22:39 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:44 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 234 | 0 | None | closed | Bug 1919737: Set CoreDNS readiness probe period and timeout each to 3 seconds | 2021-02-19 18:58:06 UTC
Github | openshift sdn pull 254 | 0 | None | closed | Bug 1919737: Prefer local endpoint for cluster DNS service | 2021-02-23 03:39:55 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:37:12 UTC

Description Brendan Shirren 2021-01-25 01:10:42 UTC
Description of problem: Internal (svc.cluster.local) hostname lookups are delayed by up to 20 seconds when a master node becomes unavailable.


Version-Release number of selected component (if applicable): OCP v4.6


How reproducible: every time.


Steps to Reproduce:
1. rsh into testing pod
2. time internal svc.cluster.local hostname lookups in a repeating loop
3. reboot any master node

Actual results:

Shortly after the loss of a master node, DNS lookups are delayed by 5 to 20 seconds.

Expected results:

Internal DNS lookups continue to return within 10 seconds so that applications do not time out resolving hostnames (e.g. "UnknownHostException").

Additional info:

dns container:
~~~
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=10s timeout=10s period=10s #success=1 #failure=3
~~~


$ while (true); do time getent hosts elasticsearch.openshift-logging.svc.cluster.local; sleep 1; done

172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.038s
user	0m0.001s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.017s
user	0m0.001s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.035s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m20.026s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m13.286s
user	0m0.000s
sys	0m0.006s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m5.105s
user	0m0.002s
sys	0m0.005s

real	0m8.206s
user	0m0.003s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m5.201s
user	0m0.001s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m8.283s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m6.155s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.015s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.020s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.019s
user	0m0.001s
sys	0m0.004s

Comment 1 Miciah Dashiel Butler Masters 2021-01-25 04:21:02 UTC
This is likely the same issue as bug 1884053.  We use a daemonset to deploy CoreDNS, which provides the cluster DNS service, and we need https://github.com/kubernetes/kubernetes/pull/96129 for graceful termination of daemonsets on node shutdown.

Comment 6 Miciah Dashiel Butler Masters 2021-02-02 21:34:05 UTC
There are a couple issues related to node outages and DNS: 

1. Planned outage, where a node is deliberately shut down or rebooted and has the opportunity to drain pods.  

2. Unplanned outage, where a node suddenly fails or is disconnected.  

In both cases, the problem is that a DNS pod becomes unavailable, and the DNS service needs to stop forwarding connections to the unavailable pod.  This is generally achieved by deleting the pod or setting its status to "NotReady".  

Bug 1884053 addresses planned outages through changes in the kubelet to implement proper draining behavior on node shutdown.  

To address unplanned outages, we rely on readiness probes.  The DNS pod's readiness probe has the following parameters: 

        readinessProbe:
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
          timeoutSeconds: 10

That is, the kubelet sends a probe every 10 seconds, with a 10-second timeout, and if 3 probes fail in a row, then the pod is marked as "NotReady".  As a pod could fail immediately after a probe, and then the third probe after that must time out before the kubelet marks the pod as "NotReady", this means that it could theoretically take up to 3*10+10 seconds to observe that a DNS pod is unavailable.  We are looking at better parameter values to use for the readiness probe.  
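
The worst-case arithmetic above can be checked with a short sketch.  The function name `worst_case_detection_seconds` is made up for illustration, and the formula is only the one stated in this comment (failureThreshold * periodSeconds + timeoutSeconds); it is not taken from the kubelet source.

```python
def worst_case_detection_seconds(period_seconds, timeout_seconds, failure_threshold):
    """Upper bound on the time for the kubelet to mark a pod "NotReady".

    Assumes the pod fails immediately after a successful probe, so the
    first failing probe starts one full period later, and the last of the
    failure_threshold probes must run all the way to its timeout.
    """
    return failure_threshold * period_seconds + timeout_seconds


# Parameters quoted in this comment: period=10s, timeout=10s, failureThreshold=3.
print(worst_case_detection_seconds(10, 10, 3))  # prints 40
```

Applying the same formula to a 3-second period and timeout (the values named in the linked cluster-dns-operator pull 234) with the same failure threshold gives 3*3+3 = 12 seconds.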

However, we could have a complete node failure where the kubelet is unable to perform probes or update the pod's status.  It is not clear whether the readiness probe can help in this scenario.  We're currently investigating the scenario and possible solutions.

Comment 8 Miciah Dashiel Butler Masters 2021-02-05 03:53:06 UTC
To recap the most recent findings, the pod's readiness probe doesn't help in this scenario.  During normal operation, the node's kubelet sends readiness probes per the pod's configuration and updates the pod's status according to the responses.  In the case of complete node failure, the node's kubelet is no longer running, which means that nothing is sending readiness probes to the pod or updating its status.  Because nothing updates the pod's status, kube-proxy keeps forwarding some portion of DNS queries to the dead pod.  

Eventually, after about 30 to 40 seconds, Kubernetes detects that the node is unresponsive and updates the node's status to "NotReady".  At this point, the pod's "Ready" condition is immediately set to false, and the pod's endpoint is marked as not-ready, which causes kube-proxy to stop sending traffic to that pod.  During the period between the time when the node fails and the time when the node failure is detected, some DNS queries succeed, and some fail.  
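
As a rough illustration of "some DNS queries succeed, and some fail", the sketch below estimates the share of queries that land on a dead pod during that window.  It assumes that each query is forwarded to one of the service's endpoints uniformly at random and that retries are independent; the function names and the endpoint count of 6 are hypothetical and not taken from this report.

```python
from fractions import Fraction


def dead_endpoint_fraction(total_endpoints, dead_endpoints):
    """Fraction of queries forwarded to a dead pod under uniform random selection."""
    if total_endpoints <= 0 or not 0 <= dead_endpoints <= total_endpoints:
        raise ValueError("need 0 <= dead_endpoints <= total_endpoints and total_endpoints > 0")
    return Fraction(dead_endpoints, total_endpoints)


def all_attempts_fail_probability(total_endpoints, dead_endpoints, attempts):
    """Probability that every one of `attempts` independent tries hits a dead pod."""
    return dead_endpoint_fraction(total_endpoints, dead_endpoints) ** attempts


# Hypothetical cluster with 6 DNS endpoints, one of them on the failed node.
print(dead_endpoint_fraction(6, 1))
print(all_attempts_fail_probability(6, 1, 2))
```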

I have been looking into a couple possible solutions:

1. We could write a new controller that would repeatedly send queries to each DNS pod and, if a query timed out, patch the pod's "Ready" condition to be false.  If the node were alive, then the kubelet would revert the change.  If the node had failed, then the pod's status would continue to report that the pod was not ready, and the endpoints controller would remove the dead pod's IP address from the DNS endpoints, which would cause kube-proxy to stop forwarding queries to the dead pod.  This should reduce the period of interruption in DNS service to the time it would take for this new controller to detect failure (around 5 to 10 seconds) and patch the pod's status.  

2. On bare metal clusters, each node runs a CoreDNS pod on the node host network to provide name resolution of node host names.  It might be possible to configure workload pods to use this node-local DNS pod, and then configure the node-local DNS pod to forward to the cluster's DNS service.  

I'll continue looking into the above as possible solutions to preventing interruptions to DNS service in the case of node failure.
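
For illustration only, one pass of the controller described in option 1 might look like the sketch below.  No such controller exists in this report; `reconcile_dns_pods`, `probe`, and `mark_not_ready` are hypothetical names, and the probe and patch steps are passed in as callables rather than implemented against a real DNS client or the Kubernetes API.

```python
def reconcile_dns_pods(pod_ips, probe, mark_not_ready):
    """Run one pass of the hypothetical DNS health controller.

    pod_ips:        mapping of DNS pod name -> pod IP address.
    probe:          callable(ip) -> bool, True if a DNS query sent to that IP
                    was answered before the timeout.
    mark_not_ready: callable(pod_name), stands in for patching the pod's
                    "Ready" condition to false.
    Returns the names of the pods that were marked not ready.
    """
    marked = []
    for name, ip in sorted(pod_ips.items()):
        if not probe(ip):
            mark_not_ready(name)
            marked.append(name)
    return marked


# Made-up pod names and addresses; the second pod is treated as unreachable.
pods = {"dns-default-a": "10.128.0.10", "dns-default-b": "10.129.0.10"}
unreachable = {"10.129.0.10"}
patched = []
print(reconcile_dns_pods(pods, lambda ip: ip not in unreachable, patched.append))
```

Passing the probe and the patch operation as parameters keeps the decision logic separate from any real cluster access, so it can be exercised without a cluster.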

Comment 15 Andrew McDermott 2021-02-09 17:27:02 UTC
*** Bug 1903451 has been marked as a duplicate of this bug. ***

Comment 18 Miciah Dashiel Butler Masters 2021-02-11 03:06:21 UTC
I have identified a change in kube-proxy that can resolve the problem.  I was able to reproduce the DNS lookup failures during a node failure without the change, and I do not see any DNS lookup failures with the change applied.  This change will fix the problem for clusters that use openshift-sdn.  

(This change is an interim solution.  For a more general solution that works on both openshift-sdn and on ovn-kubernetes, we are looking at the "internal traffic policy" feature.  However, internal traffic policy is still in development, so we will go with the interim solution until the general solution is possible.)  
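
The linked openshift sdn pull request is titled "Prefer local endpoint for cluster DNS service".  The pull request itself is not reproduced in this report, so the following is only a sketch of what preferring a local endpoint could mean; `pick_dns_endpoint` and its parameters are hypothetical names, and the fallback to a random endpoint is an assumption.

```python
import random


def pick_dns_endpoint(endpoints, local_node, rng=random):
    """Choose a DNS pod IP, preferring a pod on the client's own node.

    endpoints:  list of (pod_ip, node_name) tuples for ready DNS pods.
    local_node: name of the node where the client is running.
    rng:        object with a choice() method (the random module by default).
    """
    if not endpoints:
        raise ValueError("no ready DNS endpoints")
    local = [ip for ip, node in endpoints if node == local_node]
    if local:
        # Assumption: a DNS pod on the same node is always preferred.
        return rng.choice(local)
    # Assumption: with no local DNS pod, fall back to any ready endpoint.
    return rng.choice([ip for ip, _ in endpoints])


# Made-up addresses and node names.
endpoints = [("10.128.0.10", "master-0"), ("10.129.0.10", "worker-1")]
print(pick_dns_endpoint(endpoints, "worker-1"))
```

Under this assumption, a client pod on a healthy node keeps using the DNS pod on its own node and is not sent to a DNS pod on a failed node.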

I understand that a solution is needed for 4.6.  Following our usual process, we'll need to make the change first in 4.8, and then backport it to 4.7.z and then to 4.6.z, so I am setting the target release for this BZ to 4.8.0, and we will open separate BZs for 4.7.z and 4.6.z as we go through the backport process.

Comment 19 Miciah Dashiel Butler Masters 2021-02-11 03:31:02 UTC
As an additional change, in order to reduce disruption if the node does not fail but the CoreDNS pod on the node becomes unresponsive, we will modify the CoreDNS readiness probe's parameters to detect failures more quickly.  This change is not strictly required for the kube-proxy change, but it will reduce the impact on the node if its local DNS pod were to fail.

Comment 22 Miciah Dashiel Butler Masters 2021-02-19 21:24:26 UTC
We are tracking the 4.7.z backport with bug 1930913 and the 4.6.z backport with bug 1930917.  Both backports are currently being tested.  The backports do not need to wait for 4.8 to be released; the backports only need to wait for the change to be verified on the 4.8, 4.7, and 4.6 development branches.

Comment 25 Hongan Li 2021-02-24 07:23:31 UTC
verified with 4.8.0-0.nightly-2021-02-24-021848 and passed.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-02-24-021848   True        False         34m     Cluster version is 4.8.0-0.nightly-2021-02-24-021848

### in first terminal, create a test pod and run below commands:
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
$ oc rsh hello-pod
/ # 
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s

<---snip--->

### open another terminal and reboot one of the master nodes

$ oc debug node/ci-ln-8gxix4t-002ac-dgbbf-master-2
Starting pod/ci-ln-8gxix4t-002ac-dgbbf-master-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot 

### check the first terminal and ensure there is no DNS lookup delay while the master is rebooting (about 3-5 min)
<---snip---> 
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<---snip--->

### check dns readiness probe has been changed as below:
$ oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3                  <------see https://github.com/openshift/cluster-dns-operator/pull/234
          successThreshold: 1
          timeoutSeconds: 3                 <------see https://github.com/openshift/cluster-dns-operator/pull/234

Comment 33 Brandi Munilla 2021-06-24 16:41:18 UTC
Hi, does this bug require doc text?

Comment 35 errata-xmlrpc 2021-07-27 22:36:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

