Description of problem:
Internal (svc.cluster.local) hostname lookups are delayed by up to 20 seconds when a master node becomes unavailable.

Version-Release number of selected component (if applicable):
OCP v4.6

How reproducible:
Every time.

Steps to Reproduce:
1. rsh into a testing pod
2. time internal svc.cluster.local hostname lookups in a repeating loop
3. reboot any master node

Actual results:
Shortly after the loss of the master node, DNS lookups are delayed by between 5 and 20 seconds.

Expected results:
Internal DNS lookups continue to return within 10 seconds so that applications do not time out resolving hostnames (e.g. "UnknownHostException").

Additional info:
dns container:
~~~
Liveness:   http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness:  http-get http://:8080/health delay=10s timeout=10s period=10s #success=1 #failure=3
~~~

~~~
$ while (true); do time getent hosts elasticsearch.openshift-logging.svc.cluster.local; sleep 1; done
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.038s
user	0m0.001s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.017s
user	0m0.001s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.035s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m20.026s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m13.286s
user	0m0.000s
sys	0m0.006s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m5.105s
user	0m0.002s
sys	0m0.005s

real	0m8.206s
user	0m0.003s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m5.201s
user	0m0.001s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m8.283s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m6.155s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.015s
user	0m0.000s
sys	0m0.005s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.020s
user	0m0.002s
sys	0m0.004s
172.30.197.59   elasticsearch.openshift-logging.svc.cluster.local

real	0m0.019s
user	0m0.001s
sys	0m0.004s
~~~
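The repro loop above uses `getent`; for reference, the same measurement can be sketched in Python (the hostname argument below is illustrative; any resolvable name works):

```python
import socket
import time

def timed_lookups(hostname, attempts=3, pause=0.0):
    """Time successive hostname lookups, mirroring the `getent hosts` loop
    from the reproduction steps. Returns a list of (seconds, addresses)."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            infos = socket.getaddrinfo(hostname, None)
            addrs = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            addrs = []  # a failed lookup, like the entry with no address above
        results.append((time.monotonic() - start, addrs))
        time.sleep(pause)
    return results

# Example:
# for secs, addrs in timed_lookups("elasticsearch.openshift-logging.svc.cluster.local"):
#     print(f"{secs:.3f}s {addrs}")
```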
This is likely the same issue as bug 1884053. We use a daemonset to deploy CoreDNS, which provides the cluster DNS service, and we need https://github.com/kubernetes/kubernetes/pull/96129 for graceful termination of daemonsets on node shutdown.
There are a couple of issues related to node outages and DNS:

1. Planned outage, where a node is deliberately shut down or rebooted and has the opportunity to drain pods.
2. Unplanned outage, where a node suddenly fails or is disconnected.

In both cases, the problem is that a DNS pod becomes unavailable, and the DNS service needs to stop forwarding connections to the unavailable pod. This is generally achieved by deleting the pod or setting its status to "NotReady".

Bug 1884053 addresses planned outages through changes in the kubelet to implement proper draining behavior on node shutdown. To address unplanned outages, we rely on readiness probes. The DNS pod's readiness probe has the following parameters:

~~~
readinessProbe:
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
  timeoutSeconds: 10
~~~

That is, the kubelet sends a probe every 10 seconds, with a 10-second timeout, and if 3 probes fail in a row, the pod is marked as "NotReady". Because a pod could fail immediately after a successful probe, and the third failing probe after that must still time out before the kubelet marks the pod as "NotReady", it could theoretically take up to 3*10+10 = 40 seconds to observe that a DNS pod is unavailable. We are looking at better parameter values to use for the readiness probe.

However, we could have a complete node failure in which the kubelet is unable to perform probes or update the pod's status. It is not clear whether the readiness probe can help in that scenario. We're currently investigating the scenario and possible solutions.
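The worst-case arithmetic above (3*10+10 seconds) can be made explicit. This is a back-of-the-envelope sketch of the probe schedule, not the kubelet's exact timing:

```python
def worst_case_detection_seconds(period, timeout, failure_threshold):
    """Worst case to mark a pod NotReady: the pod dies just after a
    successful probe, then failure_threshold probes start `period`
    seconds apart, and the last one must still run out its timeout."""
    return failure_threshold * period + timeout

# Readiness probe parameters from the DNS pod spec above:
print(worst_case_detection_seconds(period=10, timeout=10, failure_threshold=3))  # 40
```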
To recap the most recent findings: the pod's readiness probe doesn't help in this scenario. During normal operation, the node's kubelet sends readiness probes per the pod's configuration and updates the pod's status according to the responses. In the case of complete node failure, the node's kubelet is no longer running, which means nothing is sending readiness probes to the pod or updating its status. Because nothing updates the pod's status, kube-proxy keeps forwarding some portion of DNS queries to the dead pod.

Eventually, after about 30 to 40 seconds, Kubernetes detects that the node is unresponsive and updates the node's status to "NotReady". At that point, the pod's "Ready" condition is immediately set to false, and the pod's endpoint is marked as not-ready, which causes kube-proxy to stop sending traffic to that pod. During the period between the time when the node fails and the time when the failure is detected, some DNS queries succeed and some fail.

I have been looking into a couple of possible solutions:

1. We could write a new controller that repeatedly sends queries to each DNS pod and, if a query times out, patches the pod's "Ready" condition to false. If the node were alive, the kubelet would revert the change. If the node had failed, the pod's status would continue to report that the pod was not ready, and the endpoints controller would remove the dead pod's IP address from the DNS endpoints, which would cause kube-proxy to stop forwarding queries to the dead pod. This should reduce the period of interruption in DNS service to the time it would take for this new controller to detect the failure (around 5 to 10 seconds) and patch the pod's status.

2. On bare-metal clusters, each node runs a CoreDNS pod on the node host network to provide name resolution of node host names. It might be possible to configure workload pods to use this node-local DNS pod, and then configure the node-local DNS pod to forward to the cluster's DNS service.

I'll continue looking into the above as possible solutions to preventing interruptions to DNS service in the case of node failure.
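The health check in solution 1 could look something like the following minimal UDP DNS probe. The query name, timeout, and function names here are illustrative, and the actual controller logic (status patching, pod discovery) is not shown:

```python
import socket
import struct

def build_dns_query(qname, qtype=1, txid=0x1234):
    """Build a minimal DNS query packet (QTYPE 1 = A record, class IN)."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional records.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    question += struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS (IN)
    return header + question

def probe_dns_pod(pod_ip, timeout=5.0, qname="kubernetes.default.svc.cluster.local"):
    """Return True if the DNS pod answers within `timeout` seconds.
    A failure here is the signal the hypothetical controller would use
    to patch the pod's "Ready" condition to false."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_dns_query(qname), (pod_ip, 53))
        sock.recvfrom(512)
        return True
    except OSError:  # timeout or ICMP port-unreachable both count as failure
        return False
    finally:
        sock.close()
```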
*** Bug 1903451 has been marked as a duplicate of this bug. ***
I have identified a change in kube-proxy that can resolve the problem. I was able to reproduce the DNS lookup failures during a node failure without the change, and I do not see any DNS lookup failures with the change applied. This change will fix the problem for clusters that use openshift-sdn.

(This change is an interim solution. For a more general solution that works both on openshift-sdn and on ovn-kubernetes, we are looking at the "internal traffic policy" feature. However, internal traffic policy is still in development, so we will go with the interim solution until the general solution is possible.)

I understand that a solution is needed for 4.6. Following our usual process, we'll need to make the change first in 4.8 and then backport it to 4.7.z and then to 4.6.z, so I am setting the target release for this BZ to 4.8.0, and we will open separate BZs for 4.7.z and 4.6.z as we go through the backport process.
As an additional change, to reduce disruption when the node itself does not fail but the CoreDNS pod on it becomes unresponsive, we will modify the CoreDNS readiness probe's parameters to detect failures more quickly. This change is not strictly required for the kube-proxy change, but it will reduce the impact on the node if its local DNS pod were to fail.
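The effect of tightening the probe parameters can be quantified with the same worst-case arithmetic used earlier in this bug (the new values, periodSeconds=3 and timeoutSeconds=3, are the ones shown in the verification comment below):

```python
def worst_case_detection_seconds(period, timeout, failure_threshold):
    # failure_threshold probes spaced `period` seconds apart, plus the final timeout.
    return failure_threshold * period + timeout

old = worst_case_detection_seconds(period=10, timeout=10, failure_threshold=3)  # original parameters
new = worst_case_detection_seconds(period=3, timeout=3, failure_threshold=3)    # tightened parameters
print(old, new)  # 40 12
```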
We are tracking the 4.7.z backport with bug 1930913 and the 4.6.z backport with bug 1930917. Both backports are currently being tested. The backports do not need to wait for 4.8 to be released; the backports only need to wait for the change to be verified on the 4.8, 4.7, and 4.6 development branches.
Verified with 4.8.0-0.nightly-2021-02-24-021848 and passed.

~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-02-24-021848   True        False         34m     Cluster version is 4.8.0-0.nightly-2021-02-24-021848
~~~

### in the first terminal, create a test pod and run the commands below:
~~~
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s
<---snip--->
~~~

### open another terminal and reboot one of the master nodes:
~~~
$ oc debug node/ci-ln-8gxix4t-002ac-dgbbf-master-2
Starting pod/ci-ln-8gxix4t-002ac-dgbbf-master-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot
~~~

### check the first terminal and ensure there is no DNS lookup delay while the master is rebooting (about 3-5 min):
~~~
<---snip--->
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<---snip--->
~~~

### check that the dns readiness probe has been changed as below:
~~~
$ oc -n openshift-dns get ds/dns-default -oyaml
<---snip--->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3    <------ see https://github.com/openshift/cluster-dns-operator/pull/234
          successThreshold: 1
          timeoutSeconds: 3   <------ see https://github.com/openshift/cluster-dns-operator/pull/234
~~~
Hi, does this bug require doc text?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438