Description of problem:

During a cluster upgrade (to v4.5.11), several cluster DNS issues were identified in the insights cluster, the main one being extra latency added to intra-service API calls (~5 seconds of extra response time). While the issue was being experienced, the upgrade process was rolling out the new version to all 54 nodes in the cluster, cycling through each node and setting it to SchedulingDisabled and back to Ready. During that process, the dns and networking ClusterOperator CRs were flapping between true and false for the DEGRADED status.

Version-Release number of selected component (if applicable):
v4.5.11

How reproducible:
Always, during cluster upgrades.

Actual results:

1) During cluster upgrades, more precisely during the node rollout, an extra latency of 5 seconds was observed for in-cluster DNS resolution. Different name forms were tried without success, for example:

- http://xjoin-search.xjoin-prod.svc.cluster.local.:4000/graphql
- http://xjoin-search.xjoin-prod.svc:4000/graphql

2) The following message was also logged by the dns-operator pods:

time="2020-09-29T16:32:31Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736989186, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"ClusterIP assigned to DNS Service and minimum DaemonSet pods running\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736993789, loc:(*time.Location)(0x226ff00)}}, Reason:\"Reconciling\", Message:\"53 Nodes running a DaemonSet pod, want 54\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63728611853, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"Minimum number of Nodes running DaemonSet pod\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736989186, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"ClusterIP assigned to DNS Service and minimum DaemonSet pods running\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736993951, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"All expected Nodes running DaemonSet pod\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63728611853, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"Minimum number of Nodes running DaemonSet pod\"}}}"
What kernel version is running here? Given that you mention 5s, I wonder whether this is actually related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1763206
Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
Andrew, all nodes are running the following kernel version:

Linux version 4.18.0-193.23.1.el8_2.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020

The 5s latency is observed only during cluster upgrades; in other scenarios resolution responds normally. How could this be related to the BZ [1] you referenced?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1763206
(In reply to Rogerio Bastos from comment #3)
> Andrew,
> all nodes are running the following kernel version:
> Linux version 4.18.0-193.23.1.el8_2.x86_64
> (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121
> (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020
>
> The 5sec latency is observed only during cluster upgrades, in other
> scenarios it responds normally, how could this be related to the BZ [1] you
> referenced?

It is/was just some speculation, based on the fact that you specifically mentioned 5s and the referenced bug would specifically exhibit a 5s delay. Your kernel version should have the fix, and there may be zero correlation here.

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1763206
I've been able to reproduce the problem as follows.

1. In one terminal, choose an arbitrary DNS pod on an arbitrary node, and start sending queries to its address:

   oc -n openshift-dns get pods -o wide
   oc -n openshift-dns rsh -c dns ds/dns-default sh -c "while :; do dig @$address -p 5353 kubernetes.default.svc.cluster.local. A; sleep 1; done"

2. In a second terminal, watch the DNS endpoints:

   oc -n openshift-dns get endpoints -o yaml -w

3. In a third terminal, drain and reboot the node hosting the pod chosen in Step 1:

   oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0

The first terminal shows that DNS queries start timing out before the second terminal shows that the endpoints object is updated.

Here's what I believe is the problem:

• The daemonset tolerates the node.kubernetes.io/unschedulable taint (if the toleration isn't already specified on the daemonset, the daemonset controller adds it[1]).
• The endpoints controller only checks the pod's "Ready" status condition to determine whether the pod's address is ready[2].
• Consequently, the pod's address is still in the endpoints's list of ready addresses after the node's shutdown has caused the pod to become unresponsive.
• Only later does the endpoints controller remove the pod's endpoint.

I'm exploring options to improve this behavior, but this seems to be a fundamental issue with draining and daemonsets; the general issue has been raised upstream in the past but remains unresolved[3].

1. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/daemon/util/daemonset_util.go#L90-L94
2. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/endpoint/endpoints_controller.go#L585
3. https://github.com/kubernetes/kubernetes/issues/75482
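The key point is that the readiness decision ignores node state entirely. A minimal Go sketch of that decision, using simplified illustrative types rather than the actual Kubernetes code linked in [2]:

```go
package main

import "fmt"

// PodCondition mirrors the shape of a core/v1 pod status condition,
// simplified for illustration (not the real Kubernetes types).
type PodCondition struct {
	Type   string
	Status string
}

// isReadyAddress reflects the behavior described above: the endpoints
// controller consults only the pod's "Ready" condition, never the
// node's unschedulable taint, so a DNS pod on a node that is shutting
// down stays in the ready-addresses list until the kubelet flips the
// condition or the pod is deleted.
func isReadyAddress(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	// The node is already draining, but the pod status is stale, so
	// the address is still advertised as ready.
	stale := []PodCondition{{Type: "Ready", Status: "True"}}
	fmt.Println(isReadyAddress(stale)) // true
}
```

This is why the taint toleration matters: the drain never evicts the daemonset pod, so nothing flips the "Ready" condition before the node actually goes down.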
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
See the comment in https://github.com/openshift/cluster-dns-operator/pull/205#issue-500267814:

"These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: In testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node. It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, MCO, or something else."

I tested builds both with and without the fix but didn't observe significant improvement in latency during node reboot: I always see two "connection timed out; no servers could be reached" errors before the endpoints object moves the pod to "notReadyAddresses". The readiness probe has been updated to 8181/ready and terminationGracePeriodSeconds extended to 120s, but based on the above test results, I'd like to assign the BZ back for confirmation.

Hi @Miciah, if those are the results we expect, then please move the BZ back to ON_QA. Thanks.
Hmm. So one thing we could do perhaps is change the MCO's drain to actively delete at least some daemonset pods? Actually wait...this is triggering a memory, ah yes: https://github.com/openshift/machine-config-operator/pull/1491
*** Bug 1893360 has been marked as a duplicate of this bug. ***
To follow up on the timeline above:

oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0

This is close to what the MCO does today. One notable thing is that the shutdown should result in SIGTERM being sent to the active daemonset pods. However, I just noticed DNS doesn't use `hostNetwork: true` (it can't, I guess). That's going to mean it's dependent on the SDN, and we're going to be stopping those pods at the same time...so any attempt to handle SIGTERM by e.g. reaching out to the apiserver to mark the pod inactive is going to potentially race and fail. An ordering dependency on shutdown for the pods is messy...kubernetes isn't systemd ;) Perhaps the crude hack is just to make the SDN explicitly last.
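One way around the SDN race described above would be for a SIGTERM handler to fail readiness locally instead of calling out to the apiserver: the kubelet-driven probe runs on the node itself and needs no cluster networking. This is only a hedged sketch of that pattern, not what CoreDNS or the dns-operator actually ships (CoreDNS's own health/ready plugins serve a similar role):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// readyStatus is the HTTP status a local /ready probe endpoint would
// return: 503 once the pod is draining, 200 otherwise. Failing the
// kubelet-driven readiness probe requires no SDN connectivity, so it
// avoids racing with the SDN pods being stopped at the same time.
func readyStatus(draining bool) int {
	if draining {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	var draining atomic.Bool

	// On SIGTERM, flip local state only; the kubelet's next probe
	// sees 503 and marks the pod unready, and the endpoints
	// controller then drops its address.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		draining.Store(true)
	}()

	fmt.Println(readyStatus(draining.Load())) // 200: pod still serving

	// Simulate the drain delivering SIGTERM to this process.
	syscall.Kill(os.Getpid(), syscall.SIGTERM)
	time.Sleep(100 * time.Millisecond)

	fmt.Println(readyStatus(draining.Load())) // 503: probe now fails
}
```

Even then, the probe's periodSeconds and failureThreshold impose a window of several seconds before the endpoint is actually removed, which matches the residual timeouts observed in testing.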
OK, looking a bit more carefully, there's no SIGTERM handling going on here; it's just the plain old kube Service object. We need to delete the pod to update that. I propose that we merge https://github.com/openshift/machine-config-operator/pull/1491 and change this service to use it.
I thought more about this and concluded we need https://github.com/kubernetes/kubernetes/pull/96129
As the proposed PR was recently reverted (https://github.com/openshift/cluster-dns-operator/pull/213) adding UpcomingSprint to look into this again.
I hope to find time to test https://github.com/kubernetes/kubernetes/pull/96129 but have been focused on bug 1919737. This bug and bug 1919737 describe different scenarios (clean reboot versus node failure) but might be solved by the same fix. However, I will leave both bugs open for now.
Reproducing the specific problem that https://github.com/openshift/cluster-dns-operator/pull/237 addresses is difficult, as it involves a race condition and probably cannot be reproduced except under extreme load.

To verify this BZ, please verify that the DNS pods in the openshift-dns namespace have the readiness probe configured to use the "/ready" path:

    "readinessProbe": {
        "failureThreshold": 3,
        "httpGet": {
            "path": "/ready",
            "port": 8181,
            "scheme": "HTTP"
        },
        "initialDelaySeconds": 10,
        "periodSeconds": 3,
        "successThreshold": 1,
        "timeoutSeconds": 3
    },
Verified in 4.8.0-0.nightly-2021-03-10-053945

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-10-053945   True        False         67m     Cluster version is 4.8.0-0.nightly-2021-03-10-053945

$ oc -n openshift-dns get pod/dns-default-ddzx7 -oyaml
<---snip--->
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready   <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        port: 8181     <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 3
<---snip--->
Hi, does this bug require doc text? If so, please update the doc text field.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438