Bug 1884053
| Field | Value |
|---|---|
| Summary | cluster DNS experiencing disruptions during cluster upgrade in insights cluster |
| Product | OpenShift Container Platform |
| Component | Networking (sub component: DNS) |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.5 |
| Target Release | 4.8.0 |
| Target Milestone | --- |
| Keywords | ServiceDeliveryImpact |
| Reporter | Rogerio Bastos <rogbas> |
| Assignee | Miciah Dashiel Butler Masters <mmasters> |
| QA Contact | Melvin Joseph <mjoseph> |
| CC | aarapov, amcdermo, aos-bugs, bmcelvee, cmoore, hongli, jbeakley, jeder, jerzhang, kbsingh, kelly.brown1, lars.erhardt.extern, mapandey, mjoseph, mmasters, nraghava, pbergene, sdodson, walters, wking |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Last Closed | 2021-07-27 22:33:30 UTC |
| Clones | 1933859, 1937089 (view as bug list) |
| Bug Blocks | 1933859, 1937089 |
Description
Rogerio Bastos
2020-09-30 20:48:42 UTC
What kernel version is running here? Given that you mention 5s, I wonder whether it is actually related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1763206

Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

Andrew, all nodes are running the following kernel version:

```
Linux version 4.18.0-193.23.1.el8_2.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020
```

The 5s latency is observed only during cluster upgrades; in other scenarios DNS responds normally. How could this be related to the BZ [1] you referenced?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1763206

(In reply to Rogerio Bastos from comment #3)
> Andrew, all nodes are running the following kernel version:
> Linux version 4.18.0-193.23.1.el8_2.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020
>
> The 5s latency is observed only during cluster upgrades; in other scenarios
> DNS responds normally. How could this be related to the BZ [1] you referenced?

It is/was just some speculation based on the fact that you specifically mentioned 5s and the referenced bug would specifically exhibit a 5s delay. Your kernel version should have the fix, and there may be zero correlation here.

I've been able to reproduce the problem as follows.

1. In one terminal, choose an arbitrary DNS pod on an arbitrary node, and start sending queries to its address:

   ```
   oc -n openshift-dns get pods -o wide
   oc -n openshift-dns rsh -c dns ds/dns-default sh -c "while :; do dig @$address -p 5353 kubernetes.default.svc.cluster.local. A; sleep 1; done"
   ```

2. In a second terminal, watch the DNS endpoints:

   ```
   oc -n openshift-dns get endpoints -o yaml -w
   ```

3. In a third terminal, drain and reboot the node hosting the pod chosen in step 1:

   ```
   oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0
   ```

The first terminal shows that DNS queries start timing out before the second terminal shows that the endpoints object is updated.

Here's what I believe is the problem:

• The daemonset tolerates the node.kubernetes.io/unschedulable taint (if the toleration isn't already specified on the daemonset, the daemonset controller adds it[1]).
• The endpoints controller only checks the pod's "Ready" status condition to determine whether the pod's address is ready[2].
• Consequently, the pod's address remains in the endpoints object's list of ready addresses even after the node's shutdown has caused the pod to become unresponsive.
• Only subsequently does the endpoints controller remove the pod's address.

I'm exploring options to improve this behavior, but this seems to be a fundamental issue with draining and daemonsets; the general issue has been raised in the past upstream but remains unresolved[3].

1. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/daemon/util/daemonset_util.go#L90-L94
2. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/endpoint/endpoints_controller.go#L585
3. https://github.com/kubernetes/kubernetes/issues/75482

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

See the comment in https://github.com/openshift/cluster-dns-operator/pull/205#issue-500267814: "These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: In testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node.
It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, the MCO, or something else."

Actually, I tested builds both with and without the fix but didn't observe significant improvements in the latency during node reboot. I always see two error messages, "connection timed out; no servers could be reached", before the endpoints object moves the pod to "notReadyAddresses". The readiness probe has been updated to 8181/ready and terminationGracePeriodSeconds extended to 120s, but based on the above test results, I'd like to assign the BZ back for confirmation. Hi @Miciah, if these are the results we expect, then please move the BZ back to ON_QA. Thanks.

Hmm. So one thing we could do perhaps is change the MCO's drain to actively delete at least some daemonset pods? Actually wait...this is triggering a memory, ah yes: https://github.com/openshift/machine-config-operator/pull/1491

*** Bug 1893360 has been marked as a duplicate of this bug. ***

To follow up on the timeline above:

```
oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0
```

This is close to what the MCO does today. One notable thing is that the shutdown should result in SIGTERM to the active daemonset pods. However, I just noticed DNS doesn't use `hostNetwork: true` (it can't, I guess). That's going to mean it's dependent on the SDN, and we're going to be stopping those at the same time...so any attempt to handle SIGTERM by e.g. reaching out to the apiserver to mark the pod inactive is going to potentially race and fail. An ordering dependency on shutdown for the pods is messy...kubernetes isn't systemd ;) Perhaps the crude hack is just to make the SDN explicitly last.

OK, looking a bit more carefully, there's no SIGTERM handling going on here - it's just about the plain old kube Service object installed. We need to delete the pod to update that.
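The failure mode analyzed above — the endpoints controller trusting a stale "Ready" condition for a DaemonSet pod that survived the drain — can be modeled with a short Python sketch. This is an illustrative simplification, not the actual kube-controller-manager code; the pod name and IP are made up:

```python
# Simplified model: `oc adm drain --ignore-daemonsets` leaves DaemonSet pods
# in place (the daemonset controller adds the unschedulable toleration), and
# the endpoints controller decides readiness solely from the pod's "Ready"
# condition, ignoring the state of the hosting node.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    ip: str
    ready_condition: bool          # the pod's "Ready" status condition
    node_unschedulable: bool       # is the hosting node cordoned/draining?
    tolerations: set = field(default_factory=set)

def survives_drain(pod: Pod) -> bool:
    # DaemonSet pods tolerate the taint, so the drain does not evict them.
    return "node.kubernetes.io/unschedulable" in pod.tolerations

def is_ready_address(pod: Pod) -> bool:
    # The endpoints controller looks ONLY at the Ready condition; the node
    # being seconds away from shutdown does not factor in.
    return pod.ready_condition

dns_pod = Pod("dns-default-xyz", "10.128.0.5", ready_condition=True,
              node_unschedulable=True,
              tolerations={"node.kubernetes.io/unschedulable"})

# The pod survives the drain AND its address is still listed as ready:
print(survives_drain(dns_pod), is_ready_address(dns_pod))  # True True
```

The model shows why queries time out before the endpoints object updates: nothing in the readiness decision reflects the imminent node shutdown until probes actually start failing.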
I propose that we merge https://github.com/openshift/machine-config-operator/pull/1491 and change this service to use it.

I thought more about this and concluded we need https://github.com/kubernetes/kubernetes/pull/96129.

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

As the proposed PR was recently reverted (https://github.com/openshift/cluster-dns-operator/pull/213), adding UpcomingSprint to look into this again. I hope to find time to test https://github.com/kubernetes/kubernetes/pull/96129 but have been focused on bug 1919737. This bug and bug 1919737 describe different scenarios (clean reboot versus node failure) but might be solved by the same fix. However, I will leave both bugs open for now.

Reproducing the specific problem that https://github.com/openshift/cluster-dns-operator/pull/237 addresses is difficult, as it involves a race condition and probably cannot be reproduced except under extreme loads.
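A generic pattern behind the graceful-termination discussion in these comments (CoreDNS's health plugin calls a related option "lameduck") is: on SIGTERM, stop reporting ready first, then keep serving for a grace window so readiness probes fail and the endpoints controller drops the address before the process exits. The sketch below illustrates the pattern only; the handler and the 5-second window are assumptions, not taken from any of the PRs above:

```python
# Sketch of the lameduck pattern: fail readiness immediately on SIGTERM,
# but keep answering queries for a grace window before exiting.
import signal
import sys
import time

LAMEDUCK_SECONDS = 5  # illustrative grace window, not a value from the PRs

shutting_down = False

def ready() -> bool:
    # What a /ready-style readiness handler would return.
    return not shutting_down

def handle_sigterm(signum, frame):
    global shutting_down
    # Step 1: stop reporting ready, so the next readiness probe fails and
    # the endpoints controller can remove this address.
    shutting_down = True
    # Step 2: keep serving in-flight and newly arriving queries for the
    # grace window instead of dying immediately. (Sleeping in a signal
    # handler is crude; a real server would drain its listeners instead.)
    time.sleep(LAMEDUCK_SECONDS)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```

Note the caveat raised earlier in this bug: if the SDN is torn down at the same moment, any SIGTERM-time call out to the apiserver can still race and fail, so this pattern mitigates but does not fully solve the drain-time window.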
To verify this BZ, please verify that the DNS pods in the openshift-dns namespace have the readiness probe configured to use the "/ready" path:

```
"readinessProbe": {
  "failureThreshold": 3,
  "httpGet": {
    "path": "/ready",
    "port": 8181,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 10,
  "periodSeconds": 3,
  "successThreshold": 1,
  "timeoutSeconds": 3
},
```

Verified in 4.8.0-0.nightly-2021-03-10-053945:

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-10-053945   True        False         67m     Cluster version is 4.8.0-0.nightly-2021-03-10-053945

$ oc -n openshift-dns get pod/dns-default-ddzx7 -oyaml
<---snip--->
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready   <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        port: 8181     <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 3
<---snip--->
```

Hi, does this bug require doc text? If so, please update the doc text field.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
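The manual verification step above can also be automated. The helper below takes a pod document (e.g. from `oc -n openshift-dns get pod/<name> -o json`) and checks that the dns container's readiness probe points at /ready on port 8181; the function name and the fabricated sample pod are illustrative, not part of any tooling referenced in this bug:

```python
# Check a pod spec for the readiness probe configuration this bug's fix
# introduced: httpGet against /ready on port 8181 in the "dns" container.
import json

def dns_probe_ok(pod_json: str) -> bool:
    pod = json.loads(pod_json)
    for container in pod["spec"]["containers"]:
        if container["name"] == "dns":
            probe = container.get("readinessProbe", {}).get("httpGet", {})
            return probe.get("path") == "/ready" and probe.get("port") == 8181
    return False  # no "dns" container found

# Minimal fabricated pod document for illustration:
sample = json.dumps({"spec": {"containers": [{
    "name": "dns",
    "readinessProbe": {"httpGet": {"path": "/ready", "port": 8181,
                                   "scheme": "HTTP"}}}]}})
print(dns_probe_ok(sample))  # True
```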