Bug 1884053

Summary: cluster DNS experiencing disruptions during cluster upgrade in insights cluster
Product: OpenShift Container Platform
Reporter: Rogerio Bastos <rogbas>
Component: Networking
Sub component: DNS
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Melvin Joseph <mjoseph>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aarapov, amcdermo, aos-bugs, bmcelvee, cmoore, hongli, jbeakley, jeder, jerzhang, kbsingh, kelly.brown1, lars.erhardt.extern, mapandey, mjoseph, mmasters, nraghava, pbergene, sdodson, walters, wking
Version: 4.5
Keywords: ServiceDeliveryImpact
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-07-27 22:33:30 UTC
Type: Bug
Bug Blocks: 1933859, 1937089    

Description Rogerio Bastos 2020-09-30 20:48:42 UTC
Description of problem:
During a cluster upgrade (to v4.5.11), several cluster DNS issues were identified on the insights cluster, the main one being extra latency added to intra-service API calls (~5 seconds of extra response time).

While the issue was being observed, the upgrade process was rolling the latest version out to all 54 nodes in the cluster, cycling through each node by setting it to SchedulingDisabled and then back to Ready.

During that process, the dns and networking ClusterOperator CRs were flapping between True and False for the DEGRADED condition.
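
For reference, one way to observe this flapping during a rollout (a minimal sketch, assuming cluster-admin access) is to watch the ClusterOperator conditions:

    oc get clusteroperator dns -w        # repeat with "network" in a second terminal
    oc get clusteroperator dns -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'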

Version-Release number of selected component (if applicable):
v4.5.11

How reproducible:
Always, during cluster upgrades.


Actual results:
1) During cluster upgrades, more precisely during node rollout, an extra latency of 5 seconds was observed for in-cluster DNS resolution. Different DNS names were tried without success, for example (a measurement sketch follows the log message below):

- http://xjoin-search.xjoin-prod.svc.cluster.local.:4000/graphql
- http://xjoin-search.xjoin-prod.svc:4000/graphql

2) Also, the following message was noticed in the dns-operator pod logs:

time="2020-09-29T16:32:31Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736989186, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"ClusterIP assigned to DNS Service and minimum DaemonSet pods running\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736993789, loc:(*time.Location)(0x226ff00)}}, Reason:\"Reconciling\", Message:\"53 Nodes running a DaemonSet pod, want 54\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63728611853, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"Minimum number of Nodes running DaemonSet pod\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736989186, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"ClusterIP assigned to DNS Service and minimum DaemonSet pods running\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63736993951, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"All expected Nodes running DaemonSet pod\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63728611853, loc:(*time.Location)(0x226ff00)}}, Reason:\"AsExpected\", Message:\"Minimum number of Nodes running DaemonSet pod\"}}}"

Comment 1 Andrew McDermott 2020-10-01 16:11:35 UTC
What kernel version is running here? Given that you mention 5s, I wonder if it is actually related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1763206

Comment 2 Andrew McDermott 2020-10-01 16:13:59 UTC
Target set to next release version while investigation is either
ongoing or pending. Will be considered for earlier release versions
when diagnosed and resolved.

Comment 3 Rogerio Bastos 2020-10-01 16:21:18 UTC
Andrew, 
    all nodes are running the following kernel version:
Linux version 4.18.0-193.23.1.el8_2.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020

The 5-second latency is observed only during cluster upgrades; in other scenarios DNS responds normally. How could this be related to the BZ [1] you referenced?


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1763206

Comment 4 Andrew McDermott 2020-10-01 16:49:45 UTC
(In reply to Rogerio Bastos from comment #3)
> Andrew, 
>     all nodes are running the following kernel version:
> Linux version 4.18.0-193.23.1.el8_2.x86_64
> (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121
> (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Sep 17 15:34:16 EDT 2020
> 
> The 5sec latency is observed only during cluster upgrades, in other
> scenarios it responds normally, how could this be related to the BZ [1] you
> referenced?

It is/was just some speculation based on the fact you specifically mentioned 5s
and the referenced bug would specifically exhibit a 5s delay. Your kernel
version should have the fix and there may be zero correlation here.

> 
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1763206

Comment 5 Miciah Dashiel Butler Masters 2020-10-08 18:03:24 UTC
I've been able to reproduce the problem as follows.

1. In one terminal, choose an arbitrary DNS pod on an arbitrary node, and start sending queries to its address.

    oc -n openshift-dns get pods -o wide
    oc -n openshift-dns rsh -c dns ds/dns-default sh -c "while :; do dig @$address -p 5353 kubernetes.default.svc.cluster.local. A; sleep 1; done"

2. In a second terminal, watch the DNS endpoints.

    oc -n openshift-dns get endpoints -o yaml -w

3. In a third terminal, drain and reboot the node hosting the pod chosen in Step 1:

    oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0

The first terminal shows that DNS queries start timing out before the second terminal shows that the endpoints object has been updated.
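
For completeness, the $address and $node variables used in Steps 1 and 3 could be populated like this (a sketch only; it simply picks the first pod listed in the openshift-dns namespace, assumed here to be a dns-default pod):

    pod=$(oc -n openshift-dns get pods -o jsonpath='{.items[0].metadata.name}')
    address=$(oc -n openshift-dns get pod "$pod" -o jsonpath='{.status.podIP}')
    node=$(oc -n openshift-dns get pod "$pod" -o jsonpath='{.spec.nodeName}')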

Here's what I believe is the problem:

• The daemonset tolerates the node.kubernetes.io/unschedulable taint (if the toleration isn't already specified on the daemonset, the daemonset controller adds it[1]).

• The endpoints controller only checks the pod's "Ready" status condition to determine whether the pod's address is ready[2].

• Consequently, the pod's address remains in the endpoints object's list of ready addresses even after the node's shutdown has caused the pod to become unresponsive.

• Only subsequently does the endpoints controller remove the pod's address from the endpoints object.

I'm exploring options to improve this behavior, but this seems to be a fundamental issue with draining and daemonsets; the general issue has been raised upstream in the past but remains unresolved[3]. (Quick spot checks for the toleration and the endpoints addresses are sketched after the references below.)

1. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/daemon/util/daemonset_util.go#L90-L94
2. https://github.com/kubernetes/kubernetes/blob/c9c24b46c35feae41ae5d7f0963551bab4ed6f80/pkg/controller/endpoint/endpoints_controller.go#L585
3. https://github.com/kubernetes/kubernetes/issues/75482
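
Two quick spot checks for the behavior described above (a sketch, reusing the $pod variable from the earlier snippet):

    # look for node.kubernetes.io/unschedulable in the pod's tolerations
    oc -n openshift-dns get pod "$pod" -o jsonpath='{.spec.tolerations}{"\n"}'
    # see which addresses the endpoints object currently lists as ready vs. not ready
    oc -n openshift-dns get endpoints dns-default -o yaml | grep -E 'addresses:|ip:'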

Comment 7 Andrew McDermott 2020-10-23 15:58:19 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 9 Hongan Li 2020-10-29 09:55:40 UTC
See the comment in https://github.com/openshift/cluster-dns-operator/pull/205#issue-500267814:

"These changes make CoreDNS behave more correctly but do not seem to resolve BZ#1884053 completely: In testing, I still see that the DNS pod becomes unresponsive before the endpoints controller removes the pod's address from the endpoints when I reboot a node. It is not clear whether this is a problem with CoreDNS, the kubelet, the endpoints controller, MCO, or something else."

Actually, I tested builds both with and without the fix but did not observe a significant improvement in latency during node reboot. I always see two "connection timed out; no servers could be reached" errors before the endpoints object moves the pod to "notReadyAddresses".

Although the readiness probe has been updated to 8181/ready and terminationGracePeriodSeconds has been extended to 120s, based on the above test results I'd like to assign the BZ back for confirmation.

Hi @Miciah, if those are the results we expect, then please move the BZ back to ON_QA. Thanks.
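
For reference, the two settings mentioned above can be read back from the daemonset with something like (a sketch; assumes the container is named "dns"):

    oc -n openshift-dns get daemonset dns-default -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
    oc -n openshift-dns get daemonset dns-default \
        -o jsonpath='{.spec.template.spec.containers[?(@.name=="dns")].readinessProbe}{"\n"}'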

Comment 10 Colin Walters 2020-11-02 18:53:01 UTC
Hmm.  So one thing we could do perhaps is change the MCO's drain to actively delete at least some daemonset pods?

Actually wait...this is triggering a memory, ah yes:
https://github.com/openshift/machine-config-operator/pull/1491

Comment 11 Colin Walters 2020-11-04 21:55:18 UTC
*** Bug 1893360 has been marked as a duplicate of this bug. ***

Comment 12 Colin Walters 2020-11-04 22:06:00 UTC
To follow up on the timeline above:

    oc adm drain --ignore-daemonsets --delete-local-data --force --grace-period=0 "$node" && oc debug "$node" -- chroot /host shutdown -r +0

This is close to what the MCO does today.  One notable thing is that the shutdown should result in SIGTERM to the active daemonset pods.

However, I just noticed DNS doesn't use `hostNetwork: true` (it can't, I guess). That's going to mean it's dependent on the SDN, and we're going to be stopping those at the same time...so any attempt to handle SIGTERM by e.g. reaching out to the apiserver to mark it inactive is going to potentially race and fail.

An ordering dependency on shutdown for the pods is messy...kubernetes isn't systemd ;)

Perhaps the crude hack is just to make the SDN explicitly shut down last.

Comment 13 Colin Walters 2020-11-04 22:19:08 UTC
OK, looking a bit more carefully, there's no SIGTERM handling going on here - it's just about the plain old kube Service object that is installed.

We need to delete the pod to update that.  I propose that we merge
https://github.com/openshift/machine-config-operator/pull/1491
and change this service to use it.

Comment 14 Colin Walters 2020-11-12 23:28:27 UTC
I thought more about this and concluded we need
https://github.com/kubernetes/kubernetes/pull/96129

Comment 15 Andrew McDermott 2020-11-16 08:26:18 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 16 Andrew McDermott 2020-12-04 16:44:43 UTC
As the proposed PR was recently reverted (https://github.com/openshift/cluster-dns-operator/pull/213) adding UpcomingSprint to look into this again.

Comment 17 Miciah Dashiel Butler Masters 2021-02-05 23:11:55 UTC
I hope to find time to test https://github.com/kubernetes/kubernetes/pull/96129 but have been focused on bug 1919737.  This bug and bug 1919737 describe different scenarios (clean reboot versus node failure) but might be solved by the same fix.  However, I will leave both bugs open for now.

Comment 21 Miciah Dashiel Butler Masters 2021-03-09 19:08:28 UTC
Reproducing the specific problem that https://github.com/openshift/cluster-dns-operator/pull/237 addresses is difficult because it involves a race condition and probably cannot be reproduced except under extreme load. To verify this BZ, please verify that the DNS pods in the openshift-dns namespace have the readiness probe configured to use the "/ready" path (a one-line check is sketched after the snippet below):

    "readinessProbe": {
        "failureThreshold": 3,
        "httpGet": {
            "path": "/ready",
            "port": 8181,
            "scheme": "HTTP"
        },
        "initialDelaySeconds": 10,
        "periodSeconds": 3,
        "successThreshold": 1,
        "timeoutSeconds": 3
    },
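
One way to check this across all DNS pods at once (illustrative; assumes the container is named "dns"):

    oc -n openshift-dns get pods \
        -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[?(@.name=="dns")].readinessProbe.httpGet.path}{"\n"}{end}'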

Comment 22 jechen 2021-03-10 13:11:10 UTC
Verified in 4.8.0-0.nightly-2021-03-10-053945

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-10-053945   True        False         67m     Cluster version is 4.8.0-0.nightly-2021-03-10-053945


$  oc -n openshift-dns get pod/dns-default-ddzx7 -oyaml
<---snip--->
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready               <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        port: 8181                 <-- verified the fix by https://github.com/openshift/cluster-dns-operator/pull/237
        scheme: HTTP               
      initialDelaySeconds: 10
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 3
<---snip--->

Comment 24 Brandi Munilla 2021-06-24 16:43:57 UTC
Hi, does this bug require doc text? If so, please update the doc text field.

Comment 26 errata-xmlrpc 2021-07-27 22:33:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438