Bug 1761506 - dns operator degraded status flaps during rollouts
Summary: dns operator degraded status flaps during rollouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1762960
 
Reported: 2019-10-14 14:31 UTC by Dan Mace
Modified: 2022-08-04 22:39 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1762960 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:07:31 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 134 0 'None' closed Bug 1761506: status: prevent degraded status flapping on rollout 2021-02-10 14:43:45 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:08:05 UTC

Description Dan Mace 2019-10-14 14:31:26 UTC
Description of problem:

During a rollout, the dns-operator degraded status will flap. Noticed most recently in https://bugzilla.redhat.com/show_bug.cgi?id=1760473#c1:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available

But I think a better example of flapping is:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
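The flapping in the log above can be quantified by counting the Degraded-to-True transitions. A minimal sketch, using a few of the sample lines above saved locally in place of the real build-log.txt download (the file path is illustrative):

```shell
# Save a sample of the build log locally (stands in for the curl download above).
cat > /tmp/build-log-sample.txt <<'EOF'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
EOF

# Count how many times the dns operator went Degraded=True.
# Prints 2 for this sample; anything above 1 per rollout suggests flapping.
grep -c 'clusteroperator/dns changed Degraded to True' /tmp/build-log-sample.txt
```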



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Hongan Li 2019-10-17 08:21:28 UTC
Verified with 4.3.0-0.nightly-2019-10-16-194525.

### force one dns pod to be unavailable; the "Degraded" status is False
$ oc -n openshift-dns get ds
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   5         5         4       5            4           kubernetes.io/os=linux   6h26m
$ oc get dns.operator -o yaml
<---snip--->
    conditions:
    - lastTransitionTime: "2019-10-17T02:40:34Z"
      message: ClusterIP assigned to DNS Service and minimum DaemonSet pods running
      reason: AsExpected
      status: "False"
      type: Degraded
    - lastTransitionTime: "2019-10-17T07:41:39Z"
      message: 4 Nodes running a DaemonSet pod, want 5
      reason: Reconciling
      status: "True"
      type: Progressing
$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.3.0-0.nightly-2019-10-16-194525   True        True          False      6h28m

### force two dns pods to be unavailable; the "Degraded" status is True
$ oc get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   5         5         3       5            3           kubernetes.io/os=linux   6h40m
$ oc get dns.operator -o yaml
<---snip--->
    conditions:
    - lastTransitionTime: "2019-10-17T07:54:03Z"
      message: Too many unavailable CoreDNS pods (2 > 1 max unavailable)
      reason: MaxUnavailableExceeded
      status: "True"
      type: Degraded

$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.3.0-0.nightly-2019-10-16-194525   True        True          True       6h40m
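The two checks above exercise the post-fix threshold: Degraded only flips to True when the number of unavailable DNS pods exceeds the DaemonSet's max unavailable (1 here), rather than on any unavailability. A minimal shell sketch of that decision (illustrative only, not the operator's actual Go code; the function name and messages are assumptions):

```shell
# Hypothetical sketch of the post-fix Degraded check: compare unavailable
# pods (desired - available) against the max-unavailable threshold.
degraded_status() {
  local desired=$1 available=$2 max_unavailable=${3:-1}
  local unavailable=$((desired - available))
  if [ "$unavailable" -gt "$max_unavailable" ]; then
    echo "Degraded=True (MaxUnavailableExceeded: $unavailable > $max_unavailable)"
  else
    echo "Degraded=False (AsExpected)"
  fi
}

degraded_status 5 4   # one pod down:  Degraded stays False
degraded_status 5 3   # two pods down: Degraded goes True
```

With the threshold at 1, a single pod cycling during a rollout no longer toggles Degraded, which is what stops the flapping.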

Comment 3 Hongan Li 2019-11-05 11:58:18 UTC
Just noticed that two nodes (one master and one worker) may be "NotReady" at the same time during an upgrade, so DNS Degraded can still be seen with reason: NotAllDNSesAvailable.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.2     True        True          178m    Working towards 4.3.0-0.nightly-2019-11-02-092336: 13% complete

$ oc get node
NAME                                                        STATUS                        ROLES    AGE     VERSION
compute-0                                                   Ready                         worker   7h26m   v1.14.6+7e13ab9a7
control-plane-0                                             NotReady,SchedulingDisabled   master   7h26m   v1.16.2
control-plane-1                                             Ready                         master   7h26m   v1.14.6+7e13ab9a7
control-plane-2                                             Ready                         master   7h26m   v1.14.6+7e13ab9a7
rhel77-0.weinliu-422-upgrade2.qe.devcluster.openshift.com   Ready                         worker   4h25m   v1.14.6+0365fa172
rhel77-1.weinliu-422-upgrade2.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker   4h25m   v1.14.6+0365fa172

$ oc get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         4       6            4           kubernetes.io/os=linux   7h25m

$ oc get co/dns -o yaml
<---snip--->
status:
  conditions:
  - lastTransitionTime: "2019-11-05T08:53:50Z"
    message: Not all desired DNS DaemonSets available
    reason: NotAllDNSesAvailable
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-11-05T08:53:30Z"
    message: Not all DNS DaemonSets available.
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-05T04:09:04Z"
    message: At least 1 DNS DaemonSet available
    reason: AsExpected
    status: "True"
    type: Available

$ oc get dns.operator -o yaml
<---snip--->
  status:
    clusterDomain: cluster.local
    clusterIP: 172.30.0.10
    conditions:
    - lastTransitionTime: "2019-11-05T08:53:50Z"
      message: Too many unavailable CoreDNS pods (2 > 1 max unavailable)
      reason: MaxUnavailableExceeded
      status: "True"
      type: Degraded
    - lastTransitionTime: "2019-11-05T08:53:50Z"
      message: 4 Nodes running a DaemonSet pod, want 6
      reason: Reconciling
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-11-05T04:09:04Z"
      message: Minimum number of Nodes running DaemonSet pod
      reason: AsExpected
      status: "True"
      type: Available

So I am reopening this to consider whether we should adjust max unavailable to 2.
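The NotReady overlap driving the counts above (4 available out of 6 desired) can be spotted from saved node output. A small illustrative helper; the sample lines mirror the `oc get node` snapshot above and the file path is an assumption:

```shell
# Save a sample of `oc get node` output (abbreviated from the snapshot above).
cat > /tmp/nodes.txt <<'EOF'
compute-0         Ready                         worker   7h26m
control-plane-0   NotReady,SchedulingDisabled   master   7h26m
control-plane-1   Ready                         master   7h26m
control-plane-2   Ready                         master   7h26m
rhel77-0          Ready                         worker   4h25m
rhel77-1          NotReady,SchedulingDisabled   worker   4h25m
EOF

# Count nodes whose status starts with NotReady: 2 in this sample, i.e. more
# nodes down at once than the DaemonSet's max unavailable of 1.
awk '$2 ~ /^NotReady/' /tmp/nodes.txt | wc -l
```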

Comment 4 Dan Mace 2019-11-07 15:20:56 UTC
Hongan,

Prior to the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1753059 (which just merged yesterday), we were scheduling DNS pods even on tainted/unschedulable nodes, which looks like what you're seeing. In this scenario, I wonder whether the desired replica count should be 4 instead of 6. I think the toleration fix would achieve that.

Mind giving it another try with the latest payload? If that helps, I think I'd prefer to leave max unavailable at 1 for the time being.

Thank you for your detailed testing!

Comment 5 Hongan Li 2019-11-08 02:57:50 UTC
Yes, that makes sense.
Actually, there are two phases during an upgrade. The first is the co/dns upgrade: all nodes should be ready, and at most 1 dns pod is unavailable during that time (fixed in this BZ).
The second is the co/machine-config upgrade: it may mark more than one node as NotReady simultaneously, which causes DNS to go Degraded. If the desired replica count can be adjusted properly in this case, that should fix the issue.

So let's move to https://bugzilla.redhat.com/show_bug.cgi?id=1753059 to track the latter. Moving this back to verified. Thank you, Dan.

Comment 7 errata-xmlrpc 2020-01-23 11:07:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

