Bug 1939723
| Summary: | DNS operator goes degraded when a machine is added and removed (in serial tests) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Candace Holman <cholman> |
| Networking sub component: | DNS | QA Contact: | Melvin Joseph <mjoseph> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, hongli, jchaloup, mfisher, mmasters, wking |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The node-resolver daemonset has been split out from the default dns daemonset. Node-resolver is scheduled on every node and tolerates all taints, but the dns daemonset no longer does; both daemonsets still contribute to the Degraded status. Consequence: Node-resolver pods can land on nodes that are not ready, which ends up marking the whole operator as degraded. Fix: Calculate the Degraded status differently for the node-resolver daemonset, taking its toleration of taints into account. Result: Node-resolver status is no longer considered when calculating DNS status (a rough sketch of this calculation follows the table below). | Story Points: | --- |
| Clone Of: | | Environment: | clusteroperator/dns should not change condition/Degraded |
| Last Closed: | 2022-11-04 15:15:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
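To make the Doc Text fix above concrete, here is a minimal, hypothetical Go sketch of a Degraded computation that consults only the dns DaemonSet and deliberately ignores node-resolver. The function name and the exact condition checks are illustrative assumptions, not the actual cluster-dns-operator code.

package status

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
)

// computeDegradedCondition is a hypothetical helper: the operator's Degraded
// condition is derived from the dns DaemonSet alone. The node-resolver
// DaemonSet tolerates all taints and may have pods on not-yet-ready nodes,
// so its status is intentionally not considered here.
func computeDegradedCondition(dns *appsv1.DaemonSet) operatorv1.OperatorCondition {
	cond := operatorv1.OperatorCondition{
		Type:   operatorv1.OperatorStatusTypeDegraded,
		Status: operatorv1.ConditionFalse,
	}
	// Illustrative check only: report Degraded when the dns DaemonSet has
	// nowhere to schedule pods or none of its pods are available.
	if dns.Status.DesiredNumberScheduled == 0 || dns.Status.NumberAvailable == 0 {
		cond.Status = operatorv1.ConditionTrue
		cond.Reason = "DNSDegraded"
		cond.Message = "DNS default is degraded"
	}
	return cond
}

The real operator logic is more involved; this only shows the shape of "node-resolver status is no longer considered when calculating DNS status."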
Description
Clayton Coleman
2021-03-16 21:52:21 UTC
The operator has configured the daemonset to tolerate all taints since 4.5, so we need to look into whether this is a new issue in 4.8 (i.e., whether the errors do *not* appear in CI for earlier releases), and if so, what is causing it. If this is *not* a new issue, we might need to port over the grace period logic from the ingress operator.

> ... whether the errors do *not* appear in CI for earlier releases...

The CI suite only learned to care about this recently [1], so previous releases will not have the 'should not change condition' reporting. That doesn't mean they don't have the condition-changing behavior, though. And earlier releases should have events that demonstrate the behavior. For example, [2] is a serial-4.7 job, and it has:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104/artifacts/e2e-aws-serial/e2e.log | grep clusteroperator/dns
Mar 20 02:18:18.476 E clusteroperator/dns changed Degraded to True: DNSDegraded: DNS default is degraded
Mar 20 02:19:38.450 W clusteroperator/dns changed Degraded to False

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104

Example recent job [1]:
[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
2 unexpected clusteroperator state transitions during e2e test run
Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
Drilling into the monitor changes that feed those intervals:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/build-log.txt | grep clusteroperator/dns
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:44.788 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:36:57.181 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
ERRO[2021-04-09T04:30:41Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
Convenient interval chart in [2] shows that this happened during the:
Managed cluster should grow and decrease when scaling different machineSets simultaneously
test case, which is backed by [3]. It looks like that test iterates over all the compute MachineSets and bumps each by one replica, then iterates over them again and returns them to their original replica counts. So I expect that procedure (and probably just the scale-up part) would reproduce this issue; a rough sketch of the scale-up step follows the links below.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/artifacts/e2e-gcp-serial/openshift-e2e-test/artifacts/e2e-intervals.html
[3]: https://github.com/openshift/origin/blob/0b4ab1c57dfa4aa1e82b5cddf9ee13f359fe3f05/test/extended/machines/scale.go#L142-L258
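For reference, here is a minimal, hypothetical Go sketch of that scale-up step, using the dynamic client to bump each compute MachineSet in openshift-machine-api by one replica. This illustrates the procedure described above and is not the origin test code in [3]; the client setup and flow are assumptions.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: the local kubeconfig (~/.kube/config) points at the test cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{Group: "machine.openshift.io", Version: "v1beta1", Resource: "machinesets"}
	ctx := context.Background()

	// Scale-up half of the test: bump every compute MachineSet by one replica.
	list, err := client.Resource(gvr).Namespace("openshift-machine-api").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range list.Items {
		ms := &list.Items[i]
		replicas, _, _ := unstructured.NestedInt64(ms.Object, "spec", "replicas")
		if err := unstructured.SetNestedField(ms.Object, replicas+1, "spec", "replicas"); err != nil {
			panic(err)
		}
		if _, err := client.Resource(gvr).Namespace("openshift-machine-api").Update(ctx, ms, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("scaled %s from %d to %d replicas\n", ms.GetName(), replicas, replicas+1)
	}
	// The "decrease" half of the test (not shown) waits for the new machines to
	// join and then returns each MachineSet to its original replica count.
}

Watching `oc get clusteroperator dns -w` while this runs should show whether the Degraded flaps from the intervals above reproduce.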
Still seeing the below failure in recent jobs [1][2]:

[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
1 unexpected clusteroperator state transitions during e2e test run
Jun 07 08:43:30.422 - 19s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401806876223475712
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401790952518979584

*** Bug 1995575 has been marked as a duplicate of this bug. ***

Moving out of 4.10. We'll try to get this in the next release.

This issue is stale and closed because it has had no activity for a significant amount of time and is reported on a version no longer in maintenance. If this issue should not be closed, please verify that the condition still exists on a supported release and submit an updated bug.