Bug 1939723 - DNS operator goes degraded when a machine is added and removed (in serial tests)
Summary: DNS operator goes degraded when a machine is added and removed (in serial tests)
Keywords:
Status: POST
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Candace Holman
QA Contact: Melvin Joseph
URL:
Whiteboard:
Duplicates: 1995575
Depends On:
Blocks:
 
Reported: 2021-03-16 21:52 UTC by Clayton Coleman
Modified: 2022-08-04 22:39 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The node-resolver daemonset was split out from the default dns daemonset. The node-resolver daemonset is scheduled on every node and tolerates all taints, but the dns daemonset no longer does. Both daemonsets still contributed to the operator's Degraded status.
Consequence: Node-resolver pods could land on nodes that were not yet ready, which marked the whole DNS operator as degraded.
Fix: The Degraded status is now calculated differently for the node-resolver daemonset, taking its toleration of taints into account.
Result: The node-resolver daemonset's status is no longer considered when calculating the DNS operator's status.
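
Below is a minimal illustrative sketch in Go of the idea described above (this is not the operator's actual code; the function name, reasons, and messages are hypothetical): derive the Degraded condition from the dns-default daemonset status alone, so the node-resolver daemonset, which tolerates all taints, can no longer flip the operator to Degraded.

  package main

  import (
      "fmt"

      operatorv1 "github.com/openshift/api/operator/v1"
      appsv1 "k8s.io/api/apps/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // computeDNSDegradedCondition is a hypothetical helper: it derives the
  // Degraded condition from the dns-default daemonset alone, so an unready
  // node-resolver pod on a tainted or not-yet-ready node cannot mark the
  // whole operator degraded.
  func computeDNSDegradedCondition(dns *appsv1.DaemonSet) operatorv1.OperatorCondition {
      cond := operatorv1.OperatorCondition{
          Type:               operatorv1.OperatorStatusTypeDegraded,
          Status:             operatorv1.ConditionFalse,
          Reason:             "AsExpected",
          LastTransitionTime: metav1.Now(),
      }
      want := dns.Status.DesiredNumberScheduled
      have := dns.Status.NumberAvailable
      switch {
      case want == 0:
          cond.Status = operatorv1.ConditionTrue
          cond.Reason = "NoDNSPodsDesired"
          cond.Message = "no DNS pods are desired (are all nodes tainted?)"
      case have == 0:
          cond.Status = operatorv1.ConditionTrue
          cond.Reason = "NoDNSPodsAvailable"
          cond.Message = "no DNS pods are available"
      case have < want:
          cond.Status = operatorv1.ConditionTrue
          cond.Reason = "SomeDNSPodsUnavailable"
          cond.Message = fmt.Sprintf("%d of %d DNS pods are unavailable", want-have, want)
      }
      return cond
  }

  func main() {
      // Example: 5 of 6 desired dns-default pods available, e.g. while a newly
      // added node waits for its CNI. This sketch still reports Degraded right
      // away; softening that with a grace period is the follow-up discussed in
      // comment 1 and PR 290.
      ds := &appsv1.DaemonSet{Status: appsv1.DaemonSetStatus{DesiredNumberScheduled: 6, NumberAvailable: 5}}
      fmt.Printf("%+v\n", computeDNSDegradedCondition(ds))
  }
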
Clone Of:
Environment:
clusteroperator/dns should not change condition/Degraded
Last Closed:
Target Upstream Version:




Links
Github openshift cluster-dns-operator pull 273 (Merged): Bug 1939723: Don't check node-resolver status for DNS Degraded condition (last updated 2022-06-14 20:26:45 UTC)
Github openshift cluster-dns-operator pull 290 (open): Bug 1939723: use a grace period for requeuable degraded conditions (last updated 2022-06-17 15:01:12 UTC)

Description Clayton Coleman 2021-03-16 21:52:21 UTC
4 unexpected clusteroperator state transitions during e2e test run 

clusteroperator/dns should not change condition/Degraded

dns was Degraded=false, but became Degraded=true at 2021-03-16 18:58:27.854393083 +0000 UTC -- DNS default is degraded
dns was Degraded=true, but became Degraded=false at 2021-03-16 18:59:58.41574326 +0000 UTC -- 
dns was Degraded=false, but became Degraded=true at 2021-03-16 19:00:38.855832346 +0000 UTC -- DNS default is degraded
dns was Degraded=true, but became Degraded=false at 2021-03-16 19:01:38.490842979 +0000 UTC -- 

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256

It looks like at 18:58:27 a new node is created (as part of the serial test) and dns lands on it but has to wait for multus to start (multus takes a while, perhaps because its image pull is slower than everyone else's), and then the operator starts reporting Degraded almost immediately.

Adding or removing a node (with drain) is a normal operation, and the operator must not go degraded because of it.

Mar 16 18:58:27.737 W ns/openshift-dns pod/dns-default-zp8t4 node/ip-10-0-138-230.us-west-2.compute.internal reason/NetworkNotReady network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? (3 times)
Mar 16 18:58:27.764 W ns/openshift-multus pod/network-metrics-daemon-2p2vq node/ip-10-0-138-230.us-west-2.compute.internal reason/NetworkNotReady network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? (3 times)
Mar 16 18:58:27.854 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded

Marking high because this can cause alert churn during normal operations.
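
For anyone wanting to watch this happen outside of CI, here is a rough observation sketch in Go (my own, not part of the CI tooling; the polling interval and program structure are arbitrary choices) that polls the clusteroperator/dns Degraded condition while a node is added or removed:

  package main

  import (
      "context"
      "fmt"
      "time"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
      "k8s.io/apimachinery/pkg/runtime/schema"
      "k8s.io/client-go/dynamic"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
      if err != nil {
          panic(err)
      }
      client, err := dynamic.NewForConfig(cfg)
      if err != nil {
          panic(err)
      }
      gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
      for {
          // Fetch clusteroperator/dns and print its Degraded condition.
          co, err := client.Resource(gvr).Get(context.Background(), "dns", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
          conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
          for _, c := range conditions {
              cond, ok := c.(map[string]interface{})
              if !ok || cond["type"] != "Degraded" {
                  continue
              }
              fmt.Printf("%s Degraded=%v reason=%v message=%v\n",
                  time.Now().Format(time.RFC3339), cond["status"], cond["reason"], cond["message"])
          }
          time.Sleep(10 * time.Second)
      }
  }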

Comment 1 Miciah Dashiel Butler Masters 2021-03-18 20:00:54 UTC
The operator has configured the daemonset to tolerate all taints since 4.5, so we need to look into whether this is a new issue in 4.8 (i.e., whether the errors do *not* appear in CI for earlier releases), and if so, what is causing it.  

If this is *not* a new issue, we might need to port over the grace period logic from the ingress operator.
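
For reference, a minimal sketch of that grace-period idea (this is not the ingress operator's actual implementation; degradedCheck and evaluateDegraded are hypothetical names): only report Degraded=True once a failing check has persisted past its grace period, and otherwise return how long to wait before requeueing a re-check.

  package main

  import (
      "fmt"
      "time"
  )

  // degradedCheck is a hypothetical record of one failing check and when it
  // was first observed.
  type degradedCheck struct {
      reason      string
      firstSeen   time.Time
      gracePeriod time.Duration
  }

  // evaluateDegraded returns whether the operator should report Degraded=True
  // now and, if not yet, how long to wait before requeueing a re-check.
  func evaluateDegraded(checks []degradedCheck, now time.Time) (degraded bool, requeueAfter time.Duration) {
      for _, c := range checks {
          elapsed := now.Sub(c.firstSeen)
          if elapsed >= c.gracePeriod {
              // The condition has persisted long enough to be reported.
              return true, 0
          }
          // Not degraded yet; remember the soonest time we need to look again.
          remaining := c.gracePeriod - elapsed
          if requeueAfter == 0 || remaining < requeueAfter {
              requeueAfter = remaining
          }
      }
      return false, requeueAfter
  }

  func main() {
      now := time.Now()
      checks := []degradedCheck{
          // A DNS pod on a freshly added node has been unavailable for 30s;
          // with a 5-minute grace period this should not flip Degraded yet.
          {reason: "SomeDNSPodsUnavailable", firstSeen: now.Add(-30 * time.Second), gracePeriod: 5 * time.Minute},
      }
      degraded, requeue := evaluateDegraded(checks, now)
      fmt.Printf("reason=%s degraded=%v requeueAfter=%s\n", checks[0].reason, degraded, requeue)
  }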

Comment 2 W. Trevor King 2021-03-20 22:24:16 UTC
> ... whether the errors do *not* appear in CI for earlier releases...

The CI suite only learned to care about this recently [1], so previous releases will not have the 'should not change condition' reporting.  That doesn't mean they don't have the condition-changing behavior though.  And earlier releases should have events that demonstrate the behavior.  For example, [2] is a serial-4.7 job, and it has:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104/artifacts/e2e-aws-serial/e2e.log | grep clusteroperator/dns
  Mar 20 02:18:18.476 E clusteroperator/dns changed Degraded to True: DNSDegraded: DNS default is degraded
  Mar 20 02:19:38.450 W clusteroperator/dns changed Degraded to False

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104

Comment 6 W. Trevor King 2021-04-09 21:01:41 UTC
Example recent job [1]:

  [bz-DNS] clusteroperator/dns should not change condition/Degraded
  Run #0: Failed	0s
    2 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 03:33:22.463 - 22s   E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
    Apr 09 03:35:40.545 - 76s   E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded

Drilling in to the monitor changes that feed those intervals:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/build-log.txt | grep clusteroperator/dns
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded 
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 - 22s   E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded 
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:44.788 W clusteroperator/dns condition/Degraded status/False changed:  
  INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded 
  INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 - 76s   E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded 
  INFO[2021-04-09T04:19:20Z] Apr 09 03:36:57.181 W clusteroperator/dns condition/Degraded status/False changed:  
  INFO[2021-04-09T04:19:20Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded 
  ERRO[2021-04-09T04:30:41Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded 

Convenient interval chart in [2] shows that this happened during the:

  Managed cluster should grow and decrease when scaling different machineSets simultaneously

test-case, which is backed by [3].  Looks like that's iterating over all the compute MachineSets and bumping by one, followed by iterating over them all and returning to the original replicas.  So I expect that procedure (and probably just the scale up part) would reproduce this issue.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/artifacts/e2e-gcp-serial/openshift-e2e-test/artifacts/e2e-intervals.html
[3]: https://github.com/openshift/origin/blob/0b4ab1c57dfa4aa1e82b5cddf9ee13f359fe3f05/test/extended/machines/scale.go#L142-L258
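
A rough reproduction sketch in Go, under stated assumptions (this is not the origin test code; it only does the scale-up half, uses the dynamic client, and leaves restoring the original replica counts to the reader): bump every compute MachineSet in openshift-machine-api by one replica, then watch clusteroperator/dns as in the earlier sketch while the new nodes join.

  package main

  import (
      "context"
      "fmt"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
      "k8s.io/apimachinery/pkg/runtime/schema"
      "k8s.io/client-go/dynamic"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
      if err != nil {
          panic(err)
      }
      client, err := dynamic.NewForConfig(cfg)
      if err != nil {
          panic(err)
      }
      gvr := schema.GroupVersionResource{
          Group:    "machine.openshift.io",
          Version:  "v1beta1",
          Resource: "machinesets",
      }
      ctx := context.Background()
      sets, err := client.Resource(gvr).Namespace("openshift-machine-api").List(ctx, metav1.ListOptions{})
      if err != nil {
          panic(err)
      }
      for i := range sets.Items {
          ms := &sets.Items[i]
          replicas, found, err := unstructured.NestedInt64(ms.Object, "spec", "replicas")
          if err != nil || !found {
              continue
          }
          // Bump the replica count by one; a new node gets provisioned and the
          // dns-default pod is scheduled to it before the node's CNI is ready.
          if err := unstructured.SetNestedField(ms.Object, replicas+1, "spec", "replicas"); err != nil {
              panic(err)
          }
          if _, err := client.Resource(gvr).Namespace("openshift-machine-api").Update(ctx, ms, metav1.UpdateOptions{}); err != nil {
              panic(err)
          }
          fmt.Printf("scaled %s from %d to %d\n", ms.GetName(), replicas, replicas+1)
      }
  }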

Comment 12 Hongan Li 2021-06-07 10:24:39 UTC
Still seeing the failure below in recent jobs [1][2]:

[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 	0s
1 unexpected clusteroperator state transitions during e2e test run 

Jun 07 08:43:30.422 - 19s   E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401806876223475712

[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401790952518979584

Comment 15 Miciah Dashiel Butler Masters 2021-08-19 16:21:01 UTC
*** Bug 1995575 has been marked as a duplicate of this bug. ***

Comment 20 Miciah Dashiel Butler Masters 2022-01-24 17:59:05 UTC
Moving out of 4.10.  We'll try to get this in the next release.

