4 unexpected clusteroperator state transitions during e2e test run

clusteroperator/dns should not change condition/Degraded
  dns was Degraded=false, but became Degraded=true at 2021-03-16 18:58:27.854393083 +0000 UTC -- DNS default is degraded
  dns was Degraded=true, but became Degraded=false at 2021-03-16 18:59:58.41574326 +0000 UTC --
  dns was Degraded=false, but became Degraded=true at 2021-03-16 19:00:38.855832346 +0000 UTC -- DNS default is degraded
  dns was Degraded=true, but became Degraded=false at 2021-03-16 19:01:38.490842979 +0000 UTC --

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256

It looks like at 18:58:27 a new node is created (as part of the serial test) and the dns pod lands on it, but has to wait for multus to start (multus takes a while, perhaps because its image pull is slower than the other pods'), and the operator starts reporting Degraded almost immediately. Adding or removing a node (with drain) is a normal operation, and the operator must not go Degraded because of it.

  Mar 16 18:58:27.737 W ns/openshift-dns pod/dns-default-zp8t4 node/ip-10-0-138-230.us-west-2.compute.internal reason/NetworkNotReady network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? (3 times)
  Mar 16 18:58:27.764 W ns/openshift-multus pod/network-metrics-daemon-2p2vq node/ip-10-0-138-230.us-west-2.compute.internal reason/NetworkNotReady network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started? (3 times)
  Mar 16 18:58:27.854 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded

Marking high because this can cause alert churn during normal operations.
The operator has configured the daemonset to tolerate all taints since 4.5, so we need to look into whether this is a new issue in 4.8 (i.e., whether the errors do *not* appear in CI for earlier releases), and if so, what is causing it. If this is *not* a new issue, we might need to port over the grace period logic from the ingress operator.
> ... whether the errors do *not* appear in CI for earlier releases...

The CI suite only learned to care about this recently [1], so previous releases will not have the 'should not change condition' reporting. That doesn't mean they don't have the condition-changing behavior, though, and earlier releases should have events that demonstrate it. For example, [2] is a serial-4.7 job, and it has:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104/artifacts/e2e-aws-serial/e2e.log | grep clusteroperator/dns
  Mar 20 02:18:18.476 E clusteroperator/dns changed Degraded to True: DNSDegraded: DNS default is degraded
  Mar 20 02:19:38.450 W clusteroperator/dns changed Degraded to False

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104
Example recent job [1]:

  [bz-DNS] clusteroperator/dns should not change condition/Degraded
  Run #0: Failed 0s
  2 unexpected clusteroperator state transitions during e2e test run
  Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
  Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded

Drilling in to the monitor changes that feed those intervals:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/build-log.txt | grep clusteroperator/dns
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
  INFO[2021-04-09T04:19:20Z] Apr 09 03:33:44.788 W clusteroperator/dns condition/Degraded status/False changed:
  INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
  INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
  INFO[2021-04-09T04:19:20Z] Apr 09 03:36:57.181 W clusteroperator/dns condition/Degraded status/False changed:
  INFO[2021-04-09T04:19:20Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
  ERRO[2021-04-09T04:30:41Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded

The convenient interval chart in [2] shows that this happened during the "Managed cluster should grow and decrease when scaling different machineSets simultaneously" test case, which is backed by [3]. That test iterates over all the compute MachineSets, bumping each by one replica, and then iterates over them all again, returning them to their original replica counts. So I expect that procedure (and probably just the scale-up part) would reproduce this issue.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/artifacts/e2e-gcp-serial/openshift-e2e-test/artifacts/e2e-intervals.html
[3]: https://github.com/openshift/origin/blob/0b4ab1c57dfa4aa1e82b5cddf9ee13f359fe3f05/test/extended/machines/scale.go#L142-L258
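For reference, a hypothetical manual reproduction following that procedure; this is a sketch, not a tested script: it assumes a live cluster with the oc CLI, and the MachineSet name and replica counts are placeholders you would fill in per MachineSet.

```
# List the compute MachineSets and their current replica counts.
oc -n openshift-machine-api get machinesets

# Grow phase: bump each MachineSet by one (repeat for every MachineSet).
oc -n openshift-machine-api scale machineset/<name> --replicas=<original+1>

# Watch the dns clusteroperator conditions while the new nodes join
# and their dns pods wait on the CNI plugin.
oc get clusteroperator dns -w

# Shrink phase: return each MachineSet to its original replica count.
oc -n openshift-machine-api scale machineset/<name> --replicas=<original>
```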
Still seeing the failure below in recent jobs [1][2]:

  [bz-DNS] clusteroperator/dns should not change condition/Degraded
  Run #0: Failed 0s
  1 unexpected clusteroperator state transitions during e2e test run
  Jun 07 08:43:30.422 - 19s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401806876223475712
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401790952518979584
*** Bug 1995575 has been marked as a duplicate of this bug. ***
Moving out of 4.10. We'll try to get this in the next release.
This issue is stale and has been closed because it has had no activity for a significant amount of time and is reported against a version that is no longer in maintenance. If this issue should not be closed, please verify that the condition still exists on a supported release and submit an updated bug.