Bug 1939723
| Summary: | DNS operator goes degraded when a machine is added and removed (in serial tests) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Candace Holman <cholman> |
| Networking sub component: | DNS | QA Contact: | Melvin Joseph <mjoseph> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, hongli, jchaloup, mfisher, mmasters, wking |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | clusteroperator/dns should not change condition/Degraded |
| Last Closed: | 2022-11-04 15:15:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause: The node-resolver daemonset has been split out from the default dns daemonset. Node-resolver lands on every node and tolerates taints, but dns no longer does. Both daemonsets still contribute to the Degraded status.

Consequence: Node-resolver can land on nodes that aren't ready, which ends up marking the whole operator as degraded.

Fix: Calculate the Degraded status differently for the node-resolver daemonset, taking its toleration of taints into account.

Result: Node-resolver status is no longer considered when calculating DNS status.
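For illustration only, here is a minimal sketch of the behavior the Result describes (this is not the cluster-dns-operator's actual code, and the function and field choices are assumptions): the Degraded condition is derived from the default DNS daemonset alone, while the node-resolver daemonset, which tolerates all taints and can be scheduled onto nodes that are not yet ready, is left out of the calculation.

```go
// Hypothetical sketch only; the real logic lives in the cluster-dns-operator.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// computeDegraded reports whether the operator should be Degraded. Only the
// default DNS daemonset is consulted; the node-resolver daemonset is ignored,
// because it tolerates all taints and may land on nodes that are not ready
// yet (e.g. machines freshly added by the serial scale test).
func computeDegraded(dns, nodeResolver appsv1.DaemonSetStatus) bool {
	_ = nodeResolver // deliberately unused in this sketch
	return dns.NumberAvailable == 0 || dns.NumberAvailable < dns.DesiredNumberScheduled
}

func main() {
	dns := appsv1.DaemonSetStatus{DesiredNumberScheduled: 3, NumberAvailable: 3}
	// A new machine joined; node-resolver landed there but the node is not ready.
	nodeResolver := appsv1.DaemonSetStatus{DesiredNumberScheduled: 4, NumberAvailable: 3}
	fmt.Println(computeDegraded(dns, nodeResolver)) // false: a node-resolver gap no longer degrades the operator
}
```

The real operator's thresholds are more nuanced than this, but the key point of the fix is that gaps in node-resolver availability no longer feed the operator's Degraded condition.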
Description
Clayton Coleman, 2021-03-16 21:52:21 UTC
The operator has configured the daemonset to tolerate all taints since 4.5, so we need to look into whether this is a new issue in 4.8 (i.e., whether the errors do *not* appear in CI for earlier releases), and if so, what is causing it. If this is *not* a new issue, we might need to port over the grace period logic from the ingress operator.

> ... whether the errors do *not* appear in CI for earlier releases...

The CI suite only learned to care about this recently [1], so previous releases will not have the 'should not change condition' reporting. That doesn't mean they don't have the condition-changing behavior, though. And earlier releases should have events that demonstrate the behavior. For example, [2] is a serial-4.7 job, and it has:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104/artifacts/e2e-aws-serial/e2e.log | grep clusteroperator/dns
Mar 20 02:18:18.476 E clusteroperator/dns changed Degraded to True: DNSDegraded: DNS default is degraded
Mar 20 02:19:38.450 W clusteroperator/dns changed Degraded to False
```

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104

Example recent job [1]:

```
[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
2 unexpected clusteroperator state transitions during e2e test run
Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
```

Drilling into the monitor changes that feed those intervals:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/build-log.txt | grep clusteroperator/dns
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:44.788 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:36:57.181 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
ERRO[2021-04-09T04:30:41Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
```

The convenient interval chart in [2] shows that this happened during the "Managed cluster should grow and decrease when scaling different machineSets simultaneously" test case, which is backed by [3]. It looks like that test iterates over all the compute MachineSets and bumps each by one replica, then iterates over them all again and returns them to their original replica counts. So I expect that procedure (and probably just the scale-up part) would reproduce this issue.
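As a rough way to exercise that scale-up-then-restore procedure outside the suite, here is a hedged sketch (this is not the origin test code in [3] below; it assumes cluster-admin credentials in the default kubeconfig location and a machine-API-managed cluster like the AWS/GCP CI jobs):

```go
// Hypothetical reproduction sketch: bump every compute MachineSet by one
// replica, then restore the original counts, while watching the dns
// clusteroperator for a Degraded=True blip.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var machineSetGVR = schema.GroupVersionResource{
	Group: "machine.openshift.io", Version: "v1beta1", Resource: "machinesets",
}

// setReplicas updates spec.replicas on the named MachineSet.
func setReplicas(ctx context.Context, c dynamic.ResourceInterface, name string, replicas int64) error {
	ms, err := c.Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := unstructured.SetNestedField(ms.Object, replicas, "spec", "replicas"); err != nil {
		return err
	}
	_, err = c.Update(ctx, ms, metav1.UpdateOptions{})
	return err
}

func main() {
	// Assumes a kubeconfig with cluster-admin at the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	machineSets := client.Resource(machineSetGVR).Namespace("openshift-machine-api")

	list, err := machineSets.List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Scale up: one extra replica per compute MachineSet.
	original := map[string]int64{}
	for i := range list.Items {
		name := list.Items[i].GetName()
		replicas, _, _ := unstructured.NestedInt64(list.Items[i].Object, "spec", "replicas")
		original[name] = replicas
		if err := setReplicas(ctx, machineSets, name, replicas+1); err != nil {
			panic(err)
		}
		fmt.Printf("scaled %s: %d -> %d\n", name, replicas, replicas+1)
	}

	// Wait for the new machines/nodes and watch `oc get clusteroperator dns`
	// here, then scale back down to the original counts.
	for name, replicas := range original {
		if err := setReplicas(ctx, machineSets, name, replicas); err != nil {
			panic(err)
		}
		fmt.Printf("restored %s to %d\n", name, replicas)
	}
}
```

If the behavior described above holds, the Degraded blip should show up while the new machines are joining, before their nodes become ready.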
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/artifacts/e2e-gcp-serial/openshift-e2e-test/artifacts/e2e-intervals.html
[3]: https://github.com/openshift/origin/blob/0b4ab1c57dfa4aa1e82b5cddf9ee13f359fe3f05/test/extended/machines/scale.go#L142-L258

Still seeing the failure below in recent jobs [1][2]:

```
[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
1 unexpected clusteroperator state transitions during e2e test run
Jun 07 08:43:30.422 - 19s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
```

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401806876223475712
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401790952518979584

*** Bug 1995575 has been marked as a duplicate of this bug. ***

Moving out of 4.10. We'll try to get this in the next release.

This issue is stale and has been closed because it has had no activity for a significant amount of time and is reported against a version that is no longer in maintenance. If this issue should not be closed, please verify that the condition still exists on a supported release and submit an updated bug.