Bug 1939723
| Summary: | DNS operator goes degraded when a machine is added and removed (in serial tests) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Candace Holman <cholman> |
| Networking sub component: | DNS | QA Contact: | Melvin Joseph <mjoseph> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, hongli, jchaloup, mfisher, mmasters, wking |
| Version: | 4.7 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: The node-resolver daemonset has been split out from the default dns daemonset. Node-resolver is scheduled on every node and tolerates all taints, but the dns daemonset no longer does; both daemonsets still contribute to the Degraded status. Consequence: Node-resolver pods can land on nodes that are not ready, which ends up marking the whole operator as degraded. Fix: Calculate the Degraded status differently for the node-resolver daemonset, taking its toleration of taints into account. Result: Node-resolver status is no longer considered when calculating DNS status (a rough sketch of this calculation follows the table below). | Story Points: | --- |
| Clone Of: | | Environment: | clusteroperator/dns should not change condition/Degraded |
| Last Closed: | 2022-11-04 15:15:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
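To make the Doc Text fix above concrete, here is a minimal, hypothetical Go sketch of a Degraded computation that consults only the dns DaemonSet and deliberately ignores node-resolver. The function name and the exact condition checks are illustrative assumptions, not the actual cluster-dns-operator code.

package status

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
)

// computeDegradedCondition is a hypothetical helper: the operator's Degraded
// condition is derived from the dns DaemonSet alone. The node-resolver
// DaemonSet tolerates all taints and may have pods on not-yet-ready nodes,
// so its status is intentionally not considered here.
func computeDegradedCondition(dns *appsv1.DaemonSet) operatorv1.OperatorCondition {
	cond := operatorv1.OperatorCondition{
		Type:   operatorv1.OperatorStatusTypeDegraded,
		Status: operatorv1.ConditionFalse,
	}
	// Illustrative check only: report Degraded when the dns DaemonSet has
	// nowhere to schedule pods or none of its pods are available.
	if dns.Status.DesiredNumberScheduled == 0 || dns.Status.NumberAvailable == 0 {
		cond.Status = operatorv1.ConditionTrue
		cond.Reason = "DNSDegraded"
		cond.Message = "DNS default is degraded"
	}
	return cond
}

The real operator logic is more involved; this only shows the shape of "node-resolver status is no longer considered when calculating DNS status."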
Description
Clayton Coleman
2021-03-16 21:52:21 UTC
The operator has configured the daemonset to tolerate all taints since 4.5, so we need to look into whether this is a new issue in 4.8 (i.e., whether the errors do *not* appear in CI for earlier releases), and if so, what is causing it. If this is *not* a new issue, we might need to port over the grace period logic from the ingress operator.

> ... whether the errors do *not* appear in CI for earlier releases...

The CI suite only learned to care about this recently [1], so previous releases will not have the 'should not change condition' reporting. That doesn't mean they don't have the condition-changing behavior, though. And earlier releases should have events that demonstrate the behavior. For example, [2] is a serial-4.7 job, and it has:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104/artifacts/e2e-aws-serial/e2e.log | grep clusteroperator/dns
Mar 20 02:18:18.476 E clusteroperator/dns changed Degraded to True: DNSDegraded: DNS default is degraded
Mar 20 02:19:38.450 W clusteroperator/dns changed Degraded to False

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1373083401094959104

Example recent job [1]:
[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
2 unexpected clusteroperator state transitions during e2e test run
Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
Drilling into the monitor changes that feed those intervals:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/build-log.txt | grep clusteroperator/dns
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:22.463 - 22s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:33:44.788 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 E clusteroperator/dns condition/Degraded status/True reason/DNSDegraded changed: DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:35:40.545 - 76s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded
INFO[2021-04-09T04:19:20Z] Apr 09 03:36:57.181 W clusteroperator/dns condition/Degraded status/False changed:
INFO[2021-04-09T04:19:20Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
ERRO[2021-04-09T04:30:41Z] [bz-DNS] clusteroperator/dns should not change condition/Degraded
Convenient interval chart in [2] shows that this happened during the:
Managed cluster should grow and decrease when scaling different machineSets simultaneously
test case, which is backed by [3]. It looks like that test iterates over all the compute MachineSets and bumps each by one replica, then iterates over them again and returns them to their original replica counts. So I expect that procedure (and probably just the scale-up part) would reproduce this issue; a rough sketch of the scale-up step follows the links below.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-serial/1380353148589182976/artifacts/e2e-gcp-serial/openshift-e2e-test/artifacts/e2e-intervals.html
[3]: https://github.com/openshift/origin/blob/0b4ab1c57dfa4aa1e82b5cddf9ee13f359fe3f05/test/extended/machines/scale.go#L142-L258
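For reference, here is a minimal, hypothetical Go sketch of that scale-up step, using the dynamic client to bump each compute MachineSet in openshift-machine-api by one replica. This illustrates the procedure described above and is not the origin test code in [3]; the client setup and flow are assumptions.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: the local kubeconfig (~/.kube/config) points at the test cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{Group: "machine.openshift.io", Version: "v1beta1", Resource: "machinesets"}
	ctx := context.Background()

	// Scale-up half of the test: bump every compute MachineSet by one replica.
	list, err := client.Resource(gvr).Namespace("openshift-machine-api").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range list.Items {
		ms := &list.Items[i]
		replicas, _, _ := unstructured.NestedInt64(ms.Object, "spec", "replicas")
		if err := unstructured.SetNestedField(ms.Object, replicas+1, "spec", "replicas"); err != nil {
			panic(err)
		}
		if _, err := client.Resource(gvr).Namespace("openshift-machine-api").Update(ctx, ms, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("scaled %s from %d to %d replicas\n", ms.GetName(), replicas, replicas+1)
	}
	// The "decrease" half of the test (not shown) waits for the new machines to
	// join and then returns each MachineSet to its original replica count.
}

Watching `oc get clusteroperator dns -w` while this runs should show whether the Degraded flaps from the intervals above reproduce.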
Still seeing the below failure in recent jobs [1][2]:

[bz-DNS] clusteroperator/dns should not change condition/Degraded
Run #0: Failed 0s
1 unexpected clusteroperator state transitions during e2e test run
Jun 07 08:43:30.422 - 19s E clusteroperator/dns condition/Degraded status/True reason/DNS default is degraded

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401806876223475712
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial/1401790952518979584

*** Bug 1995575 has been marked as a duplicate of this bug. ***

Moving out of 4.10. We'll try to get this in the next release.

This issue is stale and closed because it has had no activity for a significant amount of time and is reported on a version no longer in maintenance. If this issue should not be closed, please verify that the condition still exists on a supported release and submit an updated bug.