Bug 1814446
Summary: | failed to initialize the cluster: Cluster operator openshift-controller-manager is still updating (operator version not set) | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
Component: | openshift-controller-manager | Assignee: | Gabe Montero <gmontero> | |
Status: | CLOSED ERRATA | QA Contact: | wewang <wewang> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.3.z | CC: | adam.kaplan, aos-bugs, gmontero, mfojtik, pmuller | |
Target Milestone: | --- | |||
Target Release: | 4.5.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | devex | |||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: timing windows existed where the openshift controller manager operator would set its progressing condition to false even though it had not registered its version
Consequence: the openshift controller manager operator would not fully satisfy its contract with the install/upgrade and the install/upgrade would ultimately fail
Fix: the timing window which allowed progressing to be set to false prior to setting the version was corrected via code change
Result: the openshift controller manager operator now more readily reports both its version and its progressing condition upon successful install/upgrade
|
Story Points: | --- | |
Clone Of: | ||||
: | 1852249 (view as bug list) | Environment: | ||
Last Closed: | 2020-07-13 17:22:24 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1852249 |
Description
W. Trevor King
2020-03-17 21:59:27 UTC
This seems to have caused two failures in a row in the release-openshift-origin-installer-e2e-gcp-serial-4.3 job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1054#1:build-log.txt%3A31 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1053#1:build-log.txt%3A31 Ok I think I have a plausible theory on what happened after triaging https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152 1) there is a ton api server throttling in the OCM-O log throughout ... so odd timings are certainly conceivable 2) confirmed the version is not set 3) however, the last OCM-O operator during the run around operator status is setting progressging to false: I0317 04:15:00.538222 1 status_controller.go:165] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-03-17T04:15:00Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-03-17T04:06:36Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}} 4) so how does that happen? 5) so setting progressing to false happens here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L135-L139 6) and the setting of the version happens above here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L135-L139 7) but what happens if all the API churn means we our daemonset has not received the version annotation here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117 8) it means we skip setting the versions, but set progressing to false afterward of the replica counts are correct Certainly an edge case, but again plausible. Quotes from Sherlock Holmes (with future citings by Spock) come to mind about "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." :-) So we are going to add a message when and keep progressing true when https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117 returns false forgot to mention an element of the theory: 6.5) something happens to the daemon set update done at https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L359 4.5 gcp serical job did not met the issue, so verify it. https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/1172 Is this getting backported? Came up again in 4.3.9 -> 4.3.27 CI [1,2]. On the other hand, seems like the only time that happened in CI in the past ~14d [3]. [1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1275691585048154112 [2]: https://github.com/openshift/cincinnati-graph-data/pull/298#issuecomment-651373653 [3]: https://search.svc.ci.openshift.org/?search=Cluster+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition%3A+Cluster+operator+openshift-controller-manager+is+still+updating&maxAge=336h&type=junit&name=upgrade%7Claunch&groupBy=job the change is small/safe .... it has to go to 4.4 first, but sure I'll start the backporting process and see how well it is received Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409 |