A 4.3 GCP release-promotion job (4.3.0-0.nightly-2020-03-16-193416) failed with [1]:

level=fatal msg="failed to initialize the cluster: Cluster operator openshift-controller-manager is still updating"
2020/03/17 04:43:51 Container setup in pod e2e-gcp failed, exit code 1, reason Error

That's not all that much detail, cf. the CVO complaining about that operator in 4.0's bug 1694216. Looking at the ClusterOperator gathered post-test:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .reason' | sort
2020-03-17T04:05:16Z Degraded False AsExpected
2020-03-17T04:05:16Z Upgradeable Unknown NoData
2020-03-17T04:06:36Z Available True AsExpected
2020-03-17T04:15:00Z Progressing False AsExpected

So that all seems fine, and things looked happy well before the ~4:43 timeout. Checking the manifests:

$ oc adm release extract --to=manifests registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-03-16-193416
Extracted release payload from digest sha256:7a3286b8fa474f5c52a6795bc56f9225eb2d343c775105878b10a6fad73c1f2f created at 2020-03-16T19:44:59Z
$ cat manifests/0000_50_cluster-openshift-controller-manager-operator_10_clusteroperator.yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: openshift-controller-manager
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
spec: {}
status:
  versions:
  - name: operator
    version: "4.3.0-0.nightly-2020-03-16-193416"

So the problem is that the operator is not setting its version at all:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status | keys[]'
conditions
extension
relatedObjects

Docs on the version requirement are in [2] (they also apply to 4.3, but the docs weren't backported); a minimal sketch of that gate follows the links below. For comparison, the next job (also running 4.3.0-0.nightly-2020-03-16-193416) does set the operator version [3]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1153/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status.versions'
[
  {
    "name": "operator",
    "version": "4.3.0-0.nightly-2020-03-16-193416"
  }
]

It's not clear to me why the operator version was not set in job 1152, but maybe the operator's pod logs will help [4,5]? The flake seems rare, with job 1152 being the only instance of this setup-time failure in the past 24h [6].
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152
[2]: https://github.com/openshift/cluster-version-operator/blob/release-4.4/docs/user/reconciliation.md#clusteroperator
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1153
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-6df5b984ff-mdwjq_operator_previous.log
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-6df5b984ff-mdwjq_operator.log
[6]: https://search.svc.ci.openshift.org/chart?search=failed%20to%20initialize%20the%20cluster:%20Cluster%20operator%20openshift-controller-manager%20is%20still%20updating
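For anyone following along, here is a minimal Go sketch of the version gate described in [2], assuming the CVO's behavior is roughly "keep waiting while any name/version pair declared in the manifest's status.versions is missing from the live ClusterOperator"; the names below are illustrative, not the CVO's actual identifiers:

package main

import "fmt"

type operandVersion struct {
	Name    string
	Version string
}

// versionsSatisfied reports whether every version declared in the release
// manifest appears in the live ClusterOperator's status.versions.
func versionsSatisfied(expected, live []operandVersion) bool {
	for _, want := range expected {
		found := false
		for _, got := range live {
			if got == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// From the extracted manifest above.
	expected := []operandVersion{{Name: "operator", Version: "4.3.0-0.nightly-2020-03-16-193416"}}
	// Job 1152: the live object had no status.versions key at all.
	var live []operandVersion
	if !versionsSatisfied(expected, live) {
		fmt.Println("Cluster operator openshift-controller-manager is still updating")
	}
}

Under that assumption, a ClusterOperator with happy conditions but an empty status.versions would hang the install exactly as seen here.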
This seems to have caused two failures in a row in the release-openshift-origin-installer-e2e-gcp-serial-4.3 job:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1054#1:build-log.txt%3A31
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1053#1:build-log.txt%3A31
Ok, I think I have a plausible theory on what happened after triaging https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152

1) There is a ton of API server throttling in the OCM-O log throughout, so odd timings are certainly conceivable.
2) Confirmed the version is not set.
3) However, the last OCM-O status update during the run sets Progressing to false:

I0317 04:15:00.538222 1 status_controller.go:165] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-03-17T04:15:00Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-03-17T04:06:36Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}

4) So how does that happen?
5) Setting Progressing to false happens here: https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L135-L139
6) And the setting of the version happens just above those lines, in the same function.
7) But what happens if all the API churn means our daemonset has not yet received the version annotation checked here: https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117
8) It means we skip setting the versions, but still set Progressing to false afterward if the replica counts are correct.

Certainly an edge case, but again plausible. Quotes from Sherlock Holmes (with future citations by Spock) come to mind: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." :-)

So we are going to add a message and keep Progressing true when https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117 returns false; a condensed sketch of the flow and the fix follows below.
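Here is a hypothetical condensation of the sync flow from points 5-8, with the fix folded in. This is an illustrative sketch, not the operator's actual code: the struct, helper names, and the "release.openshift.io/version" annotation key are stand-ins for what sync_openshiftcontrollermanager_v311_00.go really inspects.

package main

import "fmt"

// Minimal stand-in for the daemonset fields the sync loop looks at.
type daemonSet struct {
	Annotations            map[string]string
	NumberAvailable        int32
	DesiredNumberScheduled int32
}

// syncStatus mirrors the ordering described above: the version is only
// recorded once the daemonset carries the expected version (point 7), but
// Progressing used to be computed from replica counts alone, so it could
// flip to False with status.versions still unset (point 8).
func syncStatus(ds daemonSet, expectedVersion string) (progressing, versionSet bool) {
	versionReady := ds.Annotations["release.openshift.io/version"] == expectedVersion
	if versionReady {
		versionSet = true // the real code records status.versions here
	}

	// Old behavior: replica counts alone drove Progressing.
	progressing = ds.NumberAvailable != ds.DesiredNumberScheduled

	// The fix: also stay Progressing=True (with an explanatory message)
	// while the daemonset has not yet picked up the expected version.
	if !versionReady {
		progressing = true
	}
	return progressing, versionSet
}

func main() {
	// The job 1152 edge case: replicas look healthy, but the daemonset
	// never observed the new version annotation, so versions stay unset.
	ds := daemonSet{
		Annotations:            map[string]string{},
		NumberAvailable:        3,
		DesiredNumberScheduled: 3,
	}
	progressing, versionSet := syncStatus(ds, "4.3.0-0.nightly-2020-03-16-193416")
	fmt.Printf("progressing=%v versionSet=%v\n", progressing, versionSet) // progressing=true versionSet=false
}

Without the final !versionReady guard, this input would report progressing=false with versionSet=false, which is exactly the state the CVO saw in job 1152.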
I forgot to mention an element of the theory: 6.5) something goes wrong with the daemonset update done at https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L359
The 4.5 GCP serial job did not hit the issue, so marking this verified: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/1172
Is this getting backported? It came up again in 4.3.9 -> 4.3.27 CI [1,2]. On the other hand, that seems to be the only time it happened in CI in the past ~14d [3].

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1275691585048154112
[2]: https://github.com/openshift/cincinnati-graph-data/pull/298#issuecomment-651373653
[3]: https://search.svc.ci.openshift.org/?search=Cluster+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition%3A+Cluster+operator+openshift-controller-manager+is+still+updating&maxAge=336h&type=junit&name=upgrade%7Claunch&groupBy=job
The change is small and safe. It has to go to 4.4 first, but sure, I'll start the backport process and see how well it is received.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409