Bug 1814446

Summary:	failed to initialize the cluster: Cluster operator openshift-controller-manager is still updating (operator version not set)
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	openshift-controller-manager	Assignee:	Gabe Montero <gmontero>
Status:	CLOSED ERRATA	QA Contact:	wewang <wewang>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.3.z	CC:	adam.kaplan, aos-bugs, gmontero, mfojtik, pmuller
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	devex
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: timing windows existed where the openshift controller manager operator would set its progressing condition to false even though it had not registered its version Consequence: the openshift controller manager operator would not fully satisfy its contract with the install/upgrade and the install/upgrade would ultimately fail Fix: the timing window which allowed progressing to be set to false prior to setting the version was corrected via code change Result: the openshift controller manager operator now more readily reports both its version and its progressing condition upon successful install/upgrade	Story Points:	---
Clone Of:
Clones:	1852249 (view as bug list)		Environment:
Last Closed:	2020-07-13 17:22:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1852249

Description W. Trevor King 2020-03-17 21:59:27 UTC

A 4.3 GCP release-promotion job (4.3.0-0.nightly-2020-03-16-193416) failed with [1]:

level=fatal msg="failed to initialize the cluster: Cluster operator openshift-controller-manager is still updating"
2020/03/17 04:43:51 Container setup in pod e2e-gcp failed, exit code 1, reason Error 

That's not all that much detail, cf. the CVO complaining about that operator in 4.0's bug 1694216.  Looking at the ClusterOperator gathered post-test:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .reason' | sort
2020-03-17T04:05:16Z Degraded False AsExpected
2020-03-17T04:05:16Z Upgradeable Unknown NoData
2020-03-17T04:06:36Z Available True AsExpected
2020-03-17T04:15:00Z Progressing False AsExpected

So that all seems fine, and like things went happy well before the ~4:43 timeout.  Checking the manifests:

$ oc adm release extract --to=manifests registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-03-16-193416
Extracted release payload from digest sha256:7a3286b8fa474f5c52a6795bc56f9225eb2d343c775105878b10a6fad73c1f2f created at 2020-03-16T19:44:59Z
$ cat manifests/0000_50_cluster-openshift-controller-manager-operator_10_clusteroperator.yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: openshift-controller-manager
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
spec: {}
status:
  versions:
    - name: operator
      version: "4.3.0-0.nightly-2020-03-16-193416"

So the problem is that the operator is not setting its version:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status | keys[]'
conditions
extension
relatedObjects

Docs on the version requirement in [2] (which also apply to 4.3, but the docs weren't backported).

For comparison, the next job (also running 4.3.0-0.nightly-2020-03-16-193416) does set the operator version [3]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1153/artifacts/e2e-gcp/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "openshift-controller-manager").status.versions'
[
  {
    "name": "operator",
    "version": "4.3.0-0.nightly-2020-03-16-193416"
  }
]

Not clear to me why the operator version was not set in job 1152, but maybe the operator's pod logs will help [4,5]?

Flake seems rare, with job 1152 being the only instance of this setup-time failure in the past 24h [6].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152
[2]: https://github.com/openshift/cluster-version-operator/blob/release-4.4/docs/user/reconciliation.md#clusteroperator
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1153
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-6df5b984ff-mdwjq_operator_previous.log
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152/artifacts/e2e-gcp/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-6df5b984ff-mdwjq_operator.log
[6]: https://search.svc.ci.openshift.org/chart?search=failed%20to%20initialize%20the%20cluster:%20Cluster%20operator%20openshift-controller-manager%20is%20still%20updating

Comment 2 Petr Muller 2020-03-24 17:24:39 UTC

This seems to have caused two failures in a row in the release-openshift-origin-installer-e2e-gcp-serial-4.3 job:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1054#1:build-log.txt%3A31
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/1053#1:build-log.txt%3A31

Comment 3 Gabe Montero 2020-05-13 14:25:48 UTC

Ok I think I have a plausible theory on what happened after triaging https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.3/1152

1) there is a ton api server throttling in the OCM-O log throughout ... so odd timings are certainly conceivable
2) confirmed the version is not set
3) however, the last OCM-O operator during the run around operator status is setting progressging to false:

I0317 04:15:00.538222       1 status_controller.go:165] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-03-17T04:15:00Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2020-03-17T04:06:36Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-03-17T04:05:16Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}

4) so how does that happen?
5) so setting progressing to false happens here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L135-L139
6) and the setting of the version happens above here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L135-L139
7) but what happens if all the API churn means we our daemonset has not received the version annotation here https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117
8) it means we skip setting the versions, but set progressing to false afterward of the replica counts are correct

Certainly an edge case, but again plausible.  Quotes from Sherlock Holmes (with future citings by Spock) come to mind about "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." :-)

So we are going to add a message when and keep progressing true when https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L117 returns false

Comment 4 Gabe Montero 2020-05-13 14:53:26 UTC

forgot to mention an element of the theory:

6.5) something happens to the daemon set update done at https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/sync_openshiftcontrollermanager_v311_00.go#L359

Comment 7 wewang 2020-05-18 09:18:31 UTC

4.5 gcp serical job did not met the issue, so verify it.
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/1172

Comment 8 W. Trevor King 2020-06-29 21:24:41 UTC

Is this getting backported?  Came up again in 4.3.9 -> 4.3.27 CI [1,2].  On the other hand, seems like the only time that happened in CI in the past ~14d [3].

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1275691585048154112
[2]: https://github.com/openshift/cincinnati-graph-data/pull/298#issuecomment-651373653
[3]: https://search.svc.ci.openshift.org/?search=Cluster+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition%3A+Cluster+operator+openshift-controller-manager+is+still+updating&maxAge=336h&type=junit&name=upgrade%7Claunch&groupBy=job

Comment 9 Gabe Montero 2020-06-29 21:56:19 UTC

the change is small/safe .... it has to go to 4.4 first, but sure I'll start the backporting process and see how well it is received

Comment 11 errata-xmlrpc 2020-07-13 17:22:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409