One 4.2.4 -> 4.2.13 Azure CI job failed with [1]:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator marketplace has not yet reported success
  ...
  failed: (1h21m18s) 2020-01-03T19:13:37 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

The marketplace ClusterOperator had [2]:

  - lastTransitionTime: 2020-01-03T18:22:54Z
    message: The operator has exited
    reason: OperatorExited
    status: "False"
    type: Available

The code setting that OperatorExited is [3], and it seems to be tied to graceful shutdowns. There were some low-grade etcd latency issues due to Azure's slower disks:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/build-log.txt | grep -c 'etcdserver: request timed out'
  2

The marketplace-operator pod had restarted once, at 2020-01-03T18:24:01Z [4]. The previous container exited with code 2 [4], but its logs don't say why:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/previous.log | tail -n1
  2020-01-03T18:23:40.5175711Z time="2020-01-03T18:23:40Z" level=info msg="Config API is available"

The current container makes a very strange claim about the ClusterOperator API:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/current.log | grep -1 status
  2020-01-03T18:25:07.5795305Z time="2020-01-03T18:25:07Z" level=info msg="[defaults] OperatorSource redhat-operators default and on cluster specs are same"
  2020-01-03T18:25:07.5797361Z time="2020-01-03T18:25:07Z" level=warning msg="[status] ClusterOperator API not present"
  2020-01-03T18:25:20.212361Z time="2020-01-03T18:25:20Z" level=warning msg="[status] ClusterOperator API not present"
  2020-01-03T18:25:20.2125821Z time="2020-01-03T18:25:20Z" level=info msg="[sync] Operator source sync loop will start after 1m0s"

despite the marketplace ClusterOperator itself having been created at 2020-01-03T17:35:20Z [2]. Maybe the operator hit an API hiccup (e.g. due to the high-latency etcd), freaked out, and never recovered (in the ~50m between 18:25 and 19:13)? Or maybe the operator pod hung and we never noticed?
There are no health checks on the pod [4], and the last log from the currently-running container is from long before 19:13:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/current.log | tail -n1
  2020-01-03T18:26:20.2351066Z time="2020-01-03T18:26:20Z" level=info msg="[sync] received list of package(s) with new version, syncing CatalogSourceConfig object(s)"

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/cluster-scoped-resources/config.openshift.io/clusteroperators/marketplace.yaml
[3]: https://github.com/operator-framework/operator-marketplace/blob/2cc8e419ca45e2e15183227338a5010a1462166f/pkg/status/status.go#L325
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator-7749fd48dd-8x5hb.yaml
Looks like we get one of these every hour or so, and it happens on 4.3 [1] and 4.4 [2] too. Although it's possible that OperatorExited is appropriate for the disruptive flow?

I don't understand the specifics of the marketplace operator, but I'd expect "operator exited" to result in an "operator down" alert, not in a change to its ClusterOperator status (because relying on the operator to live long enough to write "I'm going down" into its ClusterOperator seems finicky, and the operand may be completely fine despite brief operator outages).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/96
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/25
> Looks like we get one of these every hour or so...

Meant to cite [1] for that.

[1]: https://search.svc.ci.openshift.org/?search=clusteroperator%2Fmarketplace+changed+Available+to+False%3A+OperatorExited%3A+The+operator+has+exited&maxAge=336h&context=-1&type=all
The root cause seems to be this:

https://github.com/operator-framework/operator-marketplace/blob/372ab9f8ef14c0cfa45be231f3d0f0e54d81c97c/cmd/manager/main.go#L84

Once the marketplace operator starts, it checks once for the availability of the ClusterOperator API. It doesn't continually reconcile the presence of the API, or wait for it to become available (this is presumably intentional, so that it runs fine outside of OpenShift). We should continually check for the presence, or block until it's available, or just crash so the pod restarts.
> It doesn't continually reconcile the presence of the api, or wait for it to become available (this is presumably intentional so that it runs fine outside of openshift).

Instead of querying, you could have a command-line flag for --there-is-a-cluster-operator-api. Set that flag in the OpenShift Deployment, and die if a ClusterOperator call fails. Folks running outside of OpenShift would not set the flag, and the operator would skip the ClusterOperator updates.

The flag could also be --cluster-operator-name or some such, left unset outside of OpenShift, and set to marketplace in the OpenShift Deployment.
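A sketch of that flag approach, with the proposed --cluster-operator-name flag gating status reporting (the flag and all function names are hypothetical, not the operator's real command line):

```go
package main

import (
	"flag"
	"fmt"
	"log"
)

// reportStatus stands in for a real ClusterOperator status write; it
// fails when the API is absent.
func reportStatus(name string, apiPresent bool) error {
	if !apiPresent {
		return fmt.Errorf("ClusterOperator API not present, cannot update %q", name)
	}
	fmt.Printf("updated ClusterOperator %s\n", name)
	return nil
}

// maybeReportStatus implements the proposed policy: with the flag
// unset (installs outside OpenShift), skip ClusterOperator handling
// entirely; with it set, treat a failed write as fatal so the pod
// restarts instead of silently running with stale status.
func maybeReportStatus(name string, apiPresent bool) (reported bool, err error) {
	if name == "" {
		return false, nil // flag unset: no ClusterOperator updates at all
	}
	if err := reportStatus(name, apiPresent); err != nil {
		return false, err // caller should exit non-zero on this
	}
	return true, nil
}

func main() {
	name := flag.String("cluster-operator-name", "",
		"if set (e.g. to marketplace in the OpenShift Deployment), report ClusterOperator status and die if a status write fails")
	flag.Parse()

	reported, err := maybeReportStatus(*name, true)
	if err != nil {
		log.Fatal(err) // crash so the kubelet restarts the pod
	}
	fmt.Println("status reported:", reported)
}
```

This keeps the non-OpenShift path free of any ClusterOperator queries while making the OpenShift path fail loudly instead of logging "[status] ClusterOperator API not present" forever.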
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581