Bug 1787660 - marketplace OperatorExited blocks update
Summary: marketplace OperatorExited blocks update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ben Luddy
QA Contact: Fan Jia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-03 21:58 UTC by W. Trevor King
Modified: 2020-05-04 11:22 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:22:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-marketplace pull 280 0 None closed Bug 1787660: Add CLI flag to set ClusterOperator name. 2020-12-11 03:51:00 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:22:40 UTC

Description W. Trevor King 2020-01-03 21:58:40 UTC
One 4.2.4 -> 4.2.13 Azure CI job failed with [1]:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator marketplace has not yet reported success
  ...
  failed: (1h21m18s) 2020-01-03T19:13:37 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

The marketplace ClusterOperator had [2]:

  - lastTransitionTime: 2020-01-03T18:22:54Z
    message: The operator has exited
    reason: OperatorExited
    status: "False"
    type: Available

The code for that OperatorExited is [3], and seems to be tied to graceful shutdowns.  There were some low-grade etcd latency issues due to Azure's slower disks:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/build-log.txt | grep -c 'etcdserver: request timed out'
  2

The marketplace-operator pod had restarted once at 2020-01-03T18:24:01Z [4].  The previous container exited with code 2 [4], but its logs don't say why:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/previous.log | tail -n1
  2020-01-03T18:23:40.5175711Z time="2020-01-03T18:23:40Z" level=info msg="Config API is available"

The current container makes a very strange claim about the ClusterOperator API:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/current.log | grep -1 status
  2020-01-03T18:25:07.5795305Z time="2020-01-03T18:25:07Z" level=info msg="[defaults] OperatorSource redhat-operators default and on cluster specs are same"
  2020-01-03T18:25:07.5797361Z time="2020-01-03T18:25:07Z" level=warning msg="[status] ClusterOperator API not present"
  2020-01-03T18:25:20.212361Z time="2020-01-03T18:25:20Z" level=warning msg="[status] ClusterOperator API not present"
  2020-01-03T18:25:20.2125821Z time="2020-01-03T18:25:20Z" level=info msg="[sync] Operator source sync loop will start after 1m0s"

despite the marketplace ClusterOperator itself having been created at 2020-01-03T17:35:20Z [2].  Maybe the operator hit an API hiccup (e.g. due to the high-latency etcd), freaked out, and never recovered (in the ~50m between 18:25 and 19:13)?  Or maybe the operator pod hung and we never noticed?  There are no health checks on the pod [4] and the last log of the currently-running container is from long before 19:13:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator/marketplace-operator/logs/current.log | tail -n1
  2020-01-03T18:26:20.2351066Z time="2020-01-03T18:26:20Z" level=info msg="[sync] received list of package(s) with new version, syncing CatalogSourceConfig object(s)"

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/cluster-scoped-resources/config.openshift.io/clusteroperators/marketplace.yaml
[3]: https://github.com/operator-framework/operator-marketplace/blob/2cc8e419ca45e2e15183227338a5010a1462166f/pkg/status/status.go#L325
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/10/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-335d7e63146bbd3bb903fb2fd1df042f450661e45be36cef89996badc29bbaf9/namespaces/openshift-marketplace/pods/marketplace-operator-7749fd48dd-8x5hb/marketplace-operator-7749fd48dd-8x5hb.yaml

Comment 1 W. Trevor King 2020-01-03 22:03:18 UTC
Looks like we get one of these every hour or so, and it happens on 4.3 [1] and 4.4 [2] too.  It's possible that OperatorExited is appropriate for the disruptive flow, but I don't understand the specifics of the marketplace operator.  I'd expect "operator exited" to result in an "operator down" alert, not in a change to its ClusterOperator status (because relying on the operator to live long enough to write "I'm going down" into its ClusterOperator seems finicky, and the operand may be completely fine despite brief operator outages).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/96
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/25

Comment 2 W. Trevor King 2020-01-03 22:03:58 UTC
> Looks like we get one of these every hour or so...

Meant to cite [1] for that.

[1] https://search.svc.ci.openshift.org/?search=clusteroperator%2Fmarketplace+changed+Available+to+False%3A+OperatorExited%3A+The+operator+has+exited&maxAge=336h&context=-1&type=all

Comment 3 Evan Cordell 2020-01-03 22:55:02 UTC
The root cause seems to be this: https://github.com/operator-framework/operator-marketplace/blob/372ab9f8ef14c0cfa45be231f3d0f0e54d81c97c/cmd/manager/main.go#L84

Once the marketplace operator starts, it checks once for the availability of the ClusterOperator API.  It doesn't continually reconcile the presence of the API or wait for it to become available (this is presumably intentional so that it runs fine outside of OpenShift).

We should either continually check for its presence, block until it's available, or just crash so the pod restarts.  A rough sketch of the blocking option is below.
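
This is only an illustrative sketch, not the actual operator-marketplace code (clusterOperatorAPIAvailable and waitForClusterOperatorAPI are made-up names): it uses client-go discovery to poll for the config.openshift.io/v1 group before starting status reporting, and crashes if the API never shows up:

  package main

  import (
      "fmt"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/client-go/discovery"
      "k8s.io/client-go/rest"
  )

  // clusterOperatorAPIAvailable reports whether config.openshift.io/v1 (which
  // serves ClusterOperator) is currently registered with the API server.
  func clusterOperatorAPIAvailable(cfg *rest.Config) (bool, error) {
      dc, err := discovery.NewDiscoveryClientForConfig(cfg)
      if err != nil {
          return false, err
      }
      resources, err := dc.ServerResourcesForGroupVersion("config.openshift.io/v1")
      if err != nil {
          // Treat discovery errors (including "group not found") as "not available yet".
          return false, nil
      }
      for _, r := range resources.APIResources {
          if r.Kind == "ClusterOperator" {
              return true, nil
          }
      }
      return false, nil
  }

  // waitForClusterOperatorAPI blocks until the API shows up or the timeout expires.
  func waitForClusterOperatorAPI(cfg *rest.Config, timeout time.Duration) error {
      return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
          return clusterOperatorAPIAvailable(cfg)
      })
  }

  func main() {
      cfg, err := rest.InClusterConfig()
      if err != nil {
          panic(err)
      }
      if err := waitForClusterOperatorAPI(cfg, 5*time.Minute); err != nil {
          // Crashing lets the kubelet restart the pod rather than silently skipping
          // ClusterOperator status reporting for the life of the process.
          panic(fmt.Errorf("ClusterOperator API never became available: %w", err))
      }
      // ... start controllers and status reporting ...
  }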

Comment 4 W. Trevor King 2020-01-04 00:28:06 UTC
> It doesn't continually reconcile the presence of the api, or wait for it to become available (this is presumably intentional so that it runs fine outside of openshift).

Instead of querying, you could have a command-line flag for --there-is-a-cluster-operator-api.  Set that flag in the OpenShift Deployment, and die if a ClusterOperator call fails.  Folks running outside of OpenShift would not set the flag, and the operator would skip the ClusterOperator updates.  The flag could also be --cluster-operator-name or some such, left unset outside of OpenShift, and set to marketplace in the OpenShift Deployment.
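
A minimal sketch of that flag-based approach (PR 280 ultimately added a CLI flag to set the ClusterOperator name; the exact flag name and the reportStatus placeholder below are illustrative, not the real code):

  package main

  import (
      "flag"
      "log"
  )

  // reportStatus stands in for whatever actually writes the Available/Progressing/
  // Degraded conditions; it is not part of the real codebase.
  func reportStatus(name string) error {
      log.Printf("would sync status for ClusterOperator %q", name)
      return nil
  }

  func main() {
      clusterOperatorName := flag.String("cluster-operator-name", "",
          "Name of the ClusterOperator to report status to; leave empty outside OpenShift.")
      flag.Parse()

      if *clusterOperatorName == "" {
          log.Println("no --cluster-operator-name given; skipping ClusterOperator status reporting")
      } else if err := reportStatus(*clusterOperatorName); err != nil {
          // Dying on failure lets the kubelet restart the pod, instead of the operator
          // quietly deciding the ClusterOperator API "is not present" for the rest of its life.
          log.Fatalf("failed to report status to ClusterOperator %q: %v", *clusterOperatorName, err)
      }

      // ... start the rest of the operator ...
  }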

Comment 11 errata-xmlrpc 2020-05-04 11:22:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

