Bug 1721537 - Marketplace operator enters degraded & ultimately exits during upgrade
Summary: Marketplace operator enters degraded & ultimately exits during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alexander Greene
QA Contact: Fan Jia
URL:
Whiteboard:
Depends On:
Blocks: 1740824
Reported: 2019-06-18 13:47 UTC by Justin Pierce
Modified: 2019-10-16 06:32 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: There are three separate issues that are highlighted in this bug:
1. During an upgrade, multiple marketplace operators are running and updating the ClusterOperatorStatus at the same time, making it difficult to identify the actual state of the marketplace operator.
2. The marketplace operator is reporting degraded while it is upgrading. The marketplace operator reports degraded when the number of failed syncs surpasses some threshold of total syncs. The marketplace operator currently reports a failed sync whenever an error is encountered while reconciling an operand (OperatorSources or CatalogSourceConfigs). This allows network issues or invalid operands to drive the marketplace operator to a degraded state.
3. It is difficult to identify why a ClusterOperatorStatus condition is in a given state via telemetry. Marketplace currently does not set a reason when setting conditions and telemetry only includes the state and reason.

Consequence: The ClusterOperator status is not giving an accurate description of the health of the marketplace operator.

Fix: Three changes need to be made to address the bugzilla:
1. Implement leader election using the functions provided by the Operator-SDK to prevent multiple marketplace operators from updating the ClusterOperatorStatus at the same time.
2. Rather than reporting a failed sync whenever an error is encountered while reconciling an operand, report a failed sync when the marketplace operator is unable to get or update an existing operand.
3. Update marketplace to include a reason when setting a condition that indicates why the condition is being set.

Result: Multiple instances of marketplace will no longer update status at the same time. Marketplace will only report that it has reached a degraded state when it is unable to get/update its operands. Telemetry will provide better insights into the state of the marketplace operator.
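
The second and third fixes above can be illustrated with a minimal Python sketch. This is not the marketplace operator's code (the operator is written in Go); the class names, the reason strings, and the 0.3 ratio are all hypothetical, since the Doc Text only says "some threshold". The sketch counts a sync as failed only when an operand could not be fetched or updated, and attaches a reason to the resulting Degraded condition.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the bug report only mentions "some threshold".
FAILED_SYNC_RATIO_THRESHOLD = 0.3


class OperandAccessError(Exception):
    """Illustrative error: an existing operand could not be fetched or updated."""


@dataclass
class SyncTracker:
    total: int = 0
    failed: int = 0

    def record(self, error: Optional[Exception]) -> None:
        """Record one reconcile attempt.

        Only operand get/update failures count as failed syncs. Other
        errors (an invalid operand, an unreachable external registry)
        are still syncs, but they do not push the operator to degraded.
        """
        self.total += 1
        if isinstance(error, OperandAccessError):
            self.failed += 1

    def degraded_condition(self) -> dict:
        """Build a Degraded condition that always carries a reason."""
        ratio = self.failed / self.total if self.total else 0.0
        if ratio > FAILED_SYNC_RATIO_THRESHOLD:
            return {
                "type": "Degraded",
                "status": "True",
                "reason": "OperandAccessFailing",  # hypothetical reason string
                "message": f"{self.failed} of {self.total} syncs could not get or update an operand",
            }
        return {
            "type": "Degraded",
            "status": "False",
            "reason": "OperandsReachable",  # hypothetical reason string
            "message": "operands can be fetched and updated",
        }
```

With this accounting, a burst of invalid operands or registry timeouts during an upgrade leaves the condition at Degraded=False, while repeated failures to get or update operands flip it to True with a reason that telemetry can report.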
Clone Of:
Clones: 1740824
Environment:
Last Closed: 2019-10-16 06:32:02 UTC
Target Upstream Version:
Embargoed:


Attachments
operator exit details (36.98 KB, text/plain)
2019-06-19 13:43 UTC, Justin Pierce


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-marketplace pull 214 | 0 | 'None' | closed | Bug 1721537: Fix degraded status on upgrade | 2020-12-18 15:11:56 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:32:16 UTC

Description Justin Pierce 2019-06-18 13:47:44 UTC
Description of problem:
During an upgrade from 4.1.1 to 4.1.2, the marketplace operator was observed entering the degraded state and then exiting. The operator eventually recovered, but the architecture team observing the upgrade requested that this be fixed.

The marketplace went through Available=False, Degraded=True, and even exited during the upgrade.

The precise reason for these transitions is not known, but they are presumably reproducible. When the marketplace operator exited, its ClusterOperator resource reported:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-06-04T21:18:30Z"
  generation: 1
  name: marketplace
  resourceVersion: "6683571"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/marketplace
  uid: 4d4b5959-870e-11e9-9662-0214201dd1d8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-06-17T18:51:46Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-06-17T18:59:32Z"
    message: The operator has exited and is no longer reporting status.
    status: "False"
    type: Available
  - lastTransitionTime: "2019-06-04T21:18:30Z"
    status: "False"
    type: Degraded
  extension: null
  versions:
  - name: operator
    version: 4.1.2

Version-Release number of selected component (if applicable):
4.1.1->4.1.2

Steps to Reproduce:
1. Upgrade a 4.1 cluster to 4.1.2. Constantly monitor the state of marketplace clusteroperator conditions.
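
One way to do the monitoring in step 1 is sketched below in Python. It assumes the `oc` CLI is on the PATH and logged in to the cluster; the function names and the 5-second polling interval are illustrative choices, not part of the original report. The script polls the marketplace ClusterOperator and prints a line each time a condition changes status.

```python
import json
import subprocess
import time


def summarize_conditions(cluster_operator: dict) -> dict:
    """Map each condition type to its status, e.g. {"Available": "False"}."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] for c in conditions}


def changed_conditions(previous: dict, current: dict) -> dict:
    """Return only the conditions whose status differs from the previous poll."""
    return {t: s for t, s in current.items() if previous.get(t) != s}


def fetch_marketplace() -> dict:
    """Read the marketplace ClusterOperator as JSON via the oc CLI."""
    result = subprocess.run(
        ["oc", "get", "clusteroperator", "marketplace", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def watch(interval_seconds: float = 5.0) -> None:
    """Poll until interrupted, logging every condition transition."""
    previous: dict = {}
    while True:
        try:
            current = summarize_conditions(fetch_marketplace())
        except subprocess.CalledProcessError as err:
            # The API server can be briefly unavailable during an upgrade.
            print(f"{time.strftime('%H:%M:%S')} oc failed: {err.stderr.strip()}")
        else:
            for cond_type, status in changed_conditions(previous, current).items():
                print(f"{time.strftime('%H:%M:%S')} {cond_type}={status}")
            previous = current
        time.sleep(interval_seconds)
```

Calling `watch()` before starting the upgrade and leaving it running should surface transitions such as Available=False or Degraded=True as they happen.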

Comment 3 Justin Pierce 2019-06-19 13:43:47 UTC
Created attachment 1582229 [details]
operator exit details

I've not seen degraded again yet, but here is output from an upgrade when the operator exited and stopped reporting status. Notice the termination due to the liveness probe. The operator stopped reporting status at least twice during the upgrade.

Comment 7 Alexander Greene 2019-07-12 19:45:41 UTC
The following PR has been opened to address this bug: https://github.com/operator-framework/operator-marketplace/pull/214

Comment 8 Fan Jia 2019-07-16 09:55:03 UTC
test env:
cv: 4.2.0-0.nightly-2019-07-15-054657 -> 4.2.0-0.nightly-2019-07-15-074808

Across the two upgrade runs there was no "message: The operator has exited and is no longer reporting status.", no restart of the "marketplace-operator-XXXX" pod, and no "Degraded=True".
Marking this bug as verified.

Comment 10 errata-xmlrpc 2019-10-16 06:32:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

