Bug 1721537

Summary: Marketplace operator enters degraded & ultimately exits during upgrade
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: OLM Assignee: Alexander Greene <agreene>
OLM sub component: OperatorHub QA Contact: Fan Jia <jfan>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium    
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: There are three separate issues that are highlighted in this bug:
1. During an upgrade, multiple marketplace operators are running and updating the ClusterOperatorStatus at the same time, making it difficult to identify the actual state of the marketplace operator.
2. The marketplace operator is reporting degraded while it is upgrading. The marketplace operator reports degraded when the number of failed syncs surpasses some threshold of total syncs. The marketplace operator currently reports a failed sync whenever an error is encountered while reconciling an operand (OperatorSources or CatalogSourceConfigs). This allows network issues or invalid operands to drive the marketplace operator to a degraded state.
3. It is difficult to identify why a ClusterOperatorStatus condition is in a given state via telemetry. Marketplace currently does not set a reason when setting conditions and telemetry only includes the state and reason.

Consequence: The ClusterOperator status is not giving an accurate description of the health of the marketplace operator.

Fix: Three changes need to be made to address the bugzilla:
1. Implement leader election using the functions provided by the Operator-SDK to prevent multiple marketplace operators from updating the ClusterOperatorStatus at the same time.
2. Rather than reporting a failed sync whenever an error is encountered while reconciling an operand, report a failed sync when the marketplace operator is unable to get or update an existing operand.
3. Update marketplace to include a reason when setting a condition that indicates why the condition is being set.

Result: Multiple instances of marketplace will no longer update status at the same time. Marketplace will only report that it has reached a degraded state when it is unable to get/update its operands. Telemetry will provide better insights into the state of the marketplace operator.
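
The second and third fixes in the Doc Text can be sketched roughly as follows. This is a minimal Python illustration, not the marketplace operator's actual Go code: the class names, the 0.3 threshold, and the reason strings are all assumptions invented for the example (the bug only says "some threshold of total syncs").

```python
from dataclasses import dataclass
from typing import Optional


class OperandAccessError(Exception):
    """Hypothetical: the operator could not get or update an existing operand."""


class OperandReconcileError(Exception):
    """Hypothetical: any other reconcile error (invalid operand, network issue)."""


@dataclass
class Condition:
    type: str
    status: str
    reason: str = ""   # fix 3: every condition carries a reason
    message: str = ""


@dataclass
class SyncTracker:
    degraded_ratio: float = 0.3  # assumed value, not taken from the bug
    total: int = 0
    failed: int = 0

    def record(self, error: Optional[Exception] = None) -> None:
        """Count one sync; only get/update failures count as failed (fix 2)."""
        self.total += 1
        if isinstance(error, OperandAccessError):
            self.failed += 1

    def degraded_condition(self) -> Condition:
        """Build the Degraded condition from the failed/total sync ratio."""
        ratio = self.failed / self.total if self.total else 0.0
        if ratio > self.degraded_ratio:
            return Condition(
                "Degraded", "True", "OperandAccessFailed",
                f"{self.failed} of {self.total} syncs could not get or update operands",
            )
        return Condition(
            "Degraded", "False", "OperandsReconciled",
            f"{self.failed} of {self.total} syncs failed",
        )
```

Under these assumptions, ten syncs that all hit an invalid operand (`OperandReconcileError`) leave the condition at Degraded=False, whereas four `OperandAccessError` failures out of ten push it to Degraded=True with a reason that telemetry could report.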
Story Points: ---
Clone Of:
: 1740824 Environment:
Last Closed: 2019-10-16 06:32:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1740824    
Attachments:
Description Flags
operator exit details none

Description Justin Pierce 2019-06-18 13:47:44 UTC
Description of problem:
During an upgrade from 4.1.1 to 4.1.2, the marketplace operator was observed entering the degraded state and then exiting. The operator eventually recovered, but the architecture team observing the upgrade requested that this be fixed.

The marketplace went through Available=False, Degraded=True, and even exited during the upgrade.

The precise reason for these transitions is not known, but they are presumably reproducible. When the marketplace operator exited, its ClusterOperator resource reported:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-06-04T21:18:30Z"
  generation: 1
  name: marketplace
  resourceVersion: "6683571"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/marketplace
  uid: 4d4b5959-870e-11e9-9662-0214201dd1d8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-06-17T18:51:46Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-06-17T18:59:32Z"
    message: The operator has exited and is no longer reporting status.
    status: "False"
    type: Available
  - lastTransitionTime: "2019-06-04T21:18:30Z"
    status: "False"
    type: Degraded
  extension: null
  versions:
  - name: operator
    version: 4.1.2

Version-Release number of selected component (if applicable):
4.1.1->4.1.2

Steps to Reproduce:
1. Upgrade a 4.1 cluster to 4.1.2. Continuously monitor the conditions of the marketplace clusteroperator.
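
One way to do the monitoring in step 1 is a small polling script. The sketch below is only an illustration under assumptions: it assumes `oc` is on the PATH and logged in to the cluster, and the function names and the 5-second interval are made up for the example.

```python
import json
import subprocess
import time
from typing import Dict, List, Tuple


def condition_states(cluster_operator: dict) -> Dict[str, str]:
    """Map each condition type (Available, Degraded, ...) to its status."""
    conditions = (cluster_operator.get("status") or {}).get("conditions") or []
    return {c["type"]: c["status"] for c in conditions}


def changed(previous: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (type, old status, new status) for every condition that changed."""
    return [
        (ctype, previous.get(ctype, "Unknown"), status)
        for ctype, status in sorted(current.items())
        if previous.get(ctype) != status
    ]


def fetch(name: str = "marketplace") -> dict:
    """Read the ClusterOperator as JSON via the oc CLI."""
    result = subprocess.run(
        ["oc", "get", "clusteroperator", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def watch(interval_seconds: float = 5.0) -> None:
    """Poll forever and print each condition transition with a timestamp."""
    previous: Dict[str, str] = {}
    while True:
        try:
            current = condition_states(fetch())
        except (subprocess.CalledProcessError, json.JSONDecodeError) as err:
            # The API may be briefly unreachable during an upgrade; keep polling.
            print(f"{time.strftime('%H:%M:%S')} fetch failed: {err}")
        else:
            for ctype, old, new in changed(previous, current):
                print(f"{time.strftime('%H:%M:%S')} {ctype}: {old} -> {new}")
            previous = current
        time.sleep(interval_seconds)
```

Calling `watch()` in a terminal while the upgrade runs would log transitions such as Available going from True to False, which is the kind of change reported in this bug.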

Comment 3 Justin Pierce 2019-06-19 13:43:47 UTC
Created attachment 1582229
operator exit details

I have not seen Degraded again yet, but here is the output from an upgrade in which the operator exited and stopped reporting status. Note the termination due to the liveness probe. The operator stopped reporting status at least twice during the upgrade.

Comment 7 Alexander Greene 2019-07-12 19:45:41 UTC
The following PR has been opened to address this bug: https://github.com/operator-framework/operator-marketplace/pull/214

Comment 8 Fan Jia 2019-07-16 09:55:03 UTC
test env:
cv: 4.2.0-0.nightly-2019-07-15-054657 -> 4.2.0-0.nightly-2019-07-15-074808

During the two upgrades there was no "message: The operator has exited and is no longer reporting status.", no restart of the pod "marketplace-operator-XXXX", and no "Degraded=True".
Marking this bug as verified.

Comment 10 errata-xmlrpc 2019-10-16 06:32:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922