Cause:
Three separate issues are highlighted in this bug:
1. During an upgrade, multiple marketplace operators are running and updating the ClusterOperatorStatus at the same time, making it difficult to identify the actual state of the marketplace operator.
2. The marketplace operator is reporting degraded while it is upgrading. It reports degraded when the number of failed syncs surpasses some threshold of total syncs, and it currently counts a failed sync whenever an error is encountered while reconciling an operand (OperatorSources or CatalogSourceConfigs). This allows network issues or invalid operands to drive the marketplace operator to a degraded state (see the sketch after this list).
3. It is difficult to identify via telemetry why a ClusterOperatorStatus condition is in a given state. Marketplace currently does not set a reason when setting conditions, and telemetry only includes the condition's state and reason.
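For illustration, a minimal sketch of the threshold-based reporting described in item 2. The names (syncTracker, degradedThreshold, recordSync) and the threshold value are hypothetical, not the operator's actual code:

package status

// degradedThreshold is an illustrative ratio of failed syncs to total syncs
// above which the operator would report Degraded=True.
const degradedThreshold = 0.5

// syncTracker is a hypothetical counter of reconcile outcomes.
type syncTracker struct {
	totalSyncs  int
	failedSyncs int
}

// recordSync counts one reconcile attempt. With the current behavior, any
// reconcile error counts as a failed sync, so transient network issues or
// invalid operands can push the ratio over the threshold.
func (t *syncTracker) recordSync(err error) {
	t.totalSyncs++
	if err != nil {
		t.failedSyncs++
	}
}

// degraded reports whether the failure ratio has crossed the threshold.
func (t *syncTracker) degraded() bool {
	if t.totalSyncs == 0 {
		return false
	}
	return float64(t.failedSyncs)/float64(t.totalSyncs) > degradedThreshold
}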
Consequence:
The ClusterOperator status is not giving an accurate description of the health of the marketplace operator.
Fix:
Three changes need to be made to address this bug:
1. Implement leader election using the functions provided by the Operator-SDK to prevent multiple marketplace operators from updating the ClusterOperatorStatus at the same time (see the sketch after this list).
2. Rather than reporting a failed sync whenever an error is encountered while reconciling an operand, report a failed sync only when the marketplace operator is unable to get or update an existing operand (a sketch also follows this list).
3. Update marketplace to include, when setting a condition, a reason that indicates why the condition is being set.
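A minimal sketch of fix 1, assuming an operator-sdk based main function and the operator-sdk pkg/leader package; the lock name "marketplace-operator-lock" is an illustrative placeholder:

package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	ctx := context.TODO()

	// Acquire the leader lock before reporting any ClusterOperator status.
	// leader.Become blocks until this instance holds the lock, so during an
	// upgrade only one marketplace operator updates the status at a time.
	if err := leader.Become(ctx, "marketplace-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... start the controllers and status reporting only after leadership is held.
}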
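A minimal sketch of fix 2, where only get/update failures on existing operands count toward the failed-sync total; the opError type and its kinds are hypothetical, not the operator's actual error handling:

package status

// opErrorKind classifies where a reconcile error came from.
type opErrorKind int

const (
	errKindOther  opErrorKind = iota // e.g. invalid operand or transient network issue
	errKindGet                       // failed to get an existing operand
	errKindUpdate                    // failed to update an existing operand
)

// opError wraps a reconcile error with its classification.
type opError struct {
	kind opErrorKind
	err  error
}

func (e *opError) Error() string { return e.err.Error() }

// countsAsFailedSync implements the proposed rule: only failures to get or
// update an existing operand drive the failed-sync counter that feeds the
// Degraded condition; other reconcile errors are logged but not counted.
func countsAsFailedSync(err error) bool {
	oe, ok := err.(*opError)
	return ok && (oe.kind == errKindGet || oe.kind == errKindUpdate)
}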
Result:
Multiple instances of the marketplace operator will no longer update the ClusterOperator status at the same time.
Marketplace will only report that it has reached a degraded state when it is unable to get/update its operands.
Telemetry will provide better insights into the state of the marketplace operator.
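A minimal sketch of fix 3, assuming the ClusterOperatorStatusCondition type from github.com/openshift/api/config/v1; the reason strings and the degradedCondition helper are illustrative placeholders, not the operator's actual values:

package status

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// degradedCondition builds a Degraded condition that carries a machine-readable
// Reason alongside the human-readable Message. Because telemetry only exposes a
// condition's state and reason, the reason is what explains the state remotely.
func degradedCondition(degraded bool, message string) configv1.ClusterOperatorStatusCondition {
	status := configv1.ConditionFalse
	reason := "OperandsHealthy" // illustrative reason strings
	if degraded {
		status = configv1.ConditionTrue
		reason = "FailedSyncsAboveThreshold"
	}
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             status,
		Reason:             reason,
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}
}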
Description of problem:
During an upgrade from 4.1.1 to 4.1.2, it was observed that the marketplace operator entered the degraded state and ultimately exited. The operator eventually recovered, but the architecture team observing the upgrade requested that this be fixed.
The marketplace operator went through Available=False, Degraded=True, and even exited during the upgrade.
The precise reasons for these transitions are not known, but they are presumably reproducible. When the marketplace operator exited, its ClusterOperator resource reported:
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-06-04T21:18:30Z"
  generation: 1
  name: marketplace
  resourceVersion: "6683571"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/marketplace
  uid: 4d4b5959-870e-11e9-9662-0214201dd1d8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-06-17T18:51:46Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-06-17T18:59:32Z"
    message: The operator has exited and is no longer reporting status.
    status: "False"
    type: Available
  - lastTransitionTime: "2019-06-04T21:18:30Z"
    status: "False"
    type: Degraded
  extension: null
  versions:
  - name: operator
    version: 4.1.2
Version-Release number of selected component (if applicable):
4.1.1->4.1.2
Steps to Reproduce:
1. Upgrade a 4.1 cluster to 4.1.2. Constantly monitor the state of marketplace clusteroperator conditions.
Created attachment 1582229: operator exit details
I've not seen degraded again yet, but here is output from an upgrade when the operator exited and stopped reporting status. Notice the termination due to the liveness probe. The operator stopped reporting status at least twice during the upgrade.
test env:
cv: 4.2.0-0.nightly-2019-07-15-054657 -> 4.2.0-0.nightly-2019-07-15-074808
No "message: The operator has exited and is no longer reporting status.", no restart of pod "marketplace-operator-XXXX", no "Degraded=True" during the twice updating.
Verify this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2922