Cause:
Three separate issues are highlighted in this bug:
1. During an upgrade, multiple marketplace operators are running and updating the ClusterOperatorStatus at the same time, making it difficult to identify the actual state of the marketplace operator.
2. The marketplace operator is reporting degraded while it is upgrading. It reports degraded when the number of failed syncs surpasses some threshold of total syncs, and it currently counts a failed sync whenever an error is encountered while reconciling an operand (OperatorSources or CatalogSourceConfigs). This allows network issues or invalid operands to drive the marketplace operator to a degraded state (see the sketch after this list).
3. It is difficult to identify via telemetry why a ClusterOperatorStatus condition is in a given state. Marketplace currently does not set a reason when setting conditions, and telemetry only includes the condition's state and reason.
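For illustration, a minimal sketch of the threshold-based reporting described in item 2. The names (syncTracker, degradedThreshold, recordSync) and the threshold value are hypothetical, not the operator's actual code:

package status

// degradedThreshold is an illustrative ratio of failed syncs to total syncs
// above which the operator would report Degraded=True.
const degradedThreshold = 0.5

// syncTracker is a hypothetical counter of reconcile outcomes.
type syncTracker struct {
	totalSyncs  int
	failedSyncs int
}

// recordSync counts one reconcile attempt. With the current behavior, any
// reconcile error counts as a failed sync, so transient network issues or
// invalid operands can push the ratio over the threshold.
func (t *syncTracker) recordSync(err error) {
	t.totalSyncs++
	if err != nil {
		t.failedSyncs++
	}
}

// degraded reports whether the failure ratio has crossed the threshold.
func (t *syncTracker) degraded() bool {
	if t.totalSyncs == 0 {
		return false
	}
	return float64(t.failedSyncs)/float64(t.totalSyncs) > degradedThreshold
}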
Consequence:
The ClusterOperator status is not giving an accurate description of the health of the marketplace operator.
Fix:
Three changes need to be made to address this bug:
1. Implement leader election using the functions provided by the Operator-SDK to prevent multiple marketplace operators from updating the ClusterOperatorStatus at the same time (see the sketch after this list).
2. Rather than reporting a failed sync whenever an error is encountered while reconciling an operand, report a failed sync only when the marketplace operator is unable to get or update an existing operand (a sketch also follows this list).
3. Update marketplace to include, when setting a condition, a reason that indicates why the condition is being set.
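A minimal sketch of fix 1, assuming an operator-sdk based main function and the operator-sdk pkg/leader package; the lock name "marketplace-operator-lock" is an illustrative placeholder:

package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	ctx := context.TODO()

	// Acquire the leader lock before reporting any ClusterOperator status.
	// leader.Become blocks until this instance holds the lock, so during an
	// upgrade only one marketplace operator updates the status at a time.
	if err := leader.Become(ctx, "marketplace-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... start the controllers and status reporting only after leadership is held.
}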
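A minimal sketch of fix 2, where only get/update failures on existing operands count toward the failed-sync total; the opError type and its kinds are hypothetical, not the operator's actual error handling:

package status

// opErrorKind classifies where a reconcile error came from.
type opErrorKind int

const (
	errKindOther  opErrorKind = iota // e.g. invalid operand or transient network issue
	errKindGet                       // failed to get an existing operand
	errKindUpdate                    // failed to update an existing operand
)

// opError wraps a reconcile error with its classification.
type opError struct {
	kind opErrorKind
	err  error
}

func (e *opError) Error() string { return e.err.Error() }

// countsAsFailedSync implements the proposed rule: only failures to get or
// update an existing operand drive the failed-sync counter that feeds the
// Degraded condition; other reconcile errors are logged but not counted.
func countsAsFailedSync(err error) bool {
	oe, ok := err.(*opError)
	return ok && (oe.kind == errKindGet || oe.kind == errKindUpdate)
}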
Result:
Multiple instances of the marketplace operator will no longer update the ClusterOperator status at the same time.
Marketplace will only report that it has reached a degraded state when it is unable to get/update its operands.
Telemetry will provide better insights into the state of the marketplace operator.
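A minimal sketch of fix 3, assuming the ClusterOperatorStatusCondition type from github.com/openshift/api/config/v1; the reason strings and the degradedCondition helper are illustrative placeholders, not the operator's actual values:

package status

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// degradedCondition builds a Degraded condition that carries a machine-readable
// Reason alongside the human-readable Message. Because telemetry only exposes a
// condition's state and reason, the reason is what explains the state remotely.
func degradedCondition(degraded bool, message string) configv1.ClusterOperatorStatusCondition {
	status := configv1.ConditionFalse
	reason := "OperandsHealthy" // illustrative reason strings
	if degraded {
		status = configv1.ConditionTrue
		reason = "FailedSyncsAboveThreshold"
	}
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             status,
		Reason:             reason,
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}
}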
Description of problem:
During an upgrade from 4.1.1 to 4.1.2, it was observed that the marketplace operator entered the degraded state and ultimately exited. The operator eventually recovered, but the architecture team observing the upgrade requested that this be fixed.
The marketplace operator went through Available=False, Degraded=True, and even exited during the upgrade.
The precise reasons for these transitions are not known, but they are presumably reproducible. When the marketplace operator exited, its ClusterOperator resource reported:
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-06-04T21:18:30Z"
  generation: 1
  name: marketplace
  resourceVersion: "6683571"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/marketplace
  uid: 4d4b5959-870e-11e9-9662-0214201dd1d8
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-06-17T18:51:46Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-06-17T18:59:32Z"
    message: The operator has exited and is no longer reporting status.
    status: "False"
    type: Available
  - lastTransitionTime: "2019-06-04T21:18:30Z"
    status: "False"
    type: Degraded
  extension: null
  versions:
  - name: operator
    version: 4.1.2
Version-Release number of selected component (if applicable):
4.1.1->4.1.2
Steps to Reproduce:
1. Upgrade a 4.1 cluster to 4.1.2. Constantly monitor the state of marketplace clusteroperator conditions.
Created attachment 1582229: operator exit details
I've not seen degraded again yet, but here is output from an upgrade when the operator exited and stopped reporting status. Notice the termination due to the liveness probe. The operator stopped reporting status at least twice during the upgrade.
test env:
cv: 4.2.0-0.nightly-2019-07-15-054657 -> 4.2.0-0.nightly-2019-07-15-074808
No "message: The operator has exited and is no longer reporting status.", no restart of pod "marketplace-operator-XXXX", no "Degraded=True" during the twice updating.
Verify this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2922