Bug 1839239

Summary:	csi-snapshot-controller flickers Degraded=True on etcd hiccups
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Storage	Assignee:	Fabio Bertinatto <fbertina>
Storage sub component:	Operators	QA Contact:	Wei Duan <wduan>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	aos-bugs, fbertina, jsafrane
Version:	4.4	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-02-24 15:12:15 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-05-22 20:04:49 UTC

From a 4.3.19 -> 4.4.4 CI job [1,2]:

$ yaml2json <namespaces/openshift-cluster-storage-operator/core/events.yaml | jq -r '[.items[] | select(.reason == "OperatorStatusChanged") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
...
2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")
...

Ideally the Kube API does not flake, but the operator the operator should accept a second of Kube-API flake without going Degraded=True.  From the API docs [3]:

> Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service.  The period of time may vary by component, but a Degraded state represents persistent observation of a condition.  As a result, a component should not oscillate in and out of Degraded state.

library-go has inertial status controllers (e.g. [4]) to deliver that behavior.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/must-gather.tar
[3]: https://github.com/openshift/api/blob/17ada6e4245b4cc7a1f81ec04228e9093ec4f89e/config/v1/types_cluster_operator.go#L152-L156
[4]: https://github.com/openshift/library-go/blob/eeebfaa62843201e865e96a61eb961ee2c446775/pkg/operator/status/status_controller_test.go#L303-L313

Comment 2 Fabio Bertinatto 2020-08-11 13:14:17 UTC

> 2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
> 2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")

I'm not sure if these timestamps represent the transition between the operator conditions.

Looking at the raw events, it seems like the operator took some time to go from OK to Degraded.

Here it went Degraded:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:17:21Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:17:21Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded changed
>     from Unknown to False ("")'
>   metadata:https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/467324f0f0f48ff73ed606b64d0d618becfff584/pkg/operator/starter.go#L73
>     creationTimestamp: "2020-05-13T17:17:21Z"
>     name: csi-snapshot-controller-operator.160ea60104a989f7
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "36434"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60104a989f7
>     uid: b7df8331-a613-440d-9511-e318ef57f4fb
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

And then some time later it went OK:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:18:02Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:18:02Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded message
>     changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"'
>   metadata:
>     creationTimestamp: "2020-05-13T17:18:02Z"
>     name: csi-snapshot-controller-operator.160ea60a6a9c5d51
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "38022"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60a6a9c5d51
>     uid: d971115d-0e6d-429b-a96a-2a345d087575
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

Also, the operator does use the inertial controller: https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/467324f0f0f48ff73ed606b64d0d618becfff584/pkg/operator/starter.go#L73

Comment 3 Fabio Bertinatto 2020-08-11 13:15:36 UTC

Closing this for the reasons listed in #2, but feel free to re-open if you disagree.

Comment 4 W. Trevor King 2020-08-11 21:48:13 UTC

Checking the target version from the job I linked:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-csi-snapshot-controller-operator").annotations["io.openshift.build.commit.id"]'
4492120dbebb8dd283f1ba9caec67f9f605cda1c

And yes, that has the default 2m degraded inertia:

$ git --no-pager grep -n 'NewClusterOperatorStatusController\|  MustNewInertia' 4492120dbebb
4492120dbebb:pkg/operator/starter.go:73:	clusterOperatorStatus := status.NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:58:func NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:75:		degradedInertia:       MustNewInertia(2 * time.Minute).Inertia,

Timing on the update:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-05-13T17:03:15Z 2020-05-13T17:58:48Z 4.4.4 Completed false
2020-05-13T16:06:09Z 2020-05-13T16:44:00Z 4.3.19 Completed false

All the status-changed events:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-storage-operator" and .reason == "OperatorStatusChanged") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end | .timePrefix + " " + .metadata.namespace + " " + .message'
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded set to False (""),Progressing set to Unknown (""),Available set to Unknown (""),Upgradeable set to Unknown (""),status.relatedObjects changed from [] to [{"" "namespaces" "" "openshift-cluster-storage-operator"} {"" "namespaces" "" "openshift-cluster-storage-operator"} {"operator.openshift.io" "csisnapshotcontrollers" "" "cluster"}]
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded changed from Unknown to False ("")
2020-05-13T17:17:25Z - 2020-05-13T17:17:26Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: {"conditions":[{"type":"Degraded","status":"False","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"AsExpected"},{"type":"Progressing","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Available","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"}],"

A: relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}


B: versions":[{"name":"operator","version":"4.4.4"},{"name":"csi-snapshot-controller","version":"4.4.4"}],"relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}



2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from Unknown to True ("Progressing: Waiting for Deployment to act on changes"),Available changed from Unknown to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods"),Upgradeable changed from Unknown to True ("")
2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")
2020-05-13T17:18:12Z - 2020-05-13T17:18:12Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")
2020-05-13T17:26:35Z - 2020-05-13T17:46:40Z (4) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"),Available changed from True to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods")
2020-05-13T17:26:36Z - 2020-05-13T17:26:36Z (1) openshift-cluster-storage-operator (combined from similar events): Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("")
2020-05-13T17:41:16Z - 2020-05-13T17:46:40Z (3) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("")
2020-05-13T17:51:55Z - 2020-05-13T17:51:55Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes")
2020-05-13T17:51:56Z - 2020-05-13T17:51:56Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-05-13T17:51:59Z - 2020-05-13T17:51:59Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

So:

* 17:17:21Z Degraded set to False ("")
* 17:17:21Z Degraded changed from Unknown to False ("") (a race?  Why is this not covered by the previous event?)
* 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].
* 17:17:26Z Progressing changed from Unknown to True, Available changed from Unknown to False.
* 17:18:02Z Degraded message changed to mention etcdserver leader changed, but yeah, the status is still Degraded=False.
* 17:18:12Z Progressing changed to False
* 17:26:35Z through 17:51:59Z, several instances of Progressing going True again, and Available going False again.  Progessing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version).  Is that what was going on?

So you're right, the Degraded message seems to flicker, but its status does not, thanks to inertia.  I'm still concerned about the 'operator' version getting set too soon, and whatever is going on with the Available flickers.  Reopening to hash those out.

[1]: https://github.com/openshift/api/blob/87466835fcc055d95c3a2dc26a48a85a267129c6/config/v1/types_cluster_operator.go#L44-L45

Comment 5 Fabio Bertinatto 2020-10-07 12:13:22 UTC

> * 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].

The operator is indeed setting the versions too soon. Submitted a PR [0] to address that.

> * 17:26:35Z through 17:51:59Z, several instances of Progressing going True again, and Available going False again.
> Progessing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version).  Is that what was going on?

That's correct. As per operator source code, it only goes Available False if there's no available replicas at any version.
 
[0] https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/57

Comment 7 Wei Duan 2020-10-23 09:46:16 UTC

Verified with upgrade from 4.7.0-0.nightly-2020-10-17-034503 to 4.7.0-0.nightly-2020-10-22-175439

From the csi-snapshot-controller-operator log, in the same minute, Progressing changed to False and operator version changed.
I1023 07:05:21.813034       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
I1023 07:05:27.010593       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "operator" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.010614       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "csi-snapshot-controller" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.022611       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
I1023 07:05:27.033244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

From the event, in the same minute, Progressing changed to False and operator version changed.
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes")
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

Change status to Verified.

Comment 10 errata-xmlrpc 2021-02-24 15:12:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633