Bug 1839239 - csi-snapshot-controller flickers Degraded=True on etcd hiccups
Summary: csi-snapshot-controller flickers Degraded=True on etcd hiccups
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
: 4.7.0
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-05-22 20:04 UTC by W. Trevor King
Modified: 2021-02-24 15:13 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:12:15 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-csi-snapshot-controller-operator pull 57 0 None closed Bug 1839239: Update version only after having rolled out the new operand 2020-12-03 14:00:30 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:13:02 UTC

Description W. Trevor King 2020-05-22 20:04:49 UTC
From a 4.3.19 -> 4.4.4 CI job [1,2]:

$ yaml2json <namespaces/openshift-cluster-storage-operator/core/events.yaml | jq -r '[.items[] | select(.reason == "OperatorStatusChanged") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end] | sort_by(.timePrefix)[] | .timePrefix + " " + .metadata.namespace + " " + .message'
...
2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")
...

Ideally the Kube API does not flake, but the operator the operator should accept a second of Kube-API flake without going Degraded=True.  From the API docs [3]:

> Degraded indicates that the operator's current state does not match its desired state over a period of time resulting in a lower quality of service.  The period of time may vary by component, but a Degraded state represents persistent observation of a condition.  As a result, a component should not oscillate in and out of Degraded state.

library-go has inertial status controllers (e.g. [4]) to deliver that behavior.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/must-gather.tar
[3]: https://github.com/openshift/api/blob/17ada6e4245b4cc7a1f81ec04228e9093ec4f89e/config/v1/types_cluster_operator.go#L152-L156
[4]: https://github.com/openshift/library-go/blob/eeebfaa62843201e865e96a61eb961ee2c446775/pkg/operator/status/status_controller_test.go#L303-L313

Comment 2 Fabio Bertinatto 2020-08-11 13:14:17 UTC
> 2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
> 2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")

I'm not sure if these timestamps represent the transition between the operator conditions.

Looking at the raw events, it seems like the operator took some time to go from OK to Degraded.

Here it went Degraded:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:17:21Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:17:21Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded changed
>     from Unknown to False ("")'
>   metadata:https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/467324f0f0f48ff73ed606b64d0d618becfff584/pkg/operator/starter.go#L73
>     creationTimestamp: "2020-05-13T17:17:21Z"
>     name: csi-snapshot-controller-operator.160ea60104a989f7
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "36434"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60104a989f7
>     uid: b7df8331-a613-440d-9511-e318ef57f4fb
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

And then some time later it went OK:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:18:02Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:18:02Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded message
>     changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"'
>   metadata:
>     creationTimestamp: "2020-05-13T17:18:02Z"
>     name: csi-snapshot-controller-operator.160ea60a6a9c5d51
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "38022"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60a6a9c5d51
>     uid: d971115d-0e6d-429b-a96a-2a345d087575
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

Also, the operator does use the inertial controller: https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/467324f0f0f48ff73ed606b64d0d618becfff584/pkg/operator/starter.go#L73

Comment 3 Fabio Bertinatto 2020-08-11 13:15:36 UTC
Closing this for the reasons listed in #2, but feel free to re-open if you disagree.

Comment 4 W. Trevor King 2020-08-11 21:48:13 UTC
Checking the target version from the job I linked:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-csi-snapshot-controller-operator").annotations["io.openshift.build.commit.id"]'
4492120dbebb8dd283f1ba9caec67f9f605cda1c

And yes, that has the default 2m degraded inertia:

$ git --no-pager grep -n 'NewClusterOperatorStatusController\|  MustNewInertia' 4492120dbebb
4492120dbebb:pkg/operator/starter.go:73:	clusterOperatorStatus := status.NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:58:func NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:75:		degradedInertia:       MustNewInertia(2 * time.Minute).Inertia,

Timing on the update:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-05-13T17:03:15Z 2020-05-13T17:58:48Z 4.4.4 Completed false
2020-05-13T16:06:09Z 2020-05-13T16:44:00Z 4.3.19 Completed false

All the status-changed events:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-storage-operator" and .reason == "OperatorStatusChanged") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end | .timePrefix + " " + .metadata.namespace + " " + .message'
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded set to False (""),Progressing set to Unknown (""),Available set to Unknown (""),Upgradeable set to Unknown (""),status.relatedObjects changed from [] to [{"" "namespaces" "" "openshift-cluster-storage-operator"} {"" "namespaces" "" "openshift-cluster-storage-operator"} {"operator.openshift.io" "csisnapshotcontrollers" "" "cluster"}]
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded changed from Unknown to False ("")
2020-05-13T17:17:25Z - 2020-05-13T17:17:26Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: {"conditions":[{"type":"Degraded","status":"False","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"AsExpected"},{"type":"Progressing","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Available","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"}],"

A: relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}


B: versions":[{"name":"operator","version":"4.4.4"},{"name":"csi-snapshot-controller","version":"4.4.4"}],"relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}



2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from Unknown to True ("Progressing: Waiting for Deployment to act on changes"),Available changed from Unknown to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods"),Upgradeable changed from Unknown to True ("")
2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")
2020-05-13T17:18:12Z - 2020-05-13T17:18:12Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")
2020-05-13T17:26:35Z - 2020-05-13T17:46:40Z (4) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"),Available changed from True to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods")
2020-05-13T17:26:36Z - 2020-05-13T17:26:36Z (1) openshift-cluster-storage-operator (combined from similar events): Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("")
2020-05-13T17:41:16Z - 2020-05-13T17:46:40Z (3) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("")
2020-05-13T17:51:55Z - 2020-05-13T17:51:55Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes")
2020-05-13T17:51:56Z - 2020-05-13T17:51:56Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-05-13T17:51:59Z - 2020-05-13T17:51:59Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

So:

* 17:17:21Z Degraded set to False ("")
* 17:17:21Z Degraded changed from Unknown to False ("") (a race?  Why is this not covered by the previous event?)
* 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].
* 17:17:26Z Progressing changed from Unknown to True, Available changed from Unknown to False.
* 17:18:02Z Degraded message changed to mention etcdserver leader changed, but yeah, the status is still Degraded=False.
* 17:18:12Z Progressing changed to False
* 17:26:35Z through 17:51:59Z, several instances of Progressing going True again, and Available going False again.  Progessing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version).  Is that what was going on?

So you're right, the Degraded message seems to flicker, but its status does not, thanks to inertia.  I'm still concerned about the 'operator' version getting set too soon, and whatever is going on with the Available flickers.  Reopening to hash those out.

[1]: https://github.com/openshift/api/blob/87466835fcc055d95c3a2dc26a48a85a267129c6/config/v1/types_cluster_operator.go#L44-L45

Comment 5 Fabio Bertinatto 2020-10-07 12:13:22 UTC
> * 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].

The operator is indeed setting the versions too soon. Submitted a PR [0] to address that.

> * 17:26:35Z through 17:51:59Z, several instances of Progressing going True again, and Available going False again.
> Progessing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version).  Is that what was going on?

That's correct. As per operator source code, it only goes Available False if there's no available replicas at any version.
 
[0] https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/57

Comment 7 Wei Duan 2020-10-23 09:46:16 UTC
Verified with upgrade from 4.7.0-0.nightly-2020-10-17-034503 to 4.7.0-0.nightly-2020-10-22-175439

From the csi-snapshot-controller-operator log, in the same minute, Progressing changed to False and operator version changed.
I1023 07:05:21.813034       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
I1023 07:05:27.010593       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "operator" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.010614       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "csi-snapshot-controller" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.022611       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
I1023 07:05:27.033244       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

From the event, in the same minute, Progressing changed to False and operator version changed.
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes")
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

Change status to Verified.

Comment 10 errata-xmlrpc 2021-02-24 15:12:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.