Bug 1839239
Summary: | csi-snapshot-controller flickers Degraded=True on etcd hiccups | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking>
Component: | Storage | Assignee: | Fabio Bertinatto <fbertina>
Storage sub component: | Operators | QA Contact: | Wei Duan <wduan>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | low | |
Priority: | unspecified | CC: | aos-bugs, fbertina, jsafrane
Version: | 4.4 | Keywords: | Reopened
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-02-24 15:12:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
W. Trevor King
2020-05-22 20:04:49 UTC
> 2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"
> 2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("")

I'm not sure whether these timestamps represent the transition between the operator conditions. Looking at the raw events, it seems the operator took some time to go from OK to Degraded. Here is where it first reported Degraded=False:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:17:21Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:17:21Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded changed
>     from Unknown to False ("")'
>   metadata:
>     creationTimestamp: "2020-05-13T17:17:21Z"
>     name: csi-snapshot-controller-operator.160ea60104a989f7
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "36434"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60104a989f7
>     uid: b7df8331-a613-440d-9511-e318ef57f4fb
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

And then some time later the Degraded message changed:

> - apiVersion: v1
>   count: 1
>   eventTime: null
>   firstTimestamp: "2020-05-13T17:18:02Z"
>   involvedObject:
>     apiVersion: apps/v1
>     kind: Deployment
>     name: csi-snapshot-controller-operator
>     namespace: openshift-cluster-storage-operator
>     uid: 99d9453c-fddc-476a-a0cd-541bae0b3f2c
>   kind: Event
>   lastTimestamp: "2020-05-13T17:18:02Z"
>   message: 'Status for clusteroperator/csi-snapshot-controller changed: Degraded message
>     changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed"'
>   metadata:
>     creationTimestamp: "2020-05-13T17:18:02Z"
>     name: csi-snapshot-controller-operator.160ea60a6a9c5d51
>     namespace: openshift-cluster-storage-operator
>     resourceVersion: "38022"
>     selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/events/csi-snapshot-controller-operator.160ea60a6a9c5d51
>     uid: d971115d-0e6d-429b-a96a-2a345d087575
>   reason: OperatorStatusChanged
>   reportingComponent: ""
>   reportingInstance: ""
>   source:
>     component: csi-snapshot-controller-operator-status-controller
>   type: Normal

Also, the operator does use the inertial controller:
https://github.com/openshift/cluster-csi-snapshot-controller-operator/blob/467324f0f0f48ff73ed606b64d0d618becfff584/pkg/operator/starter.go#L73

Closing this for the reasons listed in #2, but feel free to re-open if you disagree.
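To make the inertia behavior concrete, here is a minimal self-contained Go sketch of the idea (an illustrative reimplementation, not the library-go code; the type and method names are invented for this example): the Degraded message can change as soon as a sync fails, but the Degraded status only flips to True once the failure has persisted past a grace window, so a one-second etcd leader election never surfaces as Degraded=True.

// Sketch of the "inertia" idea used by library-go's status controller: a
// Degraded=True report is held back until the underlying failure has
// persisted longer than a grace window (2m by default per the grep below),
// so transient etcd hiccups never flip the ClusterOperator condition.
// Illustrative only; not the library-go implementation.
package main

import (
	"fmt"
	"time"
)

type inertialDegraded struct {
	window     time.Duration // how long a failure must persist before reporting
	badSince   time.Time     // zero while the operand is currently healthy
	lastReason string
}

// Observe records the current sync result and returns the Degraded status
// that should be reported right now.
func (i *inertialDegraded) Observe(now time.Time, syncErr error) (degraded bool, message string) {
	if syncErr == nil {
		i.badSince = time.Time{} // healthy: reset the clock
		return false, ""
	}
	if i.badSince.IsZero() {
		i.badSince = now // first failure: start the clock
	}
	i.lastReason = syncErr.Error()
	// The message can change immediately (the "flicker" seen in this bug),
	// but the status only flips after the window elapses.
	if now.Sub(i.badSince) < i.window {
		return false, "Degraded: " + i.lastReason
	}
	return true, "Degraded: " + i.lastReason
}

func main() {
	d := &inertialDegraded{window: 2 * time.Minute}
	t0 := time.Now()
	deg, msg := d.Observe(t0, fmt.Errorf("failed to sync CRDs: etcdserver: leader changed"))
	fmt.Println(deg, msg) // false "Degraded: ..." — message set, status still False
	deg, _ = d.Observe(t0.Add(time.Second), nil)
	fmt.Println(deg) // false — the hiccup cleared before the window elapsed
}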
Checking the target version from the job I linked:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-csi-snapshot-controller-operator").annotations["io.openshift.build.commit.id"]'
4492120dbebb8dd283f1ba9caec67f9f605cda1c

And yes, that commit has the default 2m degraded inertia:

$ git --no-pager grep -n 'NewClusterOperatorStatusController\| MustNewInertia' 4492120dbebb
4492120dbebb:pkg/operator/starter.go:73: clusterOperatorStatus := status.NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:58:func NewClusterOperatorStatusController(
4492120dbebb:vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go:75: degradedInertia: MustNewInertia(2 * time.Minute).Inertia,

Timing on the update:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-05-13T17:03:15Z 2020-05-13T17:58:48Z 4.4.4 Completed false
2020-05-13T16:06:09Z 2020-05-13T16:44:00Z 4.3.19 Completed false

All the status-changed events:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/388/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-storage-operator" and .reason == "OperatorStatusChanged") | .timePrefix = if .firstTimestamp == null or .firstTimestamp == "null" then .eventTime else .firstTimestamp + " - " + .lastTimestamp + " (" + (.count | tostring) + ")" end | .timePrefix + " " + .metadata.namespace + " " + .message'
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded set to False (""),Progressing set to Unknown (""),Available set to Unknown (""),Upgradeable set to Unknown (""),status.relatedObjects changed from [] to [{"" "namespaces" "" "openshift-cluster-storage-operator"} {"" "namespaces" "" "openshift-cluster-storage-operator"} {"operator.openshift.io" "csisnapshotcontrollers" "" "cluster"}]
2020-05-13T17:17:21Z - 2020-05-13T17:17:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded changed from Unknown to False ("")
2020-05-13T17:17:25Z - 2020-05-13T17:17:26Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: {"conditions":[{"type":"Degraded","status":"False","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"AsExpected"},{"type":"Progressing","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Available","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2020-05-13T17:17:21Z","reason":"NoData"}],"
  A: relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}
  B: versions":[{"name":"operator","version":"4.4.4"},{"name":"csi-snapshot-controller","version":"4.4.4"}],"relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null}
versions":[{"name":"operator","version":"4.4.4"},{"name":"csi-snapshot-controller","version":"4.4.4"}],"relatedObjects":[{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"","resource":"namespaces","name":"openshift-cluster-storage-operator"},{"group":"operator.openshift.io","resource":"csisnapshotcontrollers","name":"cluster"}],"extension":null} 2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from Unknown to True ("Progressing: Waiting for Deployment to act on changes"),Available changed from Unknown to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods"),Upgradeable changed from Unknown to True ("") 2020-05-13T17:17:26Z - 2020-05-13T17:17:26Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods" 2020-05-13T17:18:02Z - 2020-05-13T17:18:02Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "" to "Degraded: failed to sync CRDs: etcdserver: leader changed" 2020-05-13T17:18:03Z - 2020-05-13T17:18:03Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Degraded message changed from "Degraded: failed to sync CRDs: etcdserver: leader changed" to "",Available changed from False to True ("") 2020-05-13T17:18:12Z - 2020-05-13T17:18:12Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("") 2020-05-13T17:26:35Z - 2020-05-13T17:46:40Z (4) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"),Available changed from True to False ("Available: Waiting for Deployment to deploy csi-snapshot-controller pods") 2020-05-13T17:26:36Z - 2020-05-13T17:26:36Z (1) openshift-cluster-storage-operator (combined from similar events): Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("") 2020-05-13T17:41:16Z - 2020-05-13T17:46:40Z (3) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False (""),Available changed from False to True ("") 2020-05-13T17:51:55Z - 2020-05-13T17:51:55Z (2) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes") 2020-05-13T17:51:56Z - 2020-05-13T17:51:56Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods" 2020-05-13T17:51:59Z - 2020-05-13T17:51:59Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("") So: * 17:17:21Z Degraded set to False ("") * 17:17:21Z Degraded changed from Unknown to False ("") (a race? 
* 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].
* 17:17:26Z Progressing changed from Unknown to True, Available changed from Unknown to False.
* 17:18:02Z Degraded message changed to mention "etcdserver: leader changed", but yes, the status is still Degraded=False.
* 17:18:12Z Progressing changed to False.
* 17:26:35Z through 17:51:59Z, several instances of Progressing going True again and Available going False again. Progressing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version). Is that what was going on?

So you're right, the Degraded message seems to flicker, but its status does not, thanks to inertia. I'm still concerned about the 'operator' version getting set too soon, and about whatever is going on with the Available flickers. Reopening to hash those out.

[1]: https://github.com/openshift/api/blob/87466835fcc055d95c3a2dc26a48a85a267129c6/config/v1/types_cluster_operator.go#L44-L45

> * 17:17:25Z and 17:17:26Z, two events setting the operator version to 4.4.4, which is supposed to be reserved for leveling at the target version after the operator and all its operands have completed the update [1].

The operator is indeed setting the versions too soon. Submitted a PR [0] to address that.

> * 17:26:35Z through 17:51:59Z, several instances of Progressing going True again and Available going False again.
> Progressing=True is ok, but it's not clear to me why we'd want Available=False unless we had very serious csi-snapshot-controller Deployment issues (e.g. no available pods at any version). Is that what was going on?

That's correct. Per the operator source code, it only goes Available=False if there are no available replicas at any version.

[0]: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/57
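To make the two rules above concrete, here is an illustrative Go sketch. The helper names (deploymentLeveled, syncVersions, isAvailable) are invented for this example, and the leveling check is a common Deployment-rollout idiom; this is not the actual diff from PR [0], which is not reproduced in this bug. Only the general behavior described in the comments above is taken as given.

// Illustrative sketch: report new versions only after the operand
// Deployment levels, and report Available=False only when no replicas
// are available at any version. Uses the real k8s.io/api/apps/v1 types;
// everything else is hypothetical.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// deploymentLeveled is a common way to decide a Deployment has finished
// rolling out: the controller has observed the latest spec, every replica
// is updated, and all of them are available.
func deploymentLeveled(d *appsv1.Deployment) bool {
	if d.Spec.Replicas == nil {
		return false
	}
	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == *d.Spec.Replicas &&
		d.Status.AvailableReplicas == *d.Spec.Replicas
}

// syncVersions (hypothetical) reports the new operator/operand versions
// only after the operand Deployment has leveled; this is what keeps
// status.versions from being bumped at the start of an upgrade.
func syncVersions(d *appsv1.Deployment, targetVersion string, setVersion func(name, version string)) {
	if !deploymentLeveled(d) {
		return // still rolling out: keep reporting the old versions
	}
	setVersion("operator", targetVersion)
	setVersion("csi-snapshot-controller", targetVersion)
}

// isAvailable mirrors the rule described above: Available=False only when
// the Deployment has no available replicas at any version.
func isAvailable(d *appsv1.Deployment) bool {
	return d.Status.AvailableReplicas > 0
}

func main() {
	replicas := int32(1)
	d := &appsv1.Deployment{}
	d.Generation = 1
	d.Spec.Replicas = &replicas
	d.Status.ObservedGeneration = 1
	d.Status.UpdatedReplicas = 1
	d.Status.AvailableReplicas = 1
	syncVersions(d, "4.7.0", func(name, version string) {
		fmt.Printf("SetVersion(%q, %q)\n", name, version)
	})
	fmt.Println("available:", isAvailable(d))
}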
Verified with upgrade from 4.7.0-0.nightly-2020-10-17-034503 to 4.7.0-0.nightly-2020-10-22-175439.

From the csi-snapshot-controller-operator log, Progressing changed to False and the operator version changed within the same minute:

I1023 07:05:21.813034 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
I1023 07:05:27.010593 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "operator" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.010614 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/csi-snapshot-controller version "csi-snapshot-controller" changed from "4.7.0-0.nightly-2020-10-17-034503" to "4.7.0-0.nightly-2020-10-22-175439"
I1023 07:05:27.022611 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
I1023 07:05:27.033244 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"csi-snapshot-controller-operator", UID:"57f2f8b4-b14f-4716-92ce-fe4366c04cba", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

From the events, likewise, Progressing changed to False and the operator version changed within the same minute:
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from False to True ("Progressing: Waiting for Deployment to act on changes")
2020-10-23T07:05:21Z - 2020-10-23T07:05:21Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "Progressing: Waiting for Deployment to act on changes" to "Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: status.versions changed from [{"operator" "4.7.0-0.nightly-2020-10-17-034503"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-17-034503"}] to [{"operator" "4.7.0-0.nightly-2020-10-22-175439"} {"csi-snapshot-controller" "4.7.0-0.nightly-2020-10-22-175439"}]
2020-10-23T07:05:27Z - 2020-10-23T07:05:27Z (1) openshift-cluster-storage-operator Status for clusteroperator/csi-snapshot-controller changed: Progressing changed from True to False ("")

Changing status to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633