Bug 1940286 - Storage operator goes Progressing=True on DaemonSet rollout, despite no config changes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Fabio Bertinatto
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-18 03:42 UTC by W. Trevor King
Modified: 2021-06-04 07:16 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1939744
Environment:
Last Closed: 2021-06-03 13:37:03 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-03-18 03:42:13 UTC
+++ This bug was initially created as a clone of Bug #1939744 +++

Origin recently began watching ClusterOperator conditions for surprising behavior [1].  That's turned up things like [2,3]:

  [bz-Storage] clusteroperator/storage should not change condition/Progressing
    Run #0: Failed	0s
      4 unexpected clusteroperator state transitions during e2e test run 

      storage was Progressing=false, but became Progressing=true at 2021-03-16 18:58:24.518521908 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      storage was Progressing=true, but became Progressing=false at 2021-03-16 18:59:15.884552945 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: All is well
      storage was Progressing=false, but became Progressing=true at 2021-03-16 19:00:38.749636882 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      storage was Progressing=true, but became Progressing=false at 2021-03-16 19:01:38.393173244 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: All is well

Per the API docs [4], however, Progressing is for:

  Progressing indicates that the operator is actively rolling out new code, propagating config changes, or otherwise moving from one steady state to another.  Operators should not report progressing when they are reconciling a previously known state.

That makes "my operand DaemonSet is not completely reconciled right now" a bit complicated, because you need to remember whether this is the first attempt at reconciling the current configuration or a later attempt at reconciling it.  In this case, the 18:58 disruption seems to have been a new node coming up:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 18:58: | head -n2
  Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 18:58:23.870 I node/ip-10-0-138-230.us-west-2.compute.internal reason/NodeHasSufficientPID Node ip-10-0-138-230.us-west-2.compute.internal status is now: NodeHasSufficientPID

From the MachineSet scaling test:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 'e2e-test/\|Starting kubelet' | grep -1 'Starting kubelet'
  Mar 16 18:54:17.923 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" started
  Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 18:58:27.023 I node/ip-10-0-238-216.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 19:00:50.519 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" finishedStatus/Passed

One possibility for distinguishing between "I just bumped the DaemonSet" (ideally Progressing=True) and "it's reacting to the cluster shifting under it" (ideally Progressing=False) would be storing the version string (and possibly status.observedGeneration, to account for config changes) in the ClusterOperator's status.versions [5], with names keyed by operand.  So moving from the current:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/clusteroperators.json | jq '.items[] | select(.metadata.name == "storage").status.versions'
  [
    {
      "name": "operator",
      "version": "4.8.0-0.nightly-2021-03-16-173612"
    },
    {
      "name": "AWSEBSCSIDriverOperator",
      "version": "4.8.0-0.nightly-2021-03-16-173612"
    }
  ]

To something like:

  [
    {
      "name": "operator",
      "version": "4.8.0-0.nightly-2021-03-16-173612"
    },
    {
      "name": "AWSEBSCSIDriverOperator",
      "version": "4.8.0-0.nightly-2021-03-16-173612 generation 1"
    }
    ...other operands...
  ]
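
A minimal sketch of what recording such a per-operand entry could look like, written against the openshift/api and k8s.io/api Go types.  The package name, helper names, and the "version generation N" encoding are illustrative only, not an existing library-go API:

  package storagestatus

  import (
      "fmt"

      configv1 "github.com/openshift/api/config/v1"
      appsv1 "k8s.io/api/apps/v1"
  )

  // operandVersionWithGeneration builds the proposed status.versions entry,
  // e.g. "4.8.0-0.nightly-2021-03-16-173612 generation 1", combining the
  // release version with the DaemonSet generation the operator observed.
  func operandVersionWithGeneration(name, releaseVersion string, ds *appsv1.DaemonSet) configv1.OperandVersion {
      return configv1.OperandVersion{
          Name:    name,
          Version: fmt.Sprintf("%s generation %d", releaseVersion, ds.Status.ObservedGeneration),
      }
  }

  // alreadyLeveled reports whether status.versions already records this exact
  // (name, version+generation) pair.  If it does, an incomplete DaemonSet
  // rollout is "reconciling a previously known state" and should not flip
  // Progressing=True.
  func alreadyLeveled(status *configv1.ClusterOperatorStatus, want configv1.OperandVersion) bool {
      for _, v := range status.Versions {
          if v.Name == want.Name && v.Version == want.Version {
              return true
          }
      }
      return false
  }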

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#release-openshift-ocp-installer-e2e-aws-serial-4.8
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256
[4]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L147-L151
[5]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L43-L47

--- Additional comment from W. Trevor King on 2021-03-16 23:21:53 UTC ---

Clayton points out that you should also be able to compare your current, leveled 'operator' version with your desired version to decide if you are updating.  And then... do something to see if you bumped the DaemonSet due to a config change.  If there are no operator-config knobs that feed into the DaemonSet config, then great :).  If there are, you could always record something about the most-recently-leveled config generation(s) or hashes or whatever under ClusterOperator's status.versions.  I dunno; gets a bit fiddly.
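
A similarly hedged sketch of that check; the function and parameter names are made up for illustration, and how the operator learns its desired release version is left out:

  package storagestatus

  import configv1 "github.com/openshift/api/config/v1"

  // rolloutIsAnUpdate reports whether an in-flight operand rollout should count
  // as Progressing: either the leveled 'operator' version still differs from
  // the desired release version, or a recorded operator-config generation has
  // changed.  Rollouts triggered purely by cluster churn (e.g. a new node
  // joining) match neither case and would stay Progressing=False.
  func rolloutIsAnUpdate(versions []configv1.OperandVersion, desiredVersion string, currentConfigGeneration, recordedConfigGeneration int64) bool {
      for _, v := range versions {
          if v.Name == "operator" && v.Version != desiredVersion {
              return true
          }
      }
      return currentConfigGeneration != recordedConfigGeneration
  }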

Comment 2 Jan Safranek 2021-03-26 14:45:49 UTC
This does not look like an urgent issue that blocks 4.8.

Comment 3 W. Trevor King 2021-03-28 03:30:58 UTC
Yeah, blocker- makes sense to me, regardless of severity.  I doubt this needs a backport.  It's mostly about de-noising [1] and similar, by bringing operator behavior in line with the documented API.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial

Comment 4 Fabio Bertinatto 2021-06-02 07:32:46 UTC
After the library-go patch merges, we should:

1. Bump library-go in CSI operators
2. Fix CSO to use the new function exposed in library-go to check if Deployment is Progressing
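
For illustration only (this is not the actual library-go helper), the kind of apps/v1 bookkeeping such a "is the Deployment progressing" check consults looks roughly like:

  package storagestatus

  import appsv1 "k8s.io/api/apps/v1"

  // deploymentIsProgressing mirrors standard apps/v1 rollout bookkeeping: the
  // controller has not yet observed the latest spec, or not all desired
  // replicas are updated and available.
  func deploymentIsProgressing(d *appsv1.Deployment) bool {
      if d.Status.ObservedGeneration < d.Generation {
          return true
      }
      desired := int32(1)
      if d.Spec.Replicas != nil {
          desired = *d.Spec.Replicas
      }
      return d.Status.UpdatedReplicas < desired || d.Status.AvailableReplicas < desired
  }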

Comment 5 David Eads 2021-06-03 13:37:03 UTC
Going Progressing=True when the operand has not achieved a steady state is expected behavior.  PR https://github.com/openshift/library-go/pull/1073 actually takes previously functioning code and makes it incorrect by preventing the operator from going Progressing in cases where the operand is not in a steady state.

This code should not change.  If there's an issue with the docs, we can fix it, but if the operand is reacting to a change in the cluster state, whether it was user-initiated or not, the operator status should be Progressing=True.

