+++ This bug was initially created as a clone of Bug #1939744 +++

Origin recently began watching ClusterOperator conditions for surprising behavior [1]. That's turned up things like [2,3]:

[bz-Storage] clusteroperator/storage should not change condition/Progressing
Run #0: Failed 0s
4 unexpected clusteroperator state transitions during e2e test run

storage was Progressing=false, but became Progressing=true at 2021-03-16 18:58:24.518521908 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
storage was Progressing=true, but became Progressing=false at 2021-03-16 18:59:15.884552945 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: All is well
storage was Progressing=false, but became Progressing=true at 2021-03-16 19:00:38.749636882 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
storage was Progressing=true, but became Progressing=false at 2021-03-16 19:01:38.393173244 +0000 UTC -- AWSEBSCSIDriverOperatorCRProgressing: All is well

Per the API docs [4], however, Progressing is for:

Progressing indicates that the operator is actively rolling out new code, propagating config changes, or otherwise moving from one steady state to another. Operators should not report progressing when they are reconciling a previously known state.

That makes "my operand DaemonSet is not completely reconciled right now" a bit complicated, because you need to remember whether this is the first attempt at reconciling the current configuration or a later attempt at reconciling that same configuration.

In this case, the 18:58 disruption seems to have been a new node coming up:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 18:58: | head -n2
Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
Mar 16 18:58:23.870 I node/ip-10-0-138-230.us-west-2.compute.internal reason/NodeHasSufficientPID Node ip-10-0-138-230.us-west-2.compute.internal status is now: NodeHasSufficientPID

From the MachineSet scaling test:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 'e2e-test/\|Starting kubelet' | grep -1 'Starting kubelet'
Mar 16 18:54:17.923 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" started
Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
Mar 16 18:58:27.023 I node/ip-10-0-238-216.us-west-2.compute.internal reason/Starting Starting kubelet.
Mar 16 19:00:50.519 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" finishedStatus/Passed

One possibility for distinguishing between "I just bumped the DaemonSet" (ideally Progressing=True) and "it's reacting to the cluster shifting under it" (ideally Progressing=False) would be storing the version string (and possibly status.observedGeneration, to account for config changes) in the ClusterOperator's status.versions [5], with names keyed by operand. So moving from the current:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/clusteroperators.json | jq '.items[] | select(.metadata.name == "storage").status.versions'
[
  {
    "name": "operator",
    "version": "4.8.0-0.nightly-2021-03-16-173612"
  },
  {
    "name": "AWSEBSCSIDriverOperator",
    "version": "4.8.0-0.nightly-2021-03-16-173612"
  }
]

To something like:

[
  {
    "name": "operator",
    "version": "4.8.0-0.nightly-2021-03-16-173612"
  },
  {
    "name": "AWSEBSCSIDriverOperator",
    "version": "4.8.0-0.nightly-2021-03-16-173612 generation 1"
  }
  ...other operands...
]

[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#release-openshift-ocp-installer-e2e-aws-serial-4.8
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256
[4]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L147-L151
[5]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L43-L47

--- Additional comment from W. Trevor King on 2021-03-16 23:21:53 UTC ---

Clayton points out that you should also be able to compare your current, leveled 'operator' version with your desired version to decide if you are updating. And then... do something to see if you bumped the DaemonSet due to a config change. If there are no operator-config knobs that feed into the DaemonSet config, then great :). If there are, you could always record something about the most-recently-leveled config generation(s) or hashes or whatever under ClusterOperator's status.versions. I dunno; it gets a bit fiddly.
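To make that concrete, here's a rough Go sketch (illustrative only; the helper names and the "<version> generation <N>" encoding are just the convention floated above, not anything library-go exposes today) of how an operator could key per-operand entries in status.versions on version plus observed generation, and report Progressing=True only when the recorded level differs from the level it is currently rolling out:

package main

import "fmt"

// operandVersion mirrors the name/version pairs in ClusterOperator status.versions.
type operandVersion struct {
    Name    string
    Version string
}

// operandLevel encodes "which code at which config" an operand was last fully
// reconciled to, using the "<version> generation <N>" convention floated
// above (the exact encoding is an assumption, not an existing API).
func operandLevel(version string, observedGeneration int64) string {
    return fmt.Sprintf("%s generation %d", version, observedGeneration)
}

// shouldReportProgressing returns true only when the recorded level for the
// operand differs from the level currently being rolled out, i.e. new code or
// a propagating config change.  Re-reconciling a previously leveled state
// (e.g. a DaemonSet catching up after a new node joins) stays Progressing=False.
func shouldReportProgressing(recorded []operandVersion, operand, desiredLevel string) bool {
    for _, v := range recorded {
        if v.Name == operand {
            return v.Version != desiredLevel
        }
    }
    // Never leveled before: this is the initial rollout, so Progressing=True.
    return true
}

func main() {
    versions := []operandVersion{
        {Name: "operator", Version: "4.8.0-0.nightly-2021-03-16-173612"},
        {Name: "AWSEBSCSIDriverOperator", Version: "4.8.0-0.nightly-2021-03-16-173612 generation 1"},
    }

    // Same version, same generation: a DaemonSet re-settling after a new node
    // joined does not flip Progressing.
    desired := operandLevel("4.8.0-0.nightly-2021-03-16-173612", 1)
    fmt.Println(shouldReportProgressing(versions, "AWSEBSCSIDriverOperator", desired)) // false

    // A config change bumps the observed generation: now we really are progressing.
    desired = operandLevel("4.8.0-0.nightly-2021-03-16-173612", 2)
    fmt.Println(shouldReportProgressing(versions, "AWSEBSCSIDriverOperator", desired)) // true
}

The same comparison covers Clayton's point: the 'operator' entry vs. the desired release version answers "am I updating the operator itself?", and the per-operand generation suffix answers "did a config change bump this operand?".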
This does not look like an urgent issue that blocks 4.8.
Yeah, blocker- makes sense to me, regardless of severity. I doubt this needs a backport. It's mostly about de-noising [1] and similar, by bringing operator behavior in line with the documented API.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#periodic-ci-openshift-release-master-ci-4.8-e2e-aws-serial
After the library-go patch merges, we should:

1. Bump library-go in the CSI operators.
2. Fix CSO to use the new function exposed in library-go to check whether a Deployment is Progressing (roughly the kind of check sketched below).
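For reference, a minimal Go sketch of what such a "is the Deployment still progressing?" check usually boils down to. This is not the actual library-go helper; the function name and the exact conditions are assumptions, and the real helper may also consult the Deployment's own conditions:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// deploymentIsProgressing is an illustrative stand-in (not the actual
// library-go helper): the Deployment is considered to be progressing while
// the controller has not yet observed the latest spec, or while the rollout
// of that spec is not fully updated and available.
func deploymentIsProgressing(d *appsv1.Deployment) bool {
    // A spec change the Deployment controller has not observed yet.
    if d.Status.ObservedGeneration < d.Generation {
        return true
    }
    desired := int32(1) // Deployment defaults spec.replicas to 1 when unset
    if d.Spec.Replicas != nil {
        desired = *d.Spec.Replicas
    }
    // Still rolling pods onto the new template, or replacements not yet available.
    return d.Status.UpdatedReplicas < desired || d.Status.AvailableReplicas < desired
}

func main() {
    d := &appsv1.Deployment{}
    d.Generation = 2
    d.Status.ObservedGeneration = 1
    fmt.Println(deploymentIsProgressing(d)) // true: spec change not yet observed
}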
Going Progressing=true when the operand has not achieved a steady state is expected behavior. PR https://github.com/openshift/library-go/pull/1073 actually takes previously functioning code and makes it incorrect, by preventing the operator from going Progressing in cases where the operand is not in a steady state. This code should not change. If there's an issue with the docs, we can fix it, but if the operand is reacting to a change in cluster state, whether it was user-initiated or not, the operator status should be Progressing=true.