Bug 1939744 - Network operator goes Progressing=True on DaemonSet rollout, despite no config changes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Nadia Pinaeva
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-16 22:58 UTC by W. Trevor King
Modified: 2021-09-20 13:17 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1940286 1941224 (view as bug list)
Environment:
Last Closed: 2021-09-20 13:17:14 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-03-16 22:58:52 UTC
Origin recently began watching ClusterOperator conditions for surprising behavior [1].  That's turned up things like [2,3]:

  [bz-Networking] clusteroperator/network should not change condition/Progressing 
    Run #0: Failed	0s
      2 unexpected clusteroperator state transitions during e2e test run 

      network was Progressing=false, but became Progressing=true at 2021-03-16 18:58:24.146588772 +0000 UTC -- DaemonSet "openshift-sdn/ovs" update is rolling out (6 out of 7 updated)
      network was Progressing=true, but became Progressing=false at 2021-03-16 19:01:38.792711425 +0000 UTC -- 

Per the API docs [4], however, Progressing is for:

  Progressing indicates that the operator is actively rolling out new code, propagating config changes, or otherwise moving from one steady state to another.  Operators should not report progressing when they are reconciling a previously known state.

That makes "my operand DaemonSet is not completely reconciled right now" a bit complicated, because you need to remember whether this is the first attempt at reconciling the current configuration or a later one.  In this case, the 18:58 disruption seems to have been a new node coming up:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 18:58: | head -n2
  Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 18:58:23.870 I node/ip-10-0-138-230.us-west-2.compute.internal reason/NodeHasSufficientPID Node ip-10-0-138-230.us-west-2.compute.internal status is now: NodeHasSufficientPID

From the MachineSet scaling test:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/e2e.log | grep 'e2e-test/\|Starting kubelet' | grep -1 'Starting kubelet'
  Mar 16 18:54:17.923 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" started
  Mar 16 18:58:23.542 I node/ip-10-0-138-230.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 18:58:27.023 I node/ip-10-0-238-216.us-west-2.compute.internal reason/Starting Starting kubelet.
  Mar 16 19:00:50.519 I e2e-test/"[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Suite:openshift/conformance/serial]" finishedStatus/Passed

One possibility for distinguishing between "I just bumped the DaemonSet" (ideally Progressing=True) and "it's reacting to the cluster shifting under it" (ideally Progressing=False) would be storing the version string (and possibly status.observedGeneration, to account for config changes) in the ClusterOperator's status.versions [5], with names keyed by operand.  So moving from the current:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256/artifacts/e2e-aws-serial/clusteroperators.json | jq '.items[] | select(.metadata.name == "network").status.versions'
  [
    {
      "name": "operator",
      "version": "4.8.0-0.nightly-2021-03-16-173612"
    }
  ]

To something like:

  [
    {
      "name": "operator",
      "version": "4.8.0-0.nightly-2021-03-16-173612"
    },
    {
      "name": "ovs",
      "version": "4.8.0-0.nightly-2021-03-16-173612 generation 1"
    },
    ...other operands...
  ]
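
For illustration only (this is not the network operator's actual code), a minimal Go sketch of what that bookkeeping could look like, using the real configv1.OperandVersion and appsv1.DaemonSet types but hypothetical helper names:

  // Sketch only: record the version and generation last fully leveled for each
  // operand DaemonSet, and only report Progressing=True for rollouts triggered
  // by new code or a config change, not for the cluster shifting under the
  // operand (e.g. a new node joining).
  package sketch

  import (
      "fmt"

      configv1 "github.com/openshift/api/config/v1"
      appsv1 "k8s.io/api/apps/v1"
  )

  // operandVersion renders the proposed "version + generation" entry for
  // ClusterOperator status.versions.
  func operandVersion(name, operatorVersion string, ds *appsv1.DaemonSet) configv1.OperandVersion {
      return configv1.OperandVersion{
          Name:    name,
          Version: fmt.Sprintf("%s generation %d", operatorVersion, ds.Generation),
      }
  }

  // shouldReportProgressing returns true only while the DaemonSet is rolling
  // out *and* its version/generation has not been leveled before, i.e. the
  // rollout is new code or a config change rather than reconciliation of a
  // previously known state.
  func shouldReportProgressing(recorded []configv1.OperandVersion, name, operatorVersion string, ds *appsv1.DaemonSet) bool {
      rollingOut := ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled ||
          ds.Status.ObservedGeneration < ds.Generation
      if !rollingOut {
          return false
      }
      want := operandVersion(name, operatorVersion, ds)
      for _, v := range recorded {
          if v.Name == name && v.Version == want.Version {
              // Already leveled this version/generation once, so this rollout
              // is reconciliation of a previously known state.
              return false
          }
      }
      return true
  }

With an entry like the "ovs" one above recorded at the last successful level, a rollout whose version and generation still match the recorded entry could be treated as Progressing=False.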


[1]: https://github.com/openshift/origin/pull/25918#event-4423357757
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-blocking#release-openshift-ocp-installer-e2e-aws-serial-4.8
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371878957158240256
[4]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L147-L151
[5]: https://github.com/openshift/api/blob/8356aa4d4afb94790d3ad58c4debe0e1bdabcbe9/config/v1/types_cluster_operator.go#L43-L47

Comment 2 W. Trevor King 2021-03-16 23:21:53 UTC
Clayton points out that you should also be able to compare your current, leveled 'operator' version with your desired version to decide if you are updating.  And then... do something to see if you bumped the DaemonSet due to a config change.  If there are no operator-config knobs that feed into the DaemonSet config, then great :).  If there are, you could always record something about the most-recently-leveled config generation(s) or hashes or whatever under ClusterOperator's status.versions.  I dunno; gets a bit fiddly.
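
For illustration (a hypothetical helper extending the sketch above, not CNO code), that check could boil down to something like:

  // A rollout only counts as "progressing" if the operator is moving to a new
  // version, or the operand config it renders has changed since the last
  // fully-leveled state (compared here as opaque hashes).
  func updatingOrConfigChanged(leveledVersion, desiredVersion, leveledConfigHash, renderedConfigHash string) bool {
      if leveledVersion != desiredVersion {
          return true // new operator code is rolling out
      }
      return leveledConfigHash != renderedConfigHash // an operator-config knob that feeds the DaemonSet changed
  }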

Comment 5 Nadia Pinaeva 2021-09-15 15:18:32 UTC
This seems to be NOTABUG, since the API documentation has changed (https://github.com/openshift/api/pull/935) and Progressing=True is now considered normal when a new node is added.
Also, see the similar bug for Storage, https://bugzilla.redhat.com/show_bug.cgi?id=1940286 (resolved as NOTABUG).

Comment 6 Nadia Pinaeva 2021-09-20 13:17:14 UTC
Based on the previous comment, closing this as NOTABUG.

