The network operator goes Degraded during a normal serial run because it gets a 409 conflict while trying to update a daemonset (the daemonset is itself being updated because a node is changing). Operators must retry and absorb normal errors (a 409 is a normal error whenever two writers are active) and must not go Degraded because of them. This is marked high because it can cause alerts to fire and generates noise during upgrades and machine scaling in production environments.

2 unexpected clusteroperator state transitions during e2e test run

network was Degraded=false, but became Degraded=true at 2021-03-16 17:51:36.871168815 +0000 UTC -- Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-multus/multus: could not update object (apps/v1, Kind=DaemonSet) openshift-multus/multus: Operation cannot be fulfilled on daemonsets.apps "multus": the object has been modified; please apply your changes to the latest version and try again
network was Degraded=true, but became Degraded=false at 2021-03-16 17:51:39.005966777 +0000 UTC --

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371867184564801536

Happening in ~1/8 runs (we only create 3 new nodes, so the window for failures is small in static workloads, but this is probably happening far more often on any autoscaling cluster). Please prioritize so we can reduce noise in CI and find more serious issues faster.
*** Bug 1940992 has been marked as a duplicate of this bug. ***
I'm able to repro this by doing:

1. Run a bash command like:

for i in {1..100000}; do oc -n openshift-network-diagnostics patch ds network-check-target -p {\"metadata\":{\"annotations\":{\"foo\":\"${i}\"}}}; done

2. Edit the network-check-target ds and change the spec (e.g. containerPort)

Then I see this in the CNO logs:

I0412 13:33:46.043376 2624431 log.go:181] Set operator conditions:
- lastTransitionTime: "2021-04-12T10:06:52Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-04-12T11:32:53Z"
  message: 'Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target: could not update object (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target: Operation cannot be fulfilled on daemonsets.apps "network-check-target": the object has been modified; please apply your changes to the latest version and try again'
  reason: ApplyOperatorConfig
  status: "True"
  type: Degraded
https://github.com/openshift/cluster-network-operator/pull/1056/files
This is also blocking compact upgrade jobs: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade
Checking CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=clusteroperator/network+should+not+change+condition/Degraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 7 runs, 14% failed, 100% of failures match = 14% impact
pull-ci-openshift-multus-cni-master-e2e-aws (all) - 12 runs, 100% failed, 8% of failures match = 8% impact

Not too many jobs anymore, which is good.
1. Deployed an OpenShift cluster using image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-06-190249.

2. Ran the command below:

for i in {1..100000}; do oc -n openshift-network-diagnostics patch ds network-check-target -p {\"metadata\":{\"annotations\":{\"foo\":\"${i}\"}}}; done

3. Edited the network-check-target ds and changed the spec containerPort from 8080 to 8081.

4. Checked the CNO logs and did not find logs like the following:

I0412 13:33:46.043376 2624431 log.go:181] Set operator conditions:
- lastTransitionTime: "2021-04-12T10:06:52Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-04-12T11:32:53Z"
  message: 'Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target: could not update object (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target: Operation cannot be fulfilled on daemonsets.apps "network-check-target": the object has been modified; please apply your changes to the latest version and try again'
  reason: ApplyOperatorConfig
  status: "True"
  type: Degraded

Instead, it showed logs as below:

I0511 07:23:42.025750 1 log.go:184] Reconciling update to openshift-network-diagnostics/network-check-target
I0511 07:23:42.049289 1 log.go:184] Set operator conditions:
- lastTransitionTime: "2021-05-11T06:41:55Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-05-11T06:45:01Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2021-05-11T06:41:56Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2021-05-11T07:23:38Z"
  message: DaemonSet "openshift-network-diagnostics/network-check-target" update is rolling out (4 out of 5 updated)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2021-05-11T06:42:33Z"
  status: "True"
  type: Available
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438