Bug 1939726 - clusteroperator/network should not change condition/Degraded during normal serial test execution
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aniket Bhat
QA Contact: Ying Wang
URL:
Whiteboard:
Duplicates: 1940992
Depends On:
Blocks:
Reported: 2021-03-16 21:56 UTC by Clayton Coleman
Modified: 2021-07-27 22:54 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
clusteroperator/network should not change condition/Degraded
Last Closed: 2021-07-27 22:53:48 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1056 | 0 | None | open | Bug 1939726: Enclose ApplyObject on RetryOnConflict | 2021-04-13 14:32:04 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:54:20 UTC

Description Clayton Coleman 2021-03-16 21:56:50 UTC
The network operator goes Degraded during a normal serial run when it gets a 409 conflict while trying to update a daemonset (the daemonset is being updated concurrently because a node is changing).

Operators must retry and absorb normal errors (a 409 is a normal error whenever two writers are active), and must not go Degraded because of them.  This is marked high because it can cause alerts to fire and generates noise during upgrades / machine scales in production environments.
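
A minimal sketch of the retry-and-absorb behaviour described above, written in Python for illustration only. The operator itself is written in Go, where this pattern is commonly expressed with client-go's retry.RetryOnConflict, and the linked pull request title suggests that route. ConflictError, FakeStore, retry_on_conflict and apply_port are hypothetical names invented for this sketch, not operator code. The point is that each attempt re-reads the object before writing, and a 409 only surfaces as a failure once the retries are exhausted.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 'object has been modified' response."""


def retry_on_conflict(update, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call update() until it succeeds, retrying only on ConflictError.

    update must re-read the latest object on every call so that each attempt
    writes against a fresh resourceVersion. Any other exception propagates
    immediately; the last ConflictError propagates once attempts run out.
    """
    for attempt in range(attempts):
        try:
            return update()
        except ConflictError:
            if attempt == attempts - 1:
                raise  # retries exhausted: only now is this a real failure
            # Exponential backoff with jitter so competing writers spread out.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))


class FakeStore:
    """In-memory object with optimistic concurrency, for illustration only."""

    def __init__(self):
        self.resource_version = 1
        self.spec = {"containerPort": 8080}

    def get(self):
        return self.resource_version, dict(self.spec)

    def put(self, seen_version, spec):
        if seen_version != self.resource_version:
            raise ConflictError("the object has been modified")
        self.spec = spec
        self.resource_version += 1


def apply_port(store, port, before_write=lambda: None):
    """Read-modify-write one field; before_write lets a caller inject a race."""
    version, spec = store.get()
    spec["containerPort"] = port
    before_write()
    store.put(version, spec)
    return store.resource_version
```

With this shape, a caller would wrap the whole read-modify-write, for example retry_on_conflict(lambda: apply_port(store, 8081)), and report Degraded only if the wrapper itself raises.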

2 unexpected clusteroperator state transitions during e2e test run 

network was Degraded=false, but became Degraded=true at 2021-03-16 17:51:36.871168815 +0000 UTC -- Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-multus/multus: could not update object (apps/v1, Kind=DaemonSet) openshift-multus/multus: Operation cannot be fulfilled on daemonsets.apps "multus": the object has been modified; please apply your changes to the latest version and try again
network was Degraded=true, but became Degraded=false at 2021-03-16 17:51:39.005966777 +0000 UTC -- 

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371867184564801536

This happens in ~1/8 runs. We only create 3 new nodes, so the window for failures is small in static workloads, but it is probably happening far more often on any autoscaling cluster.  Please prioritize so we can get noise reduction in CI and find more serious issues faster.

Comment 1 Alexander Constantinescu 2021-03-22 12:56:11 UTC
*** Bug 1940992 has been marked as a duplicate of this bug. ***

Comment 3 Ricardo Carrillo Cruz 2021-04-12 11:35:33 UTC
I'm able to repro this by doing:

1. Run a bash command like:

for i in {1..100000}; do oc -n openshift-network-diagnostics patch ds network-check-target -p {\"metadata\":{\"annotations\":{\"foo\":\"${i}\"}}}; done

2. Edit network-check-target ds and change the spec (e.g. containerPort)

Then I see this in the CNO logs:

I0412 13:33:46.043376 2624431 log.go:181] Set operator conditions:
- lastTransitionTime: "2021-04-12T10:06:52Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-04-12T11:32:53Z"
  message: 'Error while updating operator configuration: could not apply (apps/v1,
    Kind=DaemonSet) openshift-network-diagnostics/network-check-target: could not
    update object (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target:
    Operation cannot be fulfilled on daemonsets.apps "network-check-target": the object
    has been modified; please apply your changes to the latest version and try again'
  reason: ApplyOperatorConfig
  status: "True"
  type: Degraded

Comment 8 W. Trevor King 2021-05-10 03:13:36 UTC
Checking CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=clusteroperator/network+should+not+change+condition/Degraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 9 runs, 100% failed, 11% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 7 runs, 14% failed, 100% of failures match = 14% impact
pull-ci-openshift-multus-cni-master-e2e-aws (all) - 12 runs, 100% failed, 8% of failures match = 8% impact

Not too many jobs anymore, which is good.
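
The "impact" column in the output above appears to be the share of all runs that both failed and matched the search, that is, the failure rate multiplied by the match rate. That reading is an assumption: it is consistent with all four rows, but the search tool's exact rounding is not shown here. A small sketch in Python (the function name impact is made up for illustration):

```python
def impact(failed_pct, match_pct):
    """Percent of all runs that both failed and matched the search."""
    return failed_pct * match_pct / 100


# (job, runs, % failed, % of failures matching), copied from the output above
rows = [
    ("periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade", 2, 100, 50),
    ("periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade", 9, 100, 11),
    ("periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt", 7, 14, 100),
    ("pull-ci-openshift-multus-cni-master-e2e-aws", 12, 100, 8),
]

for job, runs, failed, match in rows:
    print(f"{impact(failed, match):.0f}% impact over {runs} runs: {job}")
```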

Comment 9 Ying Wang 2021-05-11 07:43:14 UTC
1. Deployed an OpenShift cluster using image registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-06-190249.
2. Ran the commands below:

for i in {1..100000}; do oc -n openshift-network-diagnostics patch ds network-check-target -p {\"metadata\":{\"annotations\":{\"foo\":\"${i}\"}}}; done

3. Edited the network-check-target ds and changed the spec containerPort from 8080 to 8081.

4. Checked the CNO logs and didn't find logs like the ones below:
I0412 13:33:46.043376 2624431 log.go:181] Set operator conditions:
- lastTransitionTime: "2021-04-12T10:06:52Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-04-12T11:32:53Z"
  message: 'Error while updating operator configuration: could not apply (apps/v1,
    Kind=DaemonSet) openshift-network-diagnostics/network-check-target: could not
    update object (apps/v1, Kind=DaemonSet) openshift-network-diagnostics/network-check-target:
    Operation cannot be fulfilled on daemonsets.apps "network-check-target": the object
    has been modified; please apply your changes to the latest version and try again'
  reason: ApplyOperatorConfig
  status: "True"
  type: Degraded

Instead, it showed the logs below:

I0511 07:23:42.025750       1 log.go:184] Reconciling update to openshift-network-diagnostics/network-check-target
I0511 07:23:42.049289       1 log.go:184] Set operator conditions:
- lastTransitionTime: "2021-05-11T06:41:55Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2021-05-11T06:45:01Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2021-05-11T06:41:56Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2021-05-11T07:23:38Z"
  message: DaemonSet "openshift-network-diagnostics/network-check-target" update is
    rolling out (4 out of 5 updated)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2021-05-11T06:42:33Z"
  status: "True"
  type: Available

Comment 12 errata-xmlrpc 2021-07-27 22:53:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

