Bug 1781948

Summary: Ingress operator logs spurious "failed to sync ingresscontroller status" errors
Product: OpenShift Container Platform Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: NetworkingAssignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: unspecified CC: aos-bugs, dmace, mifiedle, mmasters, wabouham
Version: 4.3.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The ingress operator was logging "failed to sync ingresscontroller status" errors and rapidly rerunning its synchronization loop when an ingresscontroller's "Degraded" status condition was true, even if the operator had in fact successfully updated the ingresscontroller's status. Consequence: The ingress operator logs had spurious error messages. Fix: The error handling logic was fixed to avoid the spurious error messages. Result: The "failed to sync ingresscontroller status" errors should no longer appear in the ingress operator's logs when it updates an ingresscontroller's status.
Story Points: ---
Clone Of: 1781345
: 1781950 (view as bug list) Environment:
Last Closed: 2020-05-13 21:54:36 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1781950    

Description Miciah Dashiel Butler Masters 2019-12-10 23:21:03 UTC
The ingress operator logs spurious "failed to sync ingresscontroller status" errors, even if it has successfully updated the ingresscontroller's status, when the ingresscontroller's "Degraded" status condition is true. 

+++ This bug was initially created as a clone of Bug #1781345 +++

Description of problem:
This on an 4.3 OCP IPI installed cluster on Azure.  When trying to run the node-vertical test where we try to deploy up to 250 gcr.io/google_containers/pause-amd64:3.0 pods per worker node in a single namespace, the ingress operator degraded and 2 worker nodes became NotReady.
This cluster is fips-enabled and using SDN network type.

root@ip-172-31-40-229: ~/openshift-scale/workloads/workloads # oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      148m
cloud-credential                           4.3.0-0.nightly-2019-12-09-035405   True        False         False      170m
cluster-autoscaler                         4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
console                                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      156m
dns                                        4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
image-registry                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      17m
ingress                                    4.3.0-0.nightly-2019-12-09-035405   False       True          True       23m
insights                                   4.3.0-0.nightly-2019-12-09-035405   True        False         False      167m
kube-apiserver                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      165m
kube-controller-manager                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
kube-scheduler                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      163m
machine-api                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
machine-config                             4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
marketplace                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
monitoring                                 4.3.0-0.nightly-2019-12-09-035405   False       True          True       22m
network                                    4.3.0-0.nightly-2019-12-09-035405   True        True          True       165m
node-tuning                                4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
openshift-apiserver                        4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
openshift-controller-manager               4.3.0-0.nightly-2019-12-09-035405   True        False         False      165m
openshift-samples                          4.3.0-0.nightly-2019-12-09-035405   True        False         False      161m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-09-035405   True        False         False      166m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
service-ca                                 4.3.0-0.nightly-2019-12-09-035405   True        False         False      167m
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-09-035405   True        False         False      164m
storage                                    4.3.0-0.nightly-2019-12-09-035405   True        False         False      162m
root@ip-172-31-40-229: ~/openshift-scale/workloads/workloads # 


In openshift-ingress-operator logs, I am seeing:

2019-12-09T15:48:01.016Z        ERROR   operator.init.controller-runtime.controller     controller/controller.go:218    Reconciler error        {"controller": "ingress_controller", "request": "openshift-ingress-operator/default", "error": "failed to sync ingresscontroller status: IngressController is degraded", "errorCauses": [{"error": "failed to sync ingresscontroller status: IngressController is degraded"}]}

[...]

Comment 2 Hongan Li 2019-12-23 03:25:35 UTC
verified with 4.4.0-0.nightly-2019-12-20-210709

Comment 4 errata-xmlrpc 2020-05-13 21:54:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581