Bug 1746924

Summary: Status and Reason not set correctly for network-operator
Product: OpenShift Container Platform
Reporter: Lili Cosic <lcosic>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED NEXTRELEASE
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aos-bugs, bbennett
Version: 4.1.z
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-22 14:14:14 UTC
Type: Bug

Description Lili Cosic 2019-08-29 14:09:20 UTC
Description of problem:
While debugging a 4.1.9 cluster that was failing to upgrade, we noticed that the networking operator was possibly not setting its status correctly: it had been in the Progressing condition for over a week, with Reason set to Deploying (sketched at the end of this comment).

Version-Release number of selected component (if applicable):
4.1.9 while upgrading to 4.1.11

How reproducible:
Don't know.

Steps to Reproduce:
1. Break a node in the cluster, then trigger an upgrade from 4.1.9 to 4.1.11. This triggers a rollout to the broken node, which fails, but the operator status stays in the Progressing condition instead of moving to Degraded.


Actual results:
The network operator stayed in the Progressing condition with Reason Deploying; it never moved to Degraded even though the rollout was not making progress.

Expected results:
Expected the operator status to move to Degraded once the rollout ran into a problem.

Additional info:
We did see KubeDaemonSetRolloutStuck alerts firing for the multus and sdn daemonsets.
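
For reference, a minimal sketch of the Progressing condition described above, assuming the github.com/openshift/api config/v1 types; the Message text is illustrative, not copied from the cluster:

package main

import (
    "fmt"

    configv1 "github.com/openshift/api/config/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    // The condition observed on the network ClusterOperator for over a week.
    // Reason comes from the bug report; the Message string is made up here.
    stuck := configv1.ClusterOperatorStatusCondition{
        Type:               configv1.OperatorProgressing,
        Status:             configv1.ConditionTrue,
        Reason:             "Deploying",
        Message:            "DaemonSet rollout in progress",
        LastTransitionTime: metav1.Now(),
    }
    fmt.Printf("%s=%s (Reason=%s)\n", stuck.Type, stuck.Status, stuck.Reason)
}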

Comment 1 Casey Callendrello 2019-08-30 08:00:28 UTC
Yeah, we naively translate the status of the daemonset. We should probably have some sort of timeout that detects when a rollout is hung.

Do you know if there are any best practices for this? I could see this varying based on, say, the size of the cluster: a one-at-a-time daemonset rollout will take a long time on a cluster with many nodes.
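
A minimal sketch of the kind of timeout check being discussed, assuming the operator tracks a per-DaemonSet last-progress timestamp; the helper name, signature, and threshold handling are hypothetical, not the operator's actual code:

package rollout

import (
    "time"

    appsv1 "k8s.io/api/apps/v1"
)

// isRolloutHung is a hypothetical helper: it reports a hung rollout when the
// DaemonSet still has pods to update (or unavailable pods) and nothing has
// changed since lastProgress for longer than threshold. The threshold would
// likely need to scale with node count for a one-at-a-time rollout.
func isRolloutHung(ds *appsv1.DaemonSet, lastProgress time.Time, threshold time.Duration) bool {
    s := ds.Status
    // Comparing ObservedGeneration against Generation avoids treating a
    // not-yet-observed spec update as a finished rollout.
    rolloutDone := s.ObservedGeneration >= ds.Generation &&
        s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
        s.NumberUnavailable == 0
    if rolloutDone {
        return false
    }
    return time.Since(lastProgress) > threshold
}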

Comment 2 Lili Cosic 2019-08-30 08:36:05 UTC
Not sure a timeout is the right approach here. I would suggest, on every reconcile error in your operator, setting the Status to Degraded and the Reason to a predefined error string, something like "DaemonsetMultusError", whenever one of the DaemonSets or any other resource hits an error. I assume you already compare the desired number of your resources against the current number, and that is where you can detect the error. You can also look at how other operators set their Reason.

Note: the Reason string should come from a predefined, bounded list of values, so we do not produce high-cardinality metrics.
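
A rough sketch of that suggestion, assuming the openshift/api config/v1 condition types; the reason constants and helper name are hypothetical placeholders, not the operator's real code:

package status

import (
    configv1 "github.com/openshift/api/config/v1"
    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Predefined, bounded set of Reason strings (names are illustrative).
const (
    ReasonDeploying            = "Deploying"
    ReasonDaemonSetMultusError = "DaemonsetMultusError"
    ReasonDaemonSetSDNError    = "DaemonsetSDNError"
)

// degradedIfNotCaughtUp is a hypothetical helper: it returns a Degraded=True
// condition with a predefined Reason when the DaemonSet's updated/available
// pod counts lag the desired count, and nil when the rollout is caught up.
func degradedIfNotCaughtUp(ds *appsv1.DaemonSet, reason string) *configv1.ClusterOperatorStatusCondition {
    s := ds.Status
    if s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
        s.NumberAvailable == s.DesiredNumberScheduled {
        return nil
    }
    return &configv1.ClusterOperatorStatusCondition{
        Type:               configv1.OperatorDegraded,
        Status:             configv1.ConditionTrue,
        Reason:             reason,
        Message:            "DaemonSet rollout is not making progress",
        LastTransitionTime: metav1.Now(),
    }
}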

Comment 3 Casey Callendrello 2019-11-22 14:14:14 UTC
Fixed in https://github.com/openshift/cluster-network-operator/pull/358.