1680562 – [network-operator] It may take about 5 minutes to update the operator status back to normal after fixing the problem in network config

Summary: [network-operator] It may take about 5 minutes to update the operator status ...

Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Dan Winship
QA Contact:	Meng Bo

Reported:	2019-02-25 10:26 UTC by Meng Bo
Modified:	2019-06-04 10:44 UTC
CC List:	4 users
Doc Type:	No Doc Update
Last Closed:	2019-06-04 10:44:27 UTC

Attachments
network_operator_log (16.27 KB, text/plain) 2019-02-25 10:26 UTC, Meng Bo	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 115	0	None	closed	Bug 1680562: ClusterOperator.Status fixes	2020-10-13 13:11:49 UTC
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:44:33 UTC

Description Meng Bo 2019-02-25 10:26:26 UTC

Created attachment 1538395
network_operator_log

Description of problem:
The newly added network clusteroperator.config.openshift.io resource monitors the network operator and reports its status in real time. However, after a problem in the config is fixed, it can take a long time before the operator is reported as available again.

Version-Release number of selected component (if applicable):
v4.0.0-0.177.0

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP cluster

2. Check the network clusteroperator.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
network             True        False         False     7s

3. Introduce a problem in network.config.openshift.io

4. Check that clusteroperator.config.openshift.io reports FAILING
NAME      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
network             False       False         True      9s

5. Fix the problem in network.config.openshift.io

6. Watch the clusteroperator

Actual results:
It can take about 5 minutes before the operator is reported as available again:

NAME      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
network             False       False         False     4m


Expected results:
The operator status should be reported in real time.

Additional info:
Full log of the network operator attached.
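The steps above check the AVAILABLE, PROGRESSING and FAILING columns of the `oc get` output by eye. A minimal Python sketch for reading those columns programmatically follows; the function name is hypothetical and not part of any OpenShift tooling. Because the VERSION column can be empty (as in the tables above), it slices each row at the column offsets found in the header instead of splitting on whitespace, assuming the output is column-aligned as shown.

```python
from typing import Dict, List


def parse_clusteroperator_table(text: str) -> List[Dict[str, str]]:
    """Parse column-aligned `oc get clusteroperator` output into one dict per row.

    Hypothetical helper: assumes every column starts at the same offset as
    its header name, so an empty VERSION cell becomes "" instead of
    shifting the remaining values one column to the left.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header, body = lines[0], lines[1:]
    names = header.split()

    # Record where each header name starts in the header line.
    starts = []
    pos = 0
    for name in names:
        idx = header.index(name, pos)
        starts.append(idx)
        pos = idx + len(name)

    rows = []
    for line in body:
        row = {}
        for i, name in enumerate(names):
            # Each cell runs up to the start of the next column (or end of line).
            end = starts[i + 1] if i + 1 < len(names) else len(line)
            row[name] = line[starts[i]:end].strip()
        rows.append(row)
    return rows
```

Fed the output captured in step 4, it should return a single row for `network` whose VERSION is empty and whose FAILING value is "True".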

Comment 1 Dan Winship 2019-03-05 17:15:05 UTC

The fix for this also changes the behavior of the status conditions a bit. In particular, now when you break the configuration, the operator Failing status will become True, but the Available status will not change; the operator will report that it is both Failing and Available. Then when you fix the config again, it should report Failing False, but you won't need to wait for Available to change, because it never changed in the first place.
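The condition semantics described above can be modelled in a few lines. This is a minimal Python sketch under stated assumptions, not the operator's actual Go code: the class and function names are hypothetical, and it only captures the rule from this comment that a config error toggles Failing while leaving Available as it was.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NetworkOperatorStatus:
    """Hypothetical model of the three conditions shown by `oc get clusteroperator`."""
    available: bool
    progressing: bool
    failing: bool


def apply_config_result(
    status: NetworkOperatorStatus, config_error: Optional[str]
) -> NetworkOperatorStatus:
    """Sketch of the post-fix semantics (assumed, not the operator's code).

    A config error sets Failing to True; Available is deliberately left
    unchanged, so fixing the config only has to clear Failing.
    """
    return replace(status, failing=config_error is not None)


if __name__ == "__main__":
    status = NetworkOperatorStatus(available=True, progressing=False, failing=False)
    # Break the config (the error text is a made-up placeholder).
    status = apply_config_result(status, "invalid network config")
    print(status)  # Available and Failing are both True
    # Fix the config again.
    status = apply_config_result(status, None)
    print(status)  # Failing is back to False; Available never changed
```

Under this model a bad config yields (Available, Progressing, Failing) = True, False, True and a fixed config yields True, False, False, with no separate wait for Available to recover.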

Comment 3 Anurag saxena 2019-03-14 18:26:05 UTC

(In reply to Dan Winship from comment #1)
> The fix for this also changes the behavior of the status conditions a bit.
> In particular, now when you break the configuration, the operator Failing
> status will become True, but the Available status will not change; the
> operator will report that it is both Failing and Available. Then when you
> fix the config again, it should report Failing False, but you won't need to
> wait for Available to change, because it never changed in the first place.

Yep, as noticed on 4.0.0-0.nightly-2019-03-13-233958, it now shows True,False,True (AVAILABLE, PROGRESSING, FAILING) after a bad config:

# oc get clusteroperators.config.openshift.io | grep "network\|NAME"
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network                               4.0.0-0.nightly-2019-03-13-233958   True        False         True      125m

and ~20 seconds after correcting the config, it shows True,False,False:

# oc get clusteroperators.config.openshift.io | grep "network\|NAME"
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network                               4.0.0-0.nightly-2019-03-13-233958   True        False         False     129m

Comment 5 Meng Bo 2019-03-19 07:55:17 UTC

Checked on OCP 4.0.0-0.nightly-2019-03-19-004004

The status of the network operator at the beginning:
# oc get clusteroperator network -o wide
NAME      VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network   4.0.0-0.nightly-2019-03-19-004004   True        False         False     84m

After introducing the problem in network.config.openshift.io/cluster:
# oc get clusteroperator network
NAME      VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network   4.0.0-0.nightly-2019-03-19-004004   True        False         True      86m

After I fixed the problem above, the status was refreshed within a few seconds:
# oc get clusteroperator network
NAME      VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network   4.0.0-0.nightly-2019-03-19-004004   True        False         False     86m

Marking the bug as verified.

Comment 7 errata-xmlrpc 2019-06-04 10:44:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
