
Bug 1725220

Summary: CNO stuck in Progressing=True as it doesn't refresh Multus DS state on startup
Product: OpenShift Container Platform
Reporter: Vadim Rutkovsky <vrutkovs>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Status: CLOSED ERRATA
QA Contact: zhaozhanqi <zzhao>
Severity: high
Priority: high
Version: 4.1.z
CC: aos-bugs, bbennett, danw, weliang
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-16 06:32:46 UTC
Type: Bug

Description Vadim Rutkovsky 2019-06-28 17:51:33 UTC
Description of problem:
While using a chaosmonkey-like game, the CNO (Cluster Network Operator) got stuck in the Progressing=True state.

Version-Release number of selected component (if applicable):
4.1.3

How reproducible:
Rarely; there is only a small chance of hitting this on an actual system.

Steps to Reproduce:
1. Kill one of the multus pods
2. Kill the CNO pod


Actual results:
If the new CNO pod comes back before the multus DaemonSet refreshes its status, the CNO gets stuck in Progressing=True.

Expected results:
The CNO refreshes its Progressing state once the multus DaemonSet is available.

Additional info:

Comment 2 Dan Winship 2019-06-28 18:01:13 UTC
It's not just multus. The problem is that CNO doesn't ensure that the operator status is correct when it starts up. It only updates it when something changes while the CNO is running. So the order of events is:

  1. multus pod is killed
  2. multus daemonset updates to reflect that we're missing a multus pod
  3. CNO sees the daemonset change, updates operator state to Progressing
  4. CNO is killed
  5. multus pod is restarted, multus daemonset updates to say it's OK
  6. CNO is restarted, does nothing
  7. (5 minutes later) CNO does a full resync, sees that nothing has changed, does nothing
  8. (eventually) another multus pod is killed and comes back, CNO finally fixes operator status
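
The sequence above can be replayed with a small toy model. This is only a sketch of the failure mode, not the CNO's actual code: the `DaemonSet`, `Cluster`, and `Operator` classes and the `reconcile_on_start` flag are hypothetical names made up for illustration, and the real fix may be structured differently. It assumes the operator status is stored outside the operator process, so a stale value survives a restart.

```python
from dataclasses import dataclass, field


@dataclass
class DaemonSet:
    """Toy stand-in for the multus DaemonSet status (hypothetical)."""
    desired: int = 3
    ready: int = 3

    @property
    def rolled_out(self) -> bool:
        return self.ready == self.desired


@dataclass
class Cluster:
    """State that survives operator restarts (hypothetical model)."""
    daemonset: DaemonSet = field(default_factory=DaemonSet)
    progressing: bool = False  # persisted operator status condition


class Operator:
    """Toy operator; `reconcile_on_start` toggles the assumed fix."""

    def __init__(self, cluster: Cluster, reconcile_on_start: bool) -> None:
        self.cluster = cluster
        if reconcile_on_start:
            # Recompute status from current state rather than trusting
            # whatever value was stored before the restart.
            self.update_status()

    def update_status(self) -> None:
        self.cluster.progressing = not self.cluster.daemonset.rolled_out

    def on_daemonset_change(self) -> None:
        # Only invoked for changes observed while this operator is running.
        self.update_status()

    def full_resync(self) -> None:
        # Mirrors step 7: nothing changed since startup, so nothing is done.
        pass


def replay(reconcile_on_start: bool) -> bool:
    """Replay steps 1-7 and return the final Progressing value."""
    cluster = Cluster()
    operator = Operator(cluster, reconcile_on_start)

    cluster.daemonset.ready -= 1    # steps 1-2: a multus pod is killed
    operator.on_daemonset_change()  # step 3: status becomes Progressing
    del operator                    # step 4: the operator is killed
    cluster.daemonset.ready += 1    # step 5: pod is back, nobody is watching
    operator = Operator(cluster, reconcile_on_start)  # step 6: restart
    operator.full_resync()          # step 7: resync sees no change
    return cluster.progressing


if __name__ == "__main__":
    print("without startup reconcile:", replay(False))
    print("with startup reconcile:", replay(True))
```

In this model the stale Progressing value persists only when the operator skips the status recomputation at startup, which matches the behaviour described in steps 6 and 7.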

Comment 3 Casey Callendrello 2019-07-29 13:21:13 UTC
I think Alexander fixed this. Over to him to verify and close.

Comment 4 Alexander Constantinescu 2019-07-30 14:13:26 UTC
Yes, this has been fixed with the PR: https://github.com/openshift/cluster-network-operator/pull/232

I am setting the status to MODIFIED for QA testing.

Comment 6 Weibin Liang 2019-07-30 15:56:54 UTC
Tested and verified in v4.2.0-0.ci-2019-07-30-115127; the CNO no longer gets stuck in Progressing=True.

[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             False       True          False      43s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             False       True          False      63s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      2s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      14s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      95s
[root@dhcp-41-193 ~]#
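
A check like the one above could be scripted. The following is a minimal sketch, assuming the fixed-width table layout shown in this comment; `parse_table` and `is_settled` are hypothetical helper names, not part of `oc`. Columns are sliced by header offsets rather than split on whitespace, because the VERSION column is empty in the first two outputs and a plain split would shift the remaining values one column to the left.

```python
import re
import sys


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse fixed-width `oc get` style output using header column offsets."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Each header word marks where its column starts.
    columns = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(columns):
            # A column ends where the next one starts; the last runs to EOL.
            end = columns[i + 1][1] if i + 1 < len(columns) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


def is_settled(row: dict[str, str]) -> bool:
    """True when the operator is available and neither progressing nor degraded."""
    return (
        row.get("AVAILABLE") == "True"
        and row.get("PROGRESSING") == "False"
        and row.get("DEGRADED") == "False"
    )


if __name__ == "__main__":
    # Assumed usage: oc get co network | python check_co.py
    for row in parse_table(sys.stdin.read()):
        state = "settled" if is_settled(row) else "not settled"
        print(f"{row.get('NAME', '?')}: {state} (since {row.get('SINCE', '?')})")
```

Slicing by offsets relies on the values being aligned under their headers, as they are in the transcript above; output whose values overflow a column would need a different approach, such as requesting structured output from the CLI.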

Comment 7 errata-xmlrpc 2019-10-16 06:32:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922