Bug 1725220 - CNO stuck in Progressing=True as it doesn't refresh Multus DS state on startup
Summary: CNO stuck in Progressing=True as it doesn't refresh Multus DS state on startup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-28 17:51 UTC by Vadim Rutkovsky
Modified: 2019-10-16 06:32 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:32:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID                               Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHBA-2019:2922   0        None      None    None     2019-10-16 06:32:57 UTC

Description Vadim Rutkovsky 2019-06-28 17:51:33 UTC
Description of problem:
While running a chaosmonkey-style test that kills random pods, CNO got stuck in the Progressing=True state.

Version-Release number of selected component (if applicable):
4.1.3

How reproducible:
Rare; hitting this on an actual system requires killing the CNO pod within the timing window described under "Actual results" below.

Steps to Reproduce:
1. Kill one of the multus pods
2. Kill the CNO pod before the multus DaemonSet status recovers (a command sketch follows)
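
A rough sketch of these steps with oc; the namespaces, the CNO label selector, and the workload names below are assumptions based on a default 4.x install, and the multus pod name is a placeholder:

  # 1. Pick one multus pod and kill it
  $ oc -n openshift-multus get pods
  $ oc -n openshift-multus delete pod multus-<id>

  # 2. Immediately kill the CNO pod, before the DaemonSet status recovers
  $ oc -n openshift-network-operator delete pod -l name=network-operator

  # Watch the network ClusterOperator; on affected builds Progressing stays True
  $ oc get co network -w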


Actual results:
If the new CNO pod comes back before the multus DaemonSet has refreshed its status, CNO gets stuck in Progressing=True.

Expected results:
CNO should refresh the Progressing condition once the Multus DaemonSet is available again.

Additional info:

Comment 2 Dan Winship 2019-06-28 18:01:13 UTC
It's not just multus. The problem is that CNO doesn't ensure that the operator status is correct when it starts up. It only updates it when something changes while the CNO is running. So the order of events is:

  1. multus pod is killed
  2. multus daemonset updates to reflect that we're missing a multus pod
  3. CNO sees the daemonset change, updates operator state to Progressing
  4. CNO is killed
  5. multus pod is restarted, multus daemonset updates to say it's OK
  6. CNO is restarted, does nothing
  7. (5 minutes later) CNO does a full resync, sees that nothing has changed, does nothing
  8. (eventually) another multus pod is killed and comes back, CNO finally fixes operator status
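
The stuck state after step 6 can be observed directly: the DaemonSet reports healthy again while the ClusterOperator condition was never refreshed. A sketch, assuming the DaemonSet is named multus in the openshift-multus namespace:

  # DaemonSet looks healthy again: DESIRED == READY
  $ oc -n openshift-multus get ds multus

  # ...but the operator condition still reports the stale value
  $ oc get co network -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}'
  # prints True while stuck, even though nothing is actually progressing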

Comment 3 Casey Callendrello 2019-07-29 13:21:13 UTC
I think Alexander fixed this. Over to him to verify and close.

Comment 4 Alexander Constantinescu 2019-07-30 14:13:26 UTC
Yes, this has been fixed with the PR: https://github.com/openshift/cluster-network-operator/pull/232

I am setting the status to MODIFIED for QA testing.

Comment 6 Weibin Liang 2019-07-30 15:56:54 UTC
Tested and verified on v4.2.0-0.ci-2019-07-30-115127; CNO no longer gets stuck in Progressing=True.

[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             False       True          False      43s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             False       True          False      63s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      2s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      14s
[root@dhcp-41-193 ~]# oc get co network 
NAME      VERSION                        AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.ci-2019-07-30-115127   True        False         False      95s
[root@dhcp-41-193 ~]#

Comment 7 errata-xmlrpc 2019-10-16 06:32:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

