1805444 – [4.3] Multus should not cause machine to go not ready when a default SDN is updated

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Douglas Smith
QA Contact:	Weibin Liang
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1806603
Depends On:	1805774
Blocks:

Reported:	2020-02-20 19:45 UTC by Douglas Smith
Modified:	2020-03-16 19:39 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-03-10 23:54:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 485	None	closed	Bug 1805444: Uses the readiness indicator file option for Multus [backport 4.3]	2021-02-11 02:10:00 UTC
Github	openshift multus-cni pull 48	None	closed	Bug 1805444: Exposes readinessindicatorfile and uses wait.PollImmediate [backport 4.3]	2021-02-11 02:10:01 UTC
Red Hat Product Errata	RHBA-2020:0676	None	None	None	2020-03-10 23:54:18 UTC

Description Douglas Smith 2020-02-20 19:45:34 UTC

Description of problem: This was discovered while investigating https://bugzilla.redhat.com/show_bug.cgi?id=1793635 (for upgrades).

The fix includes using the Multus "readinessindicatorfile" option.
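
As a rough illustration of the idea behind a readiness indicator file, the sketch below polls for a file and reports whether it appeared before a timeout. It is not the Multus implementation (Multus is written in Go, and the linked pull request mentions wait.PollImmediate). The function name `wait_for_readiness_file` and its `interval` and `timeout` parameters are hypothetical, chosen only for this example.

```python
import os
import time


def wait_for_readiness_file(path, interval=1.0, timeout=60.0):
    """Return True once `path` exists, or False if `timeout` seconds pass first.

    Hypothetical helper for illustration only. The first check runs
    immediately, without an initial sleep, and later checks are spaced
    `interval` seconds apart.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Treat the mere existence of the file as the readiness signal.
        if os.path.exists(path):
            return True
        # Give up once the deadline has passed.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass the path of the default network's CNI configuration file and proceed only when the function returns True.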

How reproducible: During upgrades.

Comment 1 Lalatendu Mohanty 2020-02-27 18:49:16 UTC

Please answer the following questions so that we can do the impact analysis of the bug.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug? 
What cluster functionality is degraded while hitting the bug?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss etc. 
Is it possible to recover the cluster from the bug?
Is recovery automatic without intervention?  I.e. is the condition transient?
Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
Is recovery impossible (bricked cluster)?
What is the observed rate of failure we see in CI?
Is there a manual workaround that exists to recover from the bug? What are manual steps?

Comment 2 Dan Winship 2020-02-28 13:31:38 UTC

(In reply to Lalatendu Mohanty from comment #1)
> Please answer the following questions to do the impact analysis of the bug.
> 
> What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing
> this bug exhibit?

This is an attempt to fix bug 1785457 and there are more details there. (This is split off from that because it's not clear if this is a complete fix or not.)

I don't know what shows up in Telemetry or Insights, but the user-visible effect is that customer traffic is dropped during cluster upgrades.

> What kind of clusters are impacted because of the bug? 

Particularly GCP, but to some extent all.

> What cluster functionality is degraded while hitting the bug?

Basically all of it. Network traffic is disrupted, and in particular, traffic to the apiservers is disrupted.

> Can this bug cause data loss? Data loss = API server data loss or CRD state
> information loss etc. 

No

> Is it possible to recover the cluster from the bug?
> Is recovery automatic without intervention?  I.e. is the condition transient?
> Is recovery possible with the only intervention being 'oc adm upgrade …' to
> a new release image with a fix?
> Is recovery impossible (bricked cluster)?
> Is there a manual workaround that exists to recover from the bug? What are
> manual steps?

Recovery is automatic

> What is the observed rate of failure we see in CI?

Not sure, but there are 9 customer cases attached to bug 1785457.

Comment 5 Weibin Liang 2020-03-02 19:12:28 UTC

Tested and verified in 4.3.0-0.nightly-2020-03-02-094404

[root@dhcp-41-193 FILE]# oc describe daemonset.apps/multus | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-b5ckl | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-dx6d2 | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-02-094404   True        False         16m     Cluster version is 4.3.0-0.nightly-2020-03-02-094404
[root@dhcp-41-193 FILE]#
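
The grep check above could also be scripted. The following is a minimal sketch, assuming the text of `oc describe` has already been captured into a string; the helper name `readiness_indicator_files` is hypothetical and only illustrates pulling the flag value out of that text.

```python
import re

# Matches the flag as it appears in the `oc describe` output above.
FLAG_RE = re.compile(r"--readiness-indicator-file=(\S+)")


def readiness_indicator_files(describe_output):
    """Return every --readiness-indicator-file value found in the given text.

    Hypothetical helper for illustration; an empty list means the flag
    was not present in the output.
    """
    return FLAG_RE.findall(describe_output)
```

An empty result for a Multus pod or daemonset would suggest the option is not set on that object.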

Comment 6 Stephen Cuppett 2020-03-06 13:00:54 UTC

*** Bug 1806603 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-03-10 23:54:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676

Comment 9 W. Trevor King 2020-03-16 19:39:50 UTC

(In reply to Dan Winship from comment #2)
> (In reply to Lalatendu Mohanty from comment #1)
> > What cluster functionality is degraded while hitting the bug?
> 
> Basically all of it. Network traffic is disrupted, and in particular,
> traffic to the apiservers is disrupted.
> ...
> > Is it possible to recover the cluster from the bug?
> > ...
>
> Recovery is automatic

Following up out of band with Dan, it's also important to note that the automatic recovery should only take a minute or two, so while the networking impact is severe, it's also brief. Since [1], our position has been that brief workload downtime is acceptable (or at least not sufficient grounds to pull an update edge), so I'm removing the UpgradeBlocker keyword.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/40
