Bug 1805444

Summary: [4.3] Multus should not cause machine to go not ready when a default SDN is updated
Product: OpenShift Container Platform
Component: Networking
Networking sub component: multus
Reporter: Douglas Smith <dosmith>
Assignee: Douglas Smith <dosmith>
QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: aconstan, danw, jeder, lmohanty, nberry, pbergene, william.caban, wking
Version: 4.3.z
Keywords: Upgrades
Target Release: 4.3.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-03-10 23:54:09 UTC
Bug Depends On: 1805774

Description Douglas Smith 2020-02-20 19:45:34 UTC
Description of problem: This was discovered during the investigation of https://bugzilla.redhat.com/show_bug.cgi?id=1793635 (for upgrades): while the default SDN's CNI configuration is being updated, Multus can cause the machine to go NotReady.

The fix includes using the Multus "readinessindicatorfile" option.

How reproducible: During upgrades.
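
For context, here is a minimal sketch of how the option surfaces on a running cluster. The flag name and indicator-file path match what was later verified in comment 5; the openshift-multus namespace and the exact commands are assumptions for illustration, not taken from the fix itself:

  # Sketch (assumed namespace/command): the Multus daemonset is started with a
  # readiness indicator file that points at the default SDN's CNI config,
  # e.g. OVN-Kubernetes:
  oc -n openshift-multus describe daemonset.apps/multus | grep readiness-indicator-file
  #     --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
  #
  # Multus treats the presence of that file as the signal that the default network
  # is ready and waits for it rather than erroring, so a rewrite of the default SDN
  # config during an upgrade should not flip the node NotReady.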

Comment 1 Lalatendu Mohanty 2020-02-27 18:49:16 UTC
Please answer the following questions so we can do the impact analysis of the bug.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug? 
What cluster functionality is degraded while hitting the bug?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss etc. 
Is it possible to recover the cluster from the bug?
Is recovery automatic without intervention?  I.e. is the condition transient?
Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
Is recovery impossible (bricked cluster)?
What is the observed rate of failure we see in CI?
Is there a manual workaround that exists to recover from the bug? What are manual steps?

Comment 2 Dan Winship 2020-02-28 13:31:38 UTC
(In reply to Lalatendu Mohanty from comment #1)
> Please answer the following questions so we can do the impact analysis of the bug.
> 
> What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing
> this bug exhibit?

This is an attempt to fix bug 1785457; there are more details there. (It is split off from that bug because it's not clear whether this is a complete fix or not.)

I don't know what shows up in Telemetry or Insights, but the user-visible effect is that customer traffic is dropped during cluster upgrades.

> What kind of clusters are impacted because of the bug? 

Particularly GCP, but to some extent all.

> What cluster functionality is degraded while hitting the bug?

Basically all of it. Network traffic is disrupted, and in particular, traffic to the apiservers is disrupted.

> Can this bug cause data loss? Data loss = API server data loss or CRD state
> information loss etc. 

No

> Is it possible to recover the cluster from the bug?
> Is recovery automatic without intervention?  I.e. is the condition transient?
> Is recovery possible with the only intervention being 'oc adm upgrade …' to
> a new release image with a fix?
> Is recovery impossible (bricked cluster)?
> Is there a manual workaround that exists to recover from the bug? What are
> manual steps?

Recovery is automatic

> What is the observed rate of failure we see in CI?

Not sure, but there are 9 customer cases attached to bug 1785457.

Comment 5 Weibin Liang 2020-03-02 19:12:28 UTC
Tested and verified in 4.3.0-0.nightly-2020-03-02-094404

[root@dhcp-41-193 FILE]# oc describe daemonset.apps/multus | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# 
[root@dhcp-41-193 FILE]# 
[root@dhcp-41-193 FILE]# oc describe pod/multus-b5ckl | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-dx6d2 | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-02-094404   True        False         16m     Cluster version is 4.3.0-0.nightly-2020-03-02-094404
[root@dhcp-41-193 FILE]#
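
A further hedged check beyond the flag verification above: watch node conditions while the default SDN config is updated and confirm that no node flips to NotReady. These are generic oc commands added for illustration, not part of the verification run:

  # Illustrative follow-up (not from the run above): watch node readiness while the
  # default SDN rolls out an updated CNI config; no node should go NotReady.
  oc get nodes -w
  # In a second terminal, confirm the Multus pods stay Running/Ready during the rollout:
  oc -n openshift-multus get pods -w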

Comment 6 Stephen Cuppett 2020-03-06 13:00:54 UTC
*** Bug 1806603 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-03-10 23:54:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676

Comment 9 W. Trevor King 2020-03-16 19:39:50 UTC
(In reply to Dan Winship from comment #2)
> (In reply to Lalatendu Mohanty from comment #1)
> > What cluster functionality is degraded while hitting the bug?
> 
> Basically all of it. Network traffic is disrupted, and in particular,
> traffic to the apiservers is disrupted.
> ...
> > Is it possible to recover the cluster from the bug?
> > ...
>
> Recovery is automatic

Following up out of band with Dan: it's also important to note that the automatic recovery should only take a minute or two, so while the networking impact is severe, it's also brief. And since [1], our position has been that brief workload downtime is acceptable (or at least not sufficient grounds to pull an update edge). I'm removing the UpgradeBlocker keyword.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/40