Description of problem:

This was discovered in the investigation of https://bugzilla.redhat.com/show_bug.cgi?id=1793635 (for upgrades). The fix includes using the Multus "readinessindicatorfile" option.

How reproducible:

During upgrades.
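For context on the mechanism: the readiness-indicator-file option makes Multus hold off on reporting itself ready until the default network's CNI config file exists on disk, which avoids routing traffic through a node whose default network is not yet configured. The sketch below simulates only that wait loop; the path, timing, and script are illustrative assumptions, not the real Multus implementation.

```shell
#!/bin/sh
# Illustrative sketch of the readiness-indicator-file idea (NOT real Multus
# code): poll until the default network's CNI config file appears, and only
# then report ready. Path and timings here are hypothetical.

INDICATOR="${1:-/tmp/demo-cni/10-ovn-kubernetes.conf}"

# Simulate the default CNI plugin writing its config a moment later.
mkdir -p "$(dirname "$INDICATOR")"
( sleep 1; echo '{"cniVersion": "0.3.1"}' > "$INDICATOR" ) &

# Poll for the indicator file before declaring readiness.
i=0
while [ "$i" -lt 10 ]; do
    if [ -f "$INDICATOR" ]; then
        echo "ready: found $INDICATOR"
        exit 0
    fi
    sleep 1
    i=$((i + 1))
done

echo "not ready: $INDICATOR never appeared" >&2
exit 1
```

Until the indicator file shows up, the script stays in the loop; in the real fix this is what keeps Multus from prematurely signaling readiness during an upgrade while the default network plugin is still restarting.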
Please answer the following questions to do the impact analysis of the bug.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?

What kind of clusters are impacted because of the bug?

What cluster functionality is degraded while hitting the bug?

Can this bug cause data loss? Data loss = API server data loss or CRD state information loss etc.

Is it possible to recover the cluster from the bug?
Is recovery automatic without intervention? I.e. is the condition transient?
Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
Is recovery impossible (bricked cluster)?

What is the observed rate of failure we see in CI?

Is there a manual workaround that exists to recover from the bug? What are manual steps?
(In reply to Lalatendu Mohanty from comment #1)
> Please answer the following questions to do the impact analysis of the bug.
>
> What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing
> this bug exhibit?

This is an attempt to fix bug 1785457, and there are more details there. (This is split off from that bug because it's not clear whether this is a complete fix.) I don't know what shows up in Telemetry or Insights, but the user-visible effect is that customer traffic is dropped during cluster upgrades.

> What kind of clusters are impacted because of the bug?

Particularly GCP, but to some extent all.

> What cluster functionality is degraded while hitting the bug?

Basically all of it. Network traffic is disrupted, and in particular, traffic to the apiservers is disrupted.

> Can this bug cause data loss? Data loss = API server data loss or CRD state
> information loss etc.

No.

> Is it possible to recover the cluster from the bug?
> Is recovery automatic without intervention? I.e. is the condition transient?
> Is recovery possible with the only intervention being 'oc adm upgrade …' to
> a new release image with a fix?
> Is recovery impossible (bricked cluster)?
> Is there a manual workaround that exists to recover from the bug? What are
> manual steps?

Recovery is automatic.

> What is the observed rate of failure we see in CI?

Not sure, but there are 9 customer cases attached to bug 1785457.
Tested and verified in 4.3.0-0.nightly-2020-03-02-094404:

[root@dhcp-41-193 FILE]# oc describe daemonset.apps/multus | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-b5ckl | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-dx6d2 | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-02-094404   True        False         16m     Cluster version is 4.3.0-0.nightly-2020-03-02-094404
*** Bug 1806603 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676
(In reply to Dan Winship from comment #2)
> (In reply to Lalatendu Mohanty from comment #1)
> > What cluster functionality is degraded while hitting the bug?
>
> Basically all of it. Network traffic is disrupted, and in particular,
> traffic to the apiservers is disrupted.
> ...
> > Is it possible to recover the cluster from the bug?
> ...
> Recovery is automatic

Following up out of band with Dan: it's also important to note that the automatic recovery should only take a minute or two, so while the networking impact is severe, it's also brief. And since [1], our position has been that brief workload downtime is acceptable (or at least not sufficient grounds to pull an update edge). I'm removing the UpgradeBlocker keyword.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/40