Bug 1805444
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary | [4.3] Multus should not cause machine to go not ready when a default SDN is updated | | |
| Product | OpenShift Container Platform | Reporter | Douglas Smith <dosmith> |
| Component | Networking | Assignee | Douglas Smith <dosmith> |
| Networking sub component | multus | QA Contact | Weibin Liang <weliang> |
| Status | CLOSED ERRATA | Docs Contact | |
| Severity | unspecified | Priority | unspecified |
| CC | aconstan, danw, jeder, lmohanty, nberry, pbergene, william.caban, wking | Keywords | Upgrades |
| Version | 4.3.z | Target Release | 4.3.z |
| Target Milestone | --- | Hardware | Unspecified |
| OS | Unspecified | Whiteboard | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2020-03-10 23:54:09 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | Bug Depends On | 1805774 |
| Bug Blocks | | | |
Description (Douglas Smith, 2020-02-20 19:45:34 UTC)
Comment 1 (Lalatendu Mohanty):

Please answer the following questions so we can do the impact analysis of the bug.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kind of clusters are impacted because of the bug?
- What cluster functionality is degraded while hitting the bug?
- Can this bug cause data loss? (Data loss = API server data loss or CRD state information loss, etc.)
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention, i.e. is the condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
  - Is recovery impossible (bricked cluster)?
- What is the observed rate of failure we see in CI?
- Is there a manual workaround that exists to recover from the bug? What are the manual steps?

Comment 2 (Dan Winship), in reply to Lalatendu Mohanty from comment #1:

> What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?

This is an attempt to fix bug 1785457, and there are more details there. (This is split off from that bug because it's not clear whether this is a complete fix.) I don't know what shows up in Telemetry or Insights, but the user-visible effect is that customer traffic is dropped during cluster upgrades.

> What kind of clusters are impacted because of the bug?

Particularly GCP, but to some extent all.

> What cluster functionality is degraded while hitting the bug?

Basically all of it. Network traffic is disrupted, and in particular, traffic to the apiservers is disrupted.

> Can this bug cause data loss? Data loss = API server data loss or CRD state information loss etc.

No.

> Is it possible to recover the cluster from the bug?
> Is recovery automatic without intervention? I.e. is the condition transient?
> Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
> Is recovery impossible (bricked cluster)?
> Is there a manual workaround that exists to recover from the bug? What are manual steps?

Recovery is automatic.

> What is the observed rate of failure we see in CI?

Not sure, but there are 9 customer cases attached to bug 1785457.

Verification:

Tested and verified in 4.3.0-0.nightly-2020-03-02-094404.

[root@dhcp-41-193 FILE]# oc describe daemonset.apps/multus | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-b5ckl | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc describe pod/multus-dx6d2 | grep readiness-indicator-file
      --readiness-indicator-file=/var/run/multus/cni/net.d/10-ovn-kubernetes.conf
[root@dhcp-41-193 FILE]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-02-094404   True        False         16m     Cluster version is 4.3.0-0.nightly-2020-03-02-094404
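For anyone repeating the verification above on their own cluster, below is a minimal sketch of the same spot check. It assumes the Multus DaemonSet is named multus in the openshift-multus namespace and that its pods carry an app=multus label; those names and the label selector are assumptions about the deployed manifests rather than something stated in this bug, and the indicator file itself differs by default SDN (10-ovn-kubernetes.conf in the OVN-Kubernetes output above).

```shell
# Hypothetical spot check (names and label assumed, see note above): confirm
# the DaemonSet template carries the --readiness-indicator-file argument added
# by this fix, then confirm every running multus pod picked it up.
oc -n openshift-multus describe daemonset multus | grep readiness-indicator-file

for pod in $(oc -n openshift-multus get pods -l app=multus -o name); do
  oc -n openshift-multus describe "$pod" | grep readiness-indicator-file
done
```

Per the bug summary, the point of the flag is that Multus no longer causes the machine to go not ready while the default SDN is updated; in the verification output above the argument points at the default network's CNI config file.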
*** Bug 1806603 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676

Follow-up, in reply to Dan Winship from comment #2:

> (In reply to Lalatendu Mohanty from comment #1)
> > What cluster functionality is degraded while hitting the bug?
>
> Basically all of it. Network traffic is disrupted, and in particular, traffic to the apiservers is disrupted.
> ...
> > Is it possible to recover the cluster from the bug?
> ...
> Recovery is automatic

Following up out of band with Dan, it's also important to note that the automatic recovery should only take a minute or two, so while the networking impact is severe, it is also brief. And since [1], our position has been that brief workload downtime is acceptable (or at least not sufficient grounds to pull an update edge), so I'm removing the UpgradeBlocker keyword.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/40
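Purely as an illustrative follow-on (not part of the verification or the errata), one hedged way to watch for the readiness flap this bug describes is to monitor the node Ready condition while an upgrade rewrites the default SDN configuration:

```shell
# Watch node status during the upgrade window; with the fix in place, nodes
# should stay Ready while the default SDN config is replaced.
oc get nodes -w

# Or poll only the Ready condition (jsonpath expression is illustrative):
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```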