Bug 1976943
| Summary: | 120 node baremetal upgrade from 4.6.17 --> 4.6.25 --> 4.7.11 hangs on operator crash loop with waiting for pod flows | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dave Wilson <dwilson> |
| Component: | Networking | Assignee: | Surya Seetharaman <surya> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aconstan, astoycos, dblack, fbaudin, scollier, sdodson, smalleni, wking, yjoseph |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-21 13:35:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Comment 1
Dave Wilson
2021-06-28 16:27:20 UTC
Copy-pasting from an email summary I sent out. Here is a summary of the main issues on the cluster. There are a couple of major issues (#1 and #2) and some other side issues.

1. The first issue has been several pods crashlooping during the upgrade because they time out waiting for pod flows. We now believe this is due to ovn-controller taking a very long time for its poll interval. We have seen it take close to 30s, and there appears to be a correlation between logical port claim events and the long polling intervals that immediately follow. Numan also mentioned that a full recompute happens whenever ovn-controller reconnects to the SBDB. This issue increases upgrade times since each pod being upgraded takes several attempts to come up, but it does not fully block the upgrade. We do not currently know why the polling takes so long, but we suspect SBDB cluster instability, which can force reconnects, is exacerbating the issue.

2. The persistent issue that actually blocks the upgrade is missing ARP flows for some operator pods when the existing pod is deleted and a new pod is created as part of the upgrade. The upgrade just sits there waiting for the crashlooping pod, and the operator pod can't contact the kube-api service. Numan thinks this is fixed in a recent version of OVN (https://bugzilla.redhat.com/show_bug.cgi?id=1908391 backported to 4.6 or not?), but we are unsure how to deliver this to VZW. The current workaround that helps is to keep watching for pods that crashloop for a long time (longer than the pods that crashloop due to issue #1) and force a recompute of flows on the node (see the sketch after this list).

3. It looks like the ovnkube-master pods are updated before the ovnkube-node pods as part of the network operator update. We want the ovnkube-node pods to update before the ovnkube-master pods. This was supposed to be fixed in recent versions/master, but it could still be an issue in 4.6. We have not seen this specifically cause any problems with the upgrade so far; it is just something to watch out for.

4. The multus and ovn daemonsets take a long time to upgrade (even so, the total control plane upgrade from 4.6.17 to 4.6.25 and then to 4.7.11 could be completed in under 3 hrs per hop if issues #1 and #2 did not exist). Perf/scale opened a bug a while ago on 4.6 for slow multus upgrades (https://bugzilla.redhat.com/show_bug.cgi?id=1920209), which was fixed in more recent 4.6 builds. Upgrading to a more recent 4.6 build instead of 4.6.25 can speed up upgrades a bit in spite of issues #1 and #2.

5. As part of local gateway mode, we have several flows with negations (!=). The current theory is that these could be adding processing overhead for ovn-controller. These flows are added by default and are not related to any networkpolicy added by the user on the cluster.

6. While this might not be affecting upgrades specifically, we are seeing a panic in ovnkube-master when networkpolicies are present; this has been fixed in recent 4.6 by https://github.com/openshift/ovn-kubernetes/pull/501.

We are seeing similar behavior going from 4.6.17 to 4.6.35 (the latest z-stream at this point), so which z-stream we pick doesn't matter, as the underlying issues are OVN related.
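
For reference, a minimal sketch of the recompute workaround mentioned in issue #2. It assumes the stock ovn-kubernetes layout (openshift-ovn-kubernetes namespace, app=ovnkube-node pod label, ovn-controller container) and that the OVN build in use registers the recompute appctl command; the placeholders and names should be adjusted to match the actual cluster.

```
# Find the ovnkube-node pod running on the affected node (<affected-node> is a placeholder)
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o wide | grep <affected-node>

# Force ovn-controller on that node to fully recompute its flows
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c ovn-controller -- \
  ovn-appctl -t ovn-controller recompute
```

The idea is simply to watch for the operator pods that crashloop much longer than the issue #1 pods and trigger this on whichever node they are scheduled to.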