Bug 1961506
Summary: | NodePorts do not work on RHEL 7.9 workers (was "4.7 -> 4.8 upgrade is stuck at Ingress operator Degraded with rhel 7.9 workers") | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> | |
Component: | Networking | Assignee: | Tim Rozet <trozet> | |
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | aconstan, anusaxen, aos-bugs, astoycos, bbennett, geliu, gpei, hongli, huirwang, mifiedle, mmasters, scuppett, sgreene, trozet, wsun, zzhao | |
Version: | 4.8 | Keywords: | TestBlocker, Upgrades | |
Target Milestone: | --- | Flags: | trozet: needinfo- | |
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1973813 (view as bug list) | Environment: | ||
Last Closed: | 2021-07-27 23:08:55 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1973813 | |||
Bug Blocks: |
Description
Ke Wang
2021-05-18 06:38:17 UTC
This bug blocked this upgrade path test, so I added the UpgradeBlocker keyword.

This issue is hard to reproduce, so I suggest collecting oc adm must-gather as soon as you see "Unable to apply 4.8.0-0.nightly-2021-06-03-101158: wait has exceeded 40 minutes for these operators: ingress".

*** Bug 1971832 has been marked as a duplicate of this bug. ***

The root cause is a bug in openvswitch in handling the check_pkt_larger action on older kernels. The packet is punted to userspace and does not make it to conntrack afterwards. Filed: https://bugzilla.redhat.com/show_bug.cgi?id=1973465

Even if we get a fix for OVS with userspace, the network performance will be bad for packets directed towards OVN on RHEL 7.9 nodes that don't support the check_pkt_len action. After some discussion within the team, it makes sense to try to disable these flows if the kernel does not support this action. The trade-off is that ICMP frag needed will no longer work for packets sent to OVN with an MTU larger than the pod MTU on these nodes. But that trade-off is better than regressing in performance for RHEL nodes.

Removing the OVS bug (https://bugzilla.redhat.com/show_bug.cgi?id=1973465) as a dependency, as we will detect proper support using ovn-kube: https://github.com/ovn-org/ovn-kubernetes/pull/2267

Adding the TestBlocker keyword since this bug is blocking the regression test against OVN with a RHEL cluster.

Per Comment 18, I also tested a fresh install of an OVN cluster with the latest build; the ingress router pod running on the RHEL worker node works well.
$ oc get network/cluster -o jsonpath='{.status.networkType}'
OVNKubernetes

$ oc -n openshift-ingress get pod -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                                       NOMINATED NODE   READINESS GATES
router-default-5787f6cd56-6sxl2   1/1     Running   0          17h   10.130.2.107   ip-10-0-56-71.us-east-2.compute.internal   <none>           <none>

$ oc get node -owide
NAME                                        STATUS   ROLES    AGE   VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-52-205.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.52.205   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-56-22.us-east-2.compute.internal    Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.56.22    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-56-71.us-east-2.compute.internal    Ready    worker   22h   v1.21.0-rc.0+766a5fe   10.0.56.71    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.21.1-11.rhaos4.8.git30ca719.el7
ip-10-0-58-118.us-east-2.compute.internal   Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.58.118   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-58-129.us-east-2.compute.internal   Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.58.129   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-66-137.us-east-2.compute.internal   Ready    worker   23h   v1.21.0-rc.0+766a5fe   10.0.66.137   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
ip-10-0-68-81.us-east-2.compute.internal    Ready    master   23h   v1.21.0-rc.0+766a5fe   10.0.68.81    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106231817-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8

$ oc get co/ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.8.0-0.nightly-2021-06-28-165738   True        False         False      17h

$ curl https://canary-openshift-ingress-canary.apps.hongli-ovn.qe.devcluster.openshift.com -k
Healthcheck requested

OK, so now this issue only affects customers on the 4.7 -> 4.7 and 4.7 -> 4.8 upgrade paths. We found 4.6 -> 4.7 is fine.

I added a release note in https://github.com/openshift/openshift-docs/issues/29652#issuecomment-871157879

So, according to comments 38 and 42, I think we can move this bug to 'verified'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
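
For readers following the fix discussed in the comments above (only install the packet-length flows when the kernel datapath supports the check_pkt_len action), here is a minimal sketch of that decision. It is not the code from the linked ovn-kubernetes pull request. It assumes the datapath feature summary has already been collected as text, for example from `ovs-appctl dpif/show-dp-features <bridge>`, and that the summary contains a line labelled "Check pkt length action". The function name and the two sample strings are hypothetical and were written for illustration, not captured from a real node.

```python
import re

# Illustrative text only, not captured from a real node. It is modeled on
# the per-feature "<name>: Yes/No" summary that
# `ovs-appctl dpif/show-dp-features <bridge>` is assumed to print.
SAMPLE_SUPPORTED = """\
Masked set action: Yes
Check pkt length action: Yes
"""

SAMPLE_UNSUPPORTED = """\
Masked set action: Yes
Check pkt length action: No
"""

# Assumption: the feature line is labelled "Check pkt length action".
_FEATURE_LINE = re.compile(
    r"^\s*Check pkt length action:\s*(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def supports_check_pkt_len(dp_features: str) -> bool:
    """Return True only if the summary explicitly reports support.

    A missing line is treated as "not supported", so the caller skips
    the packet-length flows rather than risk punting to userspace.
    """
    match = _FEATURE_LINE.search(dp_features)
    return bool(match) and match.group(1).lower() in ("yes", "true")


if __name__ == "__main__":
    for name, text in (("supported", SAMPLE_SUPPORTED),
                       ("unsupported", SAMPLE_UNSUPPORTED)):
        action = "install" if supports_check_pkt_len(text) else "skip"
        print(f"{name}: {action} packet-length flows")
```

Defaulting to "skip" when the feature line is absent mirrors the trade-off described in the comments: losing ICMP frag needed on such nodes is preferred over a userspace slow path.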