Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/320
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=failed.*Services+should+be+rejected+when+no+endpoints+exist&maxAge=336h&context=0&type=all

This seems to happen often on FCOS and RHEL7, and rarely occurs on RHCOS on Azure and oVirt.
This blocks OKD e2e tests, PTAL
Seems there's a similar issue on OVN - https://github.com/ovn-org/ovn-kubernetes/issues/928 (not sure if the causes are related, though)
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1734321#c18, I did on the host:

sh-5.0# nft list ruleset
sh-5.0# iptables-nft -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
# Warning: iptables-legacy tables present, use iptables-legacy to see them
sh-5.0# iptables-nft -A INPUT -m comment --comment "ricky test2" -p tcp --destination 1.1.1.1 -j REJECT
sh-5.0# iptables-nft -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
REJECT     tcp  --  anywhere             one.one.one.one      /* ricky test2 */ reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
# Warning: iptables-legacy tables present, use iptables-legacy to see them
sh-5.0# nft list ruleset
table ip filter {
	chain INPUT {
		type filter hook input priority filter; policy accept;
		meta l4proto tcp ip daddr 1.1.1.1 counter packets 0 bytes 0 reject
	}

	chain FORWARD {
		type filter hook forward priority filter; policy accept;
	}

	chain OUTPUT {
		type filter hook output priority filter; policy accept;
	}
}
sh-5.0# rpm -qa '*tables*' '*nft*'
iptables-nft-1.8.3-5.fc31.x86_64
libnftnl-1.1.3-2.fc31.x86_64
iptables-1.8.3-5.fc31.x86_64
iptables-services-1.8.3-5.fc31.x86_64
iptables-libs-1.8.3-5.fc31.x86_64
nftables-0.9.1-3.fc31.x86_64

The same rules look different from inside the SDN container, though:

sh-5.0# crictl ps -a | grep sdn
fc93d2cdc5edc  880ec903e44090affabe0d5aa0898276cc1990e24f87ae261654545ac3f10cc8  19 minutes ago  Running  sdn  0  8ccc56bde980c
sh-5.0# crictl exec -ti fc93d2cdc5edc sh
sh-4.2# nft list ruleset
table ip filter {
	chain INPUT {
		type filter hook input priority 0; policy accept;
		meta l4proto tcp ip daddr 1.1.1.1 counter packets 0 bytes 0
	}

	chain FORWARD {
		type filter hook forward priority 0; policy accept;
	}

	chain OUTPUT {
		type filter hook output priority 0; policy accept;
	}
}
sh-4.2# iptables --version
iptables v1.8.3 (legacy)
sh-4.2# rpm -qa '*tables*' '*nft*'
libnftnl-1.0.8-1.el7.x86_64
nftables-0.8-14.el7.x86_64
iptables-1.4.21-33.el7.x86_64

Richardo, could you help us with this one?
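The divergence above appears to come from mixing iptables generations/backends: the Fedora host writes rules through the nft backend, while the RHEL7-based SDN container image ships legacy iptables and an old nftables. A quick way to check which backend each side is using (a sketch, assuming the iptables 1.8 wrapper binaries are installed):

# Reports "(legacy)" or "(nf_tables)" depending on the backend behind the wrapper:
iptables --version
# Dump each backend's rules directly, bypassing the wrapper:
iptables-legacy-save | head
iptables-nft-save | head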
I just validated that nothing is wrong with the iptables rules, and everything works as intended:

sh-4.2# nc 172.30.206.236 9000
Ncat: Connection refused.

which is the correct behavior. The issue is that agnhost in the test image is outputting the wrong error message for the OKD test runs. No clue why.
The test returns REFUSED if it's running with host networking. Perhaps something is blocking ICMP packets on the SDN network?
Any update on this? Still seeing this happen in CI. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/558
> Seems there's a similar issue on OVN - https://github.com/ovn-org/ovn-kubernetes/issues/928
> (not sure if the causes are related, though)

We use a different kube-proxy for OVN, so the root cause must be different, but thanks a lot for adding this to the bug. It also looks like https://github.com/kubernetes/kubernetes/pull/72534, but that fix is already applied. I will set up an OKD environment to reproduce this.
Note that after OKD switched its default network plugin to OVN, this test no longer fails (it's being skipped, IIRC).
Do we still intend to support OpenShiftSDN on OKD?
We do support OpenShiftSDN; however, we don't run CI tests with it.
// Working notes. Feel free to ignore.

This looks like a kernel/iptables bug:

# cat /etc/redhat-release
Fedora release 31 (Thirty One)
# rpm -q iptables kernel
iptables-1.8.3-7.fc31.x86_64
kernel-5.4.13-201.fc31.x86_64

sh-5.0# iptables -nvL PREROUTING -t raw
Chain PREROUTING (policy ACCEPT 28247 packets, 49M bytes)
 pkts bytes target     prot opt in     out     source               destination
    7   420 TRACE      all  --  *      *       0.0.0.0/0            172.30.240.77

sh-5.0# dmesg | grep TRACE | grep ID=2535 --color
[ 2705.724089] TRACE: raw:PREROUTING:policy:2 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724097] TRACE: mangle:PREROUTING:policy:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724103] TRACE: nat:PREROUTING:rule:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724127] TRACE: nat:KUBE-SERVICES:return:116 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724132] TRACE: nat:PREROUTING:rule:2 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724137] TRACE: nat:KUBE-PORTALS-CONTAINER:return:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724141] TRACE: nat:PREROUTING:policy:5 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724158] TRACE: mangle:FORWARD:policy:1 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724163] TRACE: filter:FORWARD:rule:1 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724169] TRACE: filter:KUBE-FORWARD:return:5 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724174] TRACE: filter:FORWARD:rule:2 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)
[ 2705.724179] TRACE: filter:KUBE-SERVICES:rule:2 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307)

sh-5.0# iptables -nvL KUBE-SERVICES
Chain KUBE-SERVICES (3 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.240.77        /* valadas/hello-openshift:8888-tcp has no endpoints */ tcp dpt:8888 reject-with icmp-port-unreachable
    1    60 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.240.77        /* valadas/hello-openshift:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable

Interestingly, we see the REJECT. It is there, but the application isn't getting it. Checking a tcpdump of the node's tun0, it looks like the reject is matched but never sent:

$ tshark -r /tmp/host.pcap -Y 'ip.addr == 10.129.2.9'
   51   3.702907   0.636620 11:59:01.760554   10.129.2.9 → 172.30.240.77 TCP 74 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197234432 TSecr=0 WS=128
   67   4.705022   0.118114 11:59:02.762669   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197235435 TSecr=0 WS=128
  120   6.753018   0.103326 11:59:04.810665   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197237483 TSecr=0 WS=128
  439  10.784990   0.059573 11:59:08.842637   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197241515 TSecr=0 WS=128

I will try to isolate the problem by just using a regular Fedora VM, so that our kernel guys don't need to deal with the whole SDN complexity.
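For reference, the trace above comes from a plain iptables TRACE rule in the raw table. A minimal sketch of that setup, run as root on the node (the service IP is the one from this cluster; on some kernels the nf_log_ipv4 logger must also be active for entries to reach dmesg):

# Log every table/chain a packet to the service IP traverses:
iptables -t raw -A PREROUTING -d 172.30.240.77 -j TRACE
# Each hop shows up in the kernel log as "TRACE: table:chain:type:rulenum":
dmesg | grep 'TRACE:'
# Delete the rule when done; it is very noisy:
iptables -t raw -D PREROUTING -d 172.30.240.77 -j TRACE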
This is affecting 92.31% of failing RHEL7 scaleup jobs. https://search.svc.ci.openshift.org/?search=failed%3A.*Services+should+be+rejected+when+no+endpoints+exist&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job
Still affecting release-informing jobs for 4.4 and 4.3:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3
A minor update on this: it is happening because the kernel rate-limits the ICMP messages it sends (the port-unreachable generated by REJECT is one of them). I'm trying to figure out where these ICMP messages come from and why this shows up in RHEL.
Moving temporarily to kernel engineering. This is probably not a kernel bug, but I need some help from kernel engineering.

The problem we have is on OpenShift, which has multiple network namespaces, hundreds of iptables rules, and hundreds of OVS flows. I observe that iptables REJECTs aren't working (the ICMP replies are dropped) with the default ipv4 icmp_ratelimit the vast majority of the time (>95%). If I set /proc/sys/net/ipv4/icmp_ratelimit to 0, the REJECTs start working again.

I need to figure out why the rate limit is being reached. I tried to sniff the ICMP traffic on every NIC in every netns while the rate limit was set to 0:

# (for i in $(lsns | cut -c -90 | grep net | awk '{print $4}'); do nsenter -n -t $i tcpdump -i any icmp & done)

But after 15 minutes I didn't see a single ICMP packet except the ones I deliberately created by hitting the REJECT rule. So what I need to figure out is where the ICMP traffic that provokes the rate limiting is coming from.

Also, I'm setting this to RHEL 8 because it's newer, but it also happens in RHEL 7.
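For anyone trying to reproduce this, a minimal sketch of the check described above, run as root on an affected node:

# Default is 1000, i.e. at most one rate-limited ICMP reply per second:
cat /proc/sys/net/ipv4/icmp_ratelimit
# Disable the limit; the REJECTs immediately start working again:
sysctl -w net.ipv4.icmp_ratelimit=0
# Restore the default after testing:
sysctl -w net.ipv4.icmp_ratelimit=1000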
This test also fails on FedoraCoreOS 31 with kernel 5.5 - how's that an RHCOS kernel bug?
It's very unlikely that there is a problem at the kernel level; I just need help from a kernel engineer to understand where the ICMP traffic is coming from.
Happening on rhel8 too -- https://bugzilla.redhat.com/show_bug.cgi?id=1829241
Thanks for noticing the rate limiting, @Juan.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt (search for icmp_ratemask and icmp_ratelimit)

We are rate limiting Destination Unreachable based on the default kernel ratemask, and the ratelimit is 1 response every 1000 ms (1/sec), which seems low. We need to decide if we want to remove Destination Unreachable from the ratemask, or if we want to drop the rate limit to something like 100 ms.
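For concreteness, a sketch of both options as sysctls (per ip-sysctl.txt the default icmp_ratemask is 6168 = 0x1818, and bit 3 of the mask is Destination Unreachable):

# Option 1: drop Destination Unreachable from the mask entirely:
#   6168 & ~(1 << 3) = 6160
sysctl -w net.ipv4.icmp_ratemask=6160
# Option 2: keep the mask but allow one rate-limited reply every 100 ms:
sysctl -w net.ipv4.icmp_ratelimit=100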
*** Bug 1829080 has been marked as a duplicate of this bug. ***
*** Bug 1829583 has been marked as a duplicate of this bug. ***
OK, this is blocking the release of RHEL 8.2 and really all OS-level changes for RHCOS right now. See e.g. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10379 So we need to do *something* here - which, based on my understanding right now, might simply be marking this test as flaky, right?
Given that things appear to be working normally, just that for some reason we now need to adjust kernel tunables for ICMP, I'm moving this back to OpenShift. We should certainly figure out why this wasn't happening as much before and is being hit hard now.
Ben, I agree that limiting to just one packet per second is probably excessive; I'm still concerned about not knowing WHY this happens. I added a PR to remove Destination Unreachable from the ratemask in openshift-sdn.
*** Bug 1829241 has been marked as a duplicate of this bug. ***
Yeah I'm just going to unilaterally remove the private flag. We use that way too much.
(In reply to Dan Williams from comment #24)
> We should certainly figure out why this wasn't happening as much before and
> is being hit hard now.

Presumably some OCP component has started frequently trying to access some unreachable destination.
Example failure from [1,2]:

Apr 24 21:25:49.619: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-d982qc1l-8c7ff.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-services-7069 execpod-noendpoints982dr -- /bin/sh -x -c /agnhost connect --timeout=3s no-pods:80'
Apr 24 21:25:53.290: INFO: rc: 1
Apr 24 21:25:53.290: INFO: error didn't contain 'REFUSED', keep trying: error running /usr/bin/kubectl --server=https://api.ci-op-d982qc1l-8c7ff.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-services-7069 execpod-noendpoints982dr -- /bin/sh -x -c /agnhost connect --timeout=3s no-pods:80:
Command stdout:

stderr:
+ /agnhost connect --timeout=3s no-pods:80
TIMEOUT
command terminated with exit code 1

error:
exit status 1
...
Apr 24 21:25:53.743: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/network/service.go:2621]: Unexpected error:
    <*errors.errorString | 0xc000208960>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
...
failed: (48.6s) 2020-04-24T21:25:53 "[sig-network] Services should be rejected when no endpoints exist [Skipped:Network/OVNKubernetes] [Skipped:ibmcloud] [Suite:openshift/conformance/parallel] [Suite:k8s]"

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10242
[2]: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10242/build-log.txt
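The failing check can also be run by hand; a sketch using the pod and namespace names from the log above:

# Healthy behavior is "REFUSED" (the REJECT's ICMP reply arrives);
# on affected kernels the reply is rate-limited away and this prints TIMEOUT:
kubectl exec --namespace=e2e-services-7069 execpod-noendpoints982dr -- \
  /bin/sh -x -c '/agnhost connect --timeout=3s no-pods:80'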
Dropping a bunch of bracketed test metadata from the bug summary now that we have it in a comment. Tooling like CI search and Sippy should be able to find this bug from the public comment, without us having to put the distracting noise in the bug summary.
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1831684 via Sippy, so I'm not sure if Sippy was delayed or if it doesn't pick up on bugs after they've been marked public. That one can likely be duped, but it's also clear that this isn't scoped only to clusters with RHEL workers, so I'm leaving it as is for now. This appears to be blocking all merges into 4.4.
The linked PR is closed, and I don't believe that we should stall all OpenShift releases until this test passes. Is there a next step and who owns it? I'd vote to skip the test for now.
OK so after a digression re-spinning up my libvirt development setup, I can clearly pin this on a kernel change from 8.1 to 8.2.

Scenario:

Spun up a 4.3.17 (RHCOS 8.1) cluster in libvirt, and ran:

$ ./_output/local/bin/linux/amd64/openshift-tests run openshift/conformance/parallel --run 'Services should be rejected'

many times - all passed.

Then used `oc debug node/` on a random worker, and downloaded the 8.2 kernel and switched to it:

# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ab14709af5c410dd36ebeda30809c791ef08328af2aaa87b23372e39c17af44d
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202004211653.0 (2020-04-21T16:58:52Z)
      ReplacedBasePackages: linux-firmware 20190516-94.git711d3297.el8 -> 20191202-97.gite8a0f4c9.el8, kernel-modules kernel-modules-extra kernel kernel-core 4.18.0-147.8.1.el8_1 -> 4.18.0-193.el8

I used `oc adm cordon` on the other workers to ensure the test always ran on the node with my updated kernel.

And now running the test consistently fails. And just to re-verify: using `rpm-ostree reset -ol --reboot`, the test starts passing again.
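The ReplacedBasePackages line suggests the swap was done with rpm-ostree's package override; a hedged sketch of that workflow (the RPM filenames are illustrative, not the exact ones used):

# From a root shell on the node (e.g. `oc debug node/<worker>`, then `chroot /host`):
rpm-ostree override replace \
    kernel-4.18.0-193.el8.x86_64.rpm \
    kernel-core-4.18.0-193.el8.x86_64.rpm \
    kernel-modules-4.18.0-193.el8.x86_64.rpm \
    kernel-modules-extra-4.18.0-193.el8.x86_64.rpm
systemctl reboot
# Revert to the original deployment when finished:
rpm-ostree reset -ol --reboot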
Now looking at the kernel logs for changes since 8.1/8.2:

walters@toolbox ~/s/d/r/kernel> git log 11e3daec92f2a14a435eaedf05f855d14ff86739.. | grep icmp
- [net] icmp: fix data-race in cmp_global_allow() (Guillaume Nault) [1801587]
- [net] ipv4/icmp: fix rt dst dev null pointer dereference (Guillaume Nault) [1765639]
- [net] bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER (Hangbin Liu) [1756799]
OK I started with 4.18.0-152.el8 as a bisection target based on the above, and this error still reproduces with that. In fact the test fails more or less instantly, consistently. I then went back to 4.18.0-151.el8 and got one failure, but the test started consistently succeeding after that. So I think that points to changes from https://bugzilla.redhat.com/show_bug.cgi?id=1765639
(In reply to Colin Walters from comment #35)
> OK so after a digression re-spinning up my libvirt development setup, I can
> clearly pin this on a kernel change from 8.1 to 8.2.
>
> Scenario:
>
> Spun up a 4.3.17 (RHCOS 8.1) cluster in libvirt, and ran:

I completely agree. On the ppc64le platform we started seeing this issue after the 4.3.18 release (that is when the actual switch from the 8.1 to the 8.2 kernel happened).

> $ ./_output/local/bin/linux/amd64/openshift-tests run
> openshift/conformance/parallel --run 'Services should be rejected'
>
> many times - all passed.
>
> Then used `oc debug node/` on a random worker, and downloaded the 8.2 kernel
> and switched to it:
>
> # rpm-ostree status
> State: idle
> AutomaticUpdates: disabled
> Deployments:
> *
> pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> ab14709af5c410dd36ebeda30809c791ef08328af2aaa87b23372e39c17af44d
> CustomOrigin: Managed by machine-config-operator
> Version: 43.81.202004211653.0 (2020-04-21T16:58:52Z)
> ReplacedBasePackages: linux-firmware 20190516-94.git711d3297.el8 ->
> 20191202-97.gite8a0f4c9.el8, kernel-modules kernel-modules-extra kernel
> kernel-core 4.18.0-147.8.1.el8_1 -> 4.18.0-193.el8
>
> I used `oc adm cordon` on the other workers to ensure the test always ran on
> the node with my updated kernel.
>
> And now running the test consistently fails. And just to re-verify, using
> `rpm-ostree reset -ol --reboot` and the test starts passing again.
One thing that hints at this being an upstream change is the original comment that it's consistently failing on Fedora kernels. The fact that it's been reproducing on RHEL7 would seem to rule that out but possibly it (or something similar) was backported there earlier? I did find https://bugzilla.redhat.com/show_bug.cgi?id=1461282 which is another example of ICMP ratelimiting changes breaking tests.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1832332
*** Bug 1833136 has been marked as a duplicate of this bug. ***
As Colin said, and Vadim reported here, this was reproducible 100% of the time with OKD on Fedora/FCOS 31 with kernel 5.x. Because of this issue we switched OKD over from OpenShift-SDN to OVN, where it has worked reliably without this failure ever since.
[build-cop] this is failing consistently on: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.3/54
See https://bugzilla.redhat.com/show_bug.cgi?id=1832332#c14 for the latest status on this. We'll leave this bug open to track shipping that updated kernel fix in RHCOS/OCP.
I want to outline some options here from my PoV, and they aren't mutually exclusive:

- Get patched kernel attached to 8.2.z errata and ship it before 8.2.2 ships (RHEL kernel team + stakeholders agreeing to ship in OpenShift before errata ships)
- Chase down whatever in the SDN or test framework is causing the ICMP redirects (OpenShift SDN team)
- Ship 8.2.0 as is and temporarily disable the test https://github.com/openshift/origin/pull/24980/ (OpenShift architects decision)
@Micah: how can we verify if the 8.2.z fix for this bug is in the latest RHCOS?
It looks like the 4.4 kernels available have the fixes:

- http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/plashets/4.4-el8/building/x86_64/os/Packages/
- http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/plashets/4.5-el8/building/x86_64/os/Packages/
- http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/plashets/4.6-el8/building/x86_64/os/Packages/
Removing OKDBlocker, this affects OCP as well so it's unreasonable to block OKD for this reason. Also the test is skipped.
(In reply to Aniket Bhat from comment #49)
> @Micah: how can we verify if the 8.2.z fix for this bug is in the latest
> RHCOS?

The underlying kernel issue appears to be BZ#1836302. From there, the BZ shows that the issue is fixed in kernel-4.18.0-193.7.1.el8_2.

To find out if that kernel (or newer) is in the latest RHCOS 4.6, use the release browser to show the OS contents.

For example, the latest RHCOS 4.6 build:
https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6&release=46.82.202009161140-0#46.82.202009161140-0

And the OS contents for that build:
https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202009161140-0

which shows that it contains kernel-4.18.0-193.19.1.el8_2. So it should be safe to test with the latest RHCOS 4.6 builds to see if this issue has been properly fixed.
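A quicker spot-check from a running cluster (a sketch; substitute a real node name):

# Query the kernel package directly on a node:
oc debug node/<node> -- chroot /host rpm -q kernel
# The fix is present if this reports kernel-4.18.0-193.7.1.el8_2 or newer.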
There are no open cases, and it's fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1832332. It's likely still broken in 4.4, but since nobody seems to care, I'm closing this.
Aww. It's unfortunate that https://bugzilla.redhat.com/show_bug.cgi?id=1832332 is classified. Any reason for it?
(In reply to Tobias Florek from comment #58)
> Aww. It's unfortunate that
> https://bugzilla.redhat.com/show_bug.cgi?id=1832332 is classified. Any
> reason for it?

The TL;DR is that https://lore.kernel.org/netdev/7f71c9a7ba0d514c9f2d006f4797b044c824ae84.1588954755.git.pabeni@redhat.com/T/#u fixes the problem.