Bug 1781575 - "[sig-network] Services should be rejected when no endpoints exist" test fails frequently on RHEL7 nodes
Summary: "[sig-network] Services should be rejected when no endpoints exist" test fail...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Juan Luis de Sousa-Valadas
QA Contact: huirwang
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1829080 1829241 1829583 1833136 (view as bug list)
Depends On:
Blocks: 1779811 1825255 1831684 1832332 1834184
TreeView+ depends on / blocked
 
Reported: 2019-12-10 09:49 UTC by Vadim Rutkovsky
Modified: 2020-11-30 15:04 UTC (History)
22 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1832332 (view as bug list)
Environment:
Last Closed: 2020-10-13 09:02:02 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24980/ 0 None None None 2020-10-06 23:15:07 UTC
Github openshift sdn pull 135 0 None closed Bug 1781575: Discard destination unreachable from rate mask 2021-02-08 17:30:32 UTC

Comment 1 Vadim Rutkovsky 2019-12-10 14:24:21 UTC
This blocks OKD e2e tests, PTAL

Comment 2 Vadim Rutkovsky 2019-12-13 10:32:06 UTC
Seems there's a similar issue on OVN - https://github.com/ovn-org/ovn-kubernetes/issues/928 (not sure if the causes are related, though)

Comment 3 Vadim Rutkovsky 2019-12-14 11:43:03 UTC
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1734321#c18, I did the following on the host:

sh-5.0# nft list ruleset
sh-5.0# iptables-nft -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
# Warning: iptables-legacy tables present, use iptables-legacy to see them
sh-5.0# iptables-nft -A INPUT -m comment --comment "ricky test2" -p tcp --destination 1.1.1.1 -j REJECT
sh-5.0# iptables-nft -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
REJECT     tcp  --  anywhere             one.one.one.one      /* ricky test2 */ reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
# Warning: iptables-legacy tables present, use iptables-legacy to see them
sh-5.0# nft list ruleset
table ip filter {
	chain INPUT {
		type filter hook input priority filter; policy accept;
		meta l4proto tcp ip daddr 1.1.1.1  counter packets 0 bytes 0 reject
	}

	chain FORWARD {
		type filter hook forward priority filter; policy accept;
	}

	chain OUTPUT {
		type filter hook output priority filter; policy accept;
	}
}
sh-5.0# rpm -qa '*tables*' '*nft*'
iptables-nft-1.8.3-5.fc31.x86_64
libnftnl-1.1.3-2.fc31.x86_64
iptables-1.8.3-5.fc31.x86_64
iptables-services-1.8.3-5.fc31.x86_64
iptables-libs-1.8.3-5.fc31.x86_64
nftables-0.9.1-3.fc31.x86_64



The same rules look different from inside the SDN container, though:

sh-5.0# crictl ps -a | grep sdn
fc93d2cdc5edc       880ec903e44090affabe0d5aa0898276cc1990e24f87ae261654545ac3f10cc8                                                                     19 minutes ago      Running             sdn                          0                   8ccc56bde980c
sh-5.0# crictl exec -ti fc93d2cdc5edc sh
sh-4.2# nft list ruleset
table ip filter {
	chain INPUT {
		type filter hook input priority 0; policy accept;
		meta l4proto tcp ip daddr 1.1.1.1 counter packets 0 bytes 0
	}

	chain FORWARD {
		type filter hook forward priority 0; policy accept;
	}

	chain OUTPUT {
		type filter hook output priority 0; policy accept;
	}
}
sh-4.2# iptables --version
iptables v1.8.3 (legacy)
sh-4.2# rpm -qa '*tables*' '*nft*'
libnftnl-1.0.8-1.el7.x86_64
nftables-0.8-14.el7.x86_64
iptables-1.4.21-33.el7.x86_64

Ricardo, could you help us with this one?

Comment 4 Casey Callendrello 2019-12-19 13:56:40 UTC
I just validated that nothing is wrong with the iptables rules, and everything works as intended:

sh-4.2# nc 172.30.206.236 9000
Ncat: Connection refused.

which is the correct behavior.

The issue is that agnhost in the test image is outputting the wrong error message for the OKD test runs. No clue why.

Comment 5 Vadim Rutkovsky 2019-12-19 15:17:04 UTC
The test returns REFUSED if it's running with host networking. Perhaps something is blocking ICMP packets on the SDN network?

Comment 6 Russell Teague 2020-02-26 21:04:05 UTC
Any update on this?  Still seeing this happen in CI.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/558

Comment 7 Juan Luis de Sousa-Valadas 2020-03-05 09:50:19 UTC
> Seems there's similar issue on OVN - https://github.com/ovn-org/ovn-kubernetes/issues/928 (not sure if the causes are related though)
We use a different kube-proxy for OVN so the root cause must be different, but thanks a lot for adding this to the bug.
It also looks like https://github.com/kubernetes/kubernetes/pull/72534 but the fix is applied.

I will set up an OKD environment to reproduce this.

Comment 8 Vadim Rutkovsky 2020-03-05 10:07:41 UTC
Note that after OKD switched its default network plugin to OVN, this test no longer fails (it's being skipped, IIRC)

Comment 9 Juan Luis de Sousa-Valadas 2020-03-05 10:12:02 UTC
Do we still intend to support OpenShiftSDN on OKD?

Comment 10 Vadim Rutkovsky 2020-03-05 10:20:11 UTC
We do support OpenShiftSDN; however, we don't run CI tests with it

Comment 11 Juan Luis de Sousa-Valadas 2020-03-05 12:08:42 UTC
// Working notes. Feel free to ignore.

This looks like a kernel/iptables bug:
# cat /etc/redhat-release 
Fedora release 31 (Thirty One)
# rpm -q iptables kernel
iptables-1.8.3-7.fc31.x86_64
kernel-5.4.13-201.fc31.x86_64


sh-5.0# iptables -nvL PREROUTING -t raw 
Chain PREROUTING (policy ACCEPT 28247 packets, 49M bytes)
 pkts bytes target     prot opt in     out     source               destination         
    7   420 TRACE      all  --  *      *       0.0.0.0/0            172.30.240.77       

sh-5.0# dmesg  | grep TRACE | grep ID=2535 --color
[ 2705.724089] TRACE: raw:PREROUTING:policy:2 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724097] TRACE: mangle:PREROUTING:policy:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724103] TRACE: nat:PREROUTING:rule:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724127] TRACE: nat:KUBE-SERVICES:return:116 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724132] TRACE: nat:PREROUTING:rule:2 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724137] TRACE: nat:KUBE-PORTALS-CONTAINER:return:1 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724141] TRACE: nat:PREROUTING:policy:5 IN=tun0 OUT= MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724158] TRACE: mangle:FORWARD:policy:1 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724163] TRACE: filter:FORWARD:rule:1 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724169] TRACE: filter:KUBE-FORWARD:return:5 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724174] TRACE: filter:FORWARD:rule:2 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 
[ 2705.724179] TRACE: filter:KUBE-SERVICES:rule:2 IN=tun0 OUT=tun0 MAC=be:f7:82:cf:55:62:0a:58:0a:81:02:09:08:00 SRC=10.129.2.9 DST=172.30.240.77 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=2535 DF PROTO=TCP SPT=48300 DPT=8080 SEQ=256133300 ACK=0 WINDOW=62377 RES=0x00 SYN URGP=0 OPT (020422CF0402080ABE7132EB0000000001030307) 

sh-5.0# iptables -nvL KUBE-SERVICES
Chain KUBE-SERVICES (3 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.240.77        /* valadas/hello-openshift:8888-tcp has no endpoints */ tcp dpt:8888 reject-with icmp-port-unreachable
    1    60 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.240.77        /* valadas/hello-openshift:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable


Interestingly, we see the REJECT. The rule is there, but the application isn't getting the reply; checking a tcpdump of the node's tun0, it looks like the REJECT is matched but the ICMP response is never sent.
$ tshark -r /tmp/host.pcap -Y 'ip.addr == 10.129.2.9'
   51   3.702907 0.636620 11:59:01.760554   10.129.2.9 → 172.30.240.77 TCP 74 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197234432 TSecr=0 WS=128
   67   4.705022 0.118114 11:59:02.762669   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197235435 TSecr=0 WS=128
  120   6.753018 0.103326 11:59:04.810665   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197237483 TSecr=0 WS=128
  439  10.784990 0.059573 11:59:08.842637   10.129.2.9 → 172.30.240.77 TCP 74 [TCP Retransmission] 59792 → 8080 [SYN] Seq=0 Win=62377 Len=0 MSS=8911 SACK_PERM=1 TSval=3197241515 TSecr=0 WS=128

I will try to isolate the problem by just using a regular fedora vm so that our kernel guys don't need to deal with the whole SDN complexity.

Comment 15 Juan Luis de Sousa-Valadas 2020-04-29 20:01:59 UTC
A minor update on this: it's happening because the kernel rate-limits the ICMP messages it sends (the port-unreachable generated by REJECT is one of them). I'm still trying to figure out where these come from and why this happens on RHEL.
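The rate limiting described here is governed by two sysctls: net.ipv4.icmp_ratelimit (the minimum interval in ms between limited replies, default 1000) and net.ipv4.icmp_ratemask (a bitmask of ICMP types the limit applies to, default 6168). A small sketch decoding the default mask (type numbers per RFC 792; the port-unreachable sent by REJECT is a code of Destination Unreachable, type 3):

```shell
# Decode the default icmp_ratemask (6168 = 0x1818) into the ICMP types it covers.
mask=6168
for t in 3 4 11 12; do
  if [ $(( (mask >> t) & 1 )) -eq 1 ]; then
    echo "ICMP type $t is rate limited"
  fi
done
```

On the node these values can be read directly from /proc/sys/net/ipv4/icmp_ratelimit and /proc/sys/net/ipv4/icmp_ratemask.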

Comment 16 Juan Luis de Sousa-Valadas 2020-04-30 10:36:29 UTC
Moving temporarily to kernel engineering. This is unlikely to be a kernel bug, but I need some help from kernel engineering.

The problem we have is on OpenShift, which has multiple network namespaces, hundreds of iptables rules, and hundreds of OVS flows. With the default ipv4 icmp_ratelimit, I observe that iptables REJECTs aren't working (the ICMP replies are dropped) the vast majority of the time (>95%).

If I set /proc/sys/net/ipv4/icmp_ratelimit to 0, the REJECTs start working again. I need to figure out why the rate limit is being reached, so I tried to sniff the ICMP traffic on every NIC in every netns while the rate limit was set to 0:

# for i in $(lsns | cut -c -90 | grep net | awk '{print $4}'); do nsenter -n -t $i tcpdump -i any icmp & done

But after 15 minutes I didn't see a single ICMP packet except the ones I deliberately created by hitting the REJECT rule.

So what I need to figure out is where the ICMP traffic that provokes the rate limiting comes from.

Also, I'm setting this to RHEL 8 because it's newer, but it also happens on RHEL 7.

Comment 17 Vadim Rutkovsky 2020-04-30 11:21:25 UTC
This test also fails on FedoraCoreOS 31 with kernel 5.5 - how's that an RHCOS kernel bug?

Comment 18 Juan Luis de Sousa-Valadas 2020-04-30 11:39:12 UTC
It's very unlikely that there's a problem at the kernel level; I just need help from a kernel engineer to understand where the ICMP traffic is coming from.

Comment 19 Ben Bennett 2020-05-01 14:17:46 UTC
Happening on rhel8 too -- https://bugzilla.redhat.com/show_bug.cgi?id=1829241

Comment 20 Ben Bennett 2020-05-01 15:00:38 UTC
Thanks for noticing the rate limiting @Juan

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt search for icmp_ratemask and icmp_ratelimit

We are rate limiting Destination Unreachable based on the default kernel ratemask, and the rate limit is one response every 1000 ms (1/sec), which seems low.

We need to decide whether to remove Destination Unreachable from the ratemask, or to drop the rate-limit interval to something like 100 ms.
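To make the two options concrete, here is a sketch of the corresponding sysctl values (derived from the defaults documented in ip-sysctl.txt; the `sysctl -w` lines are shown as comments since they require root on the node):

```shell
# Option A: remove Destination Unreachable (ICMP type 3) from the ratemask.
default_mask=6168                          # kernel default: covers ICMP types 3, 4, 11, 12
new_mask=$(( default_mask & ~(1 << 3) ))   # clear bit 3
echo "$new_mask"                           # 6160
# Apply with: sysctl -w net.ipv4.icmp_ratemask=6160

# Option B: keep the mask but shorten the interval from 1000 ms to 100 ms:
# sysctl -w net.ipv4.icmp_ratelimit=100
```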

Comment 21 Colin Walters 2020-05-01 17:02:07 UTC
*** Bug 1829080 has been marked as a duplicate of this bug. ***

Comment 22 Colin Walters 2020-05-01 17:02:49 UTC
*** Bug 1829583 has been marked as a duplicate of this bug. ***

Comment 23 Colin Walters 2020-05-01 17:06:19 UTC
OK this is blocking the release of RHEL 8.2 and really all OS level changes for RHCOS right now.
See e.g. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10379

So we need to do *something* here - which, based on my understanding right now, might be simply marking this test as flaky, right?

Comment 24 Dan Williams 2020-05-01 17:22:59 UTC
Given that things appear to be working normally, and it's just that for some reason we now need to adjust kernel tunables for ICMP, I'm moving this back to OpenShift. We should certainly figure out why this wasn't happening as much before and is being hit hard now.

Comment 25 Juan Luis de Sousa-Valadas 2020-05-01 19:25:49 UTC
Ben, I agree that limiting to just one packet per second is probably excessive; I'm still concerned about not knowing WHY this happens.
I opened a PR against openshift-sdn to remove Destination Unreachable from the ratemask.

Comment 26 W. Trevor King 2020-05-02 03:53:36 UTC
*** Bug 1829241 has been marked as a duplicate of this bug. ***

Comment 28 Colin Walters 2020-05-04 12:42:09 UTC
Yeah I'm just going to unilaterally remove the private flag.  We use that way too much.

Comment 29 Dan Winship 2020-05-04 13:36:02 UTC
(In reply to Dan Williams from comment #24)
> We should certainly figure out why this wasn't happening as much before and
> is being hit hard now.

Presumably some OCP component has started frequently trying to access some unreachable destination.

Comment 30 W. Trevor King 2020-05-04 19:55:17 UTC
Example failure from [1,2]:

Apr 24 21:25:49.619: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-d982qc1l-8c7ff.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-services-7069 execpod-noendpoints982dr -- /bin/sh -x -c /agnhost connect --timeout=3s no-pods:80'
Apr 24 21:25:53.290: INFO: rc: 1
Apr 24 21:25:53.290: INFO: error didn't contain 'REFUSED', keep trying: error running /usr/bin/kubectl --server=https://api.ci-op-d982qc1l-8c7ff.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-services-7069 execpod-noendpoints982dr -- /bin/sh -x -c /agnhost connect --timeout=3s no-pods:80:
Command stdout:

stderr:
+ /agnhost connect --timeout=3s no-pods:80
TIMEOUT
command terminated with exit code 1

error:
exit status 1
...
Apr 24 21:25:53.743: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/network/service.go:2621]: Unexpected error:
    <*errors.errorString | 0xc000208960>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
...
failed: (48.6s) 2020-04-24T21:25:53 "[sig-network] Services should be rejected when no endpoints exist [Skipped:Network/OVNKubernetes] [Skipped:ibmcloud] [Suite:openshift/conformance/parallel] [Suite:k8s]"


[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10242
[2]: https://storage.googleapis.com/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10242/build-log.txt

Comment 31 W. Trevor King 2020-05-04 19:57:14 UTC
Dropping a bunch of bracketed test metadata from the bug summary now that we have it in a comment.  Tooling like CI search and Sippy should be able to find this bug from the public comment, without us having to put the distracting noise in the bug summary.

Comment 32 Scott Dodson 2020-05-05 14:23:03 UTC
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1831684 via Sippy, so I'm not sure whether Sippy was delayed or whether it doesn't pick up bugs after they've been marked public. That bug can likely be duped, but it's also clear that this isn't scoped only to clusters with RHEL workers, so I'm leaving it as is for now. This appears to be blocking all merges into 4.4.

Comment 33 Colin Walters 2020-05-05 16:02:30 UTC
The linked PR is closed, and I don't believe that we should stall all OpenShift releases until this test passes.  Is there a next step and who owns it?  I'd vote to skip the test for now.

Comment 35 Colin Walters 2020-05-07 00:21:18 UTC
OK so after a digression re-spinning up my libvirt development setup, I can clearly pin this on a kernel change from 8.1 to 8.2.

Scenario:

Spun up a 4.3.17 (RHCOS 8.1) cluster in libvirt, and ran:

$ ./_output/local/bin/linux/amd64/openshift-tests run openshift/conformance/parallel --run 'Services should be rejected'

many times - all passed.

Then used `oc debug node/` on a random worker, and downloaded the 8.2 kernel and switched to it:

# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ab14709af5c410dd36ebeda30809c791ef08328af2aaa87b23372e39c17af44d
              CustomOrigin: Managed by machine-config-operator
                   Version: 43.81.202004211653.0 (2020-04-21T16:58:52Z)
      ReplacedBasePackages: linux-firmware 20190516-94.git711d3297.el8 -> 20191202-97.gite8a0f4c9.el8, kernel-modules kernel-modules-extra kernel kernel-core 4.18.0-147.8.1.el8_1 -> 4.18.0-193.el8

I used `oc adm cordon` on the other workers to ensure the test always ran on the node with my updated kernel.

And now running the test consistently fails.  And just to re-verify, using `rpm-ostree reset -ol --reboot` and the test starts passing again.

Comment 36 Colin Walters 2020-05-07 00:33:44 UTC
Now looking at the kernel logs for changes since 8.1/8.2:

walters@toolbox ~/s/d/r/kernel> git log 11e3daec92f2a14a435eaedf05f855d14ff86739.. |grep icmp
    - [net] icmp: fix data-race in cmp_global_allow() (Guillaume Nault) [1801587]
    - [net] ipv4/icmp: fix rt dst dev null pointer dereference (Guillaume Nault) [1765639]
    - [net] bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER (Hangbin Liu) [1756799]

Comment 37 Colin Walters 2020-05-07 02:01:21 UTC
OK, I started with 4.18.0-152.el8 as a bisection target based on the above, and this error still reproduces with it. In fact, the test fails more or less instantly, consistently.

I then went back to 4.18.0-151.el8 and got one failure, but the test started consistently succeeding after that.

So I think that points to changes from https://bugzilla.redhat.com/show_bug.cgi?id=1765639

Comment 38 mkumatag 2020-05-07 06:14:25 UTC
(In reply to Colin Walters from comment #35)
> OK so after a digression re-spinning up my libvirt development setup, I can
> clearly pin this on a kernel change from 8.1 to 8.2.
> 
> Scenario:
> 
> Spun up a 4.3.17 (RHCOS 8.1) cluster in libvirt, and ran:

I completely agree; on the ppc64le platform we started seeing this issue after the 4.3.18 release (that is when the actual switch from the 8.1 to the 8.2 kernel happened)
> 
> $ ./_output/local/bin/linux/amd64/openshift-tests run
> openshift/conformance/parallel --run 'Services should be rejected'
> 
> many times - all passed.
> 
> Then used `oc debug node/` on a random worker, and downloaded the 8.2 kernel
> and switched to it:
> 
> # rpm-ostree status
> State: idle
> AutomaticUpdates: disabled
> Deployments:
> *
> pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> ab14709af5c410dd36ebeda30809c791ef08328af2aaa87b23372e39c17af44d
>               CustomOrigin: Managed by machine-config-operator
>                    Version: 43.81.202004211653.0 (2020-04-21T16:58:52Z)
>       ReplacedBasePackages: linux-firmware 20190516-94.git711d3297.el8 ->
> 20191202-97.gite8a0f4c9.el8, kernel-modules kernel-modules-extra kernel
> kernel-core 4.18.0-147.8.1.el8_1 -> 4.18.0-193.el8
> 
> I used `oc adm cordon` on the other workers to ensure the test always ran on
> the node with my updated kernel.
> 
> And now running the test consistently fails.  And just to re-verify, using
> `rpm-ostree reset -ol --reboot` and the test starts passing again.

Comment 39 Colin Walters 2020-05-07 12:30:26 UTC
One thing that hints at this being an upstream change is the original comment that it's consistently failing on Fedora kernels. The fact that it's been reproducing on RHEL7 would seem to rule that out, but possibly it (or something similar) was backported there earlier?

I did find https://bugzilla.redhat.com/show_bug.cgi?id=1461282 which is another example of ICMP ratelimiting changes breaking tests.

Comment 40 Colin Walters 2020-05-07 17:28:02 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1832332

Comment 41 Jesus M. Rodriguez 2020-05-07 21:08:17 UTC
*** Bug 1833136 has been marked as a duplicate of this bug. ***

Comment 42 Christian Glombek 2020-05-08 02:04:58 UTC
As Colin said, and as Vadim reported here, this was reproducible 100% of the time with OKD on Fedora/FCOS 31 with kernel 5.x.
Because of this issue we switched OKD over from OpenShift SDN to OVN, where it has worked reliably without this failure since.

Comment 44 Colin Walters 2020-05-13 19:53:10 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1832332#c14 for the latest status on this.

We'll leave this bug open to track shipping that updated kernel fix in RHCOS/OCP.

Comment 45 Colin Walters 2020-05-14 20:37:20 UTC
I want to outline some options here from my PoV, and they aren't mutually exclusive:

- Get patched kernel attached to 8.2.z errata and ship it before 8.2.2 ships (RHEL kernel team + stakeholders agreeing to ship in OpenShift before errata ships)
- Chase down whatever in the SDN or test framework is causing the ICMP redirects (OpenShift SDN team)
- Ship 8.2.0 as is and temporarily disable the test https://github.com/openshift/origin/pull/24980/  (OpenShift architects decision)

Comment 49 Aniket Bhat 2020-09-08 18:44:45 UTC
@Micah: how can we verify if the 8.2.z fix for this bug is in the latest RHCOS?

Comment 53 Juan Luis de Sousa-Valadas 2020-09-10 13:12:02 UTC
Removing OKDBlocker; this affects OCP as well, so it's unreasonable to block OKD for this reason. Also, the test is skipped.

Comment 54 Micah Abbott 2020-09-16 20:19:48 UTC
(In reply to Aniket Bhat from comment #49)
> @Micah: how can we verify if the 8.2.z fix for this bug is in the latest
> RHCOS?

The underlying kernel issue appears to be BZ#1836302

From there, the BZ shows that the issue is fixed in kernel-4.18.0-193.7.1.el8_2

To check whether that kernel (or newer) is in the latest RHCOS 4.6, use the release browser to show the OS Contents:

For example, the latest RHCOS 4.6 build - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6&release=46.82.202009161140-0#46.82.202009161140-0

And the OS Contents for that build - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/contents.html?stream=releases%2Frhcos-4.6&release=46.82.202009161140-0

Which shows that it contains kernel-4.18.0-193.19.1.el8_2

So it should be safe to test with the latest RHCOS 4.6 builds to see if this issue has been properly fixed.
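A quick sketch of that version check, assuming you already have the node's `uname -r` output in hand (the `running` value is hardcoded here as an example; `sort -V` approximates rpm's version ordering and is good enough for these kernel NVRs, but it is not a replacement for rpm's own comparison):

```shell
# Check whether the running kernel is at or past the fixed build.
fixed="4.18.0-193.7.1.el8_2"
running="4.18.0-193.19.1.el8_2"   # in practice: running=$(uname -r) on the node
oldest=$(printf '%s\n%s\n' "$fixed" "$running" | sort -V | head -n1)
if [ "$oldest" = "$fixed" ]; then
  echo "fix present"
else
  echo "kernel too old"
fi
```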

Comment 57 Juan Luis de Sousa-Valadas 2020-10-13 09:02:02 UTC
There are no open cases and it's fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1832332
It's likely still broken in 4.4, but since nobody seems to care, I'm closing this

Comment 58 Tobias Florek 2020-11-30 11:31:01 UTC
Aww.  It's unfortunate that https://bugzilla.redhat.com/show_bug.cgi?id=1832332 is classified.  Any reason for it?

Comment 59 Dan Williams 2020-11-30 15:04:43 UTC
(In reply to Tobias Florek from comment #58)
> Aww.  It's unfortunate, that
> https://bugzilla.redhat.com/show_bug.cgi?id=1832332 is classified.  Any
> reason for it?

The TLDR is that https://lore.kernel.org/netdev/7f71c9a7ba0d514c9f2d006f4797b044c824ae84.1588954755.git.pabeni@redhat.com/T/#u fixes the problem.

