Bug 1825255 - Long timeout when connecting to services without endpoints
Summary: Long timeout when connecting to services without endpoints
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Juan Luis de Sousa-Valadas
QA Contact: zhaozhanqi
URL:
Whiteboard: SDN-CUST-IMPACT SDN-CI-IMPACT SDN-BP
Depends On: 1781575 1832332 1834184
Blocks:
 
Reported: 2020-04-17 13:45 UTC by Sergio G.
Modified: 2021-01-07 14:51 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-07 14:20:00 UTC
Target Upstream Version:


Attachments
reproducer for 4.2 including information about the cluster and iptables (94.90 KB, text/plain)
2020-04-17 13:46 UTC, Sergio G.
reproducer for 4.3 including information about the cluster and iptables (103.18 KB, text/plain)
2020-04-17 13:46 UTC, Sergio G.

Description Sergio G. 2020-04-17 13:45:34 UTC
Description of problem:
Connecting to a service that has no endpoints takes two minutes to fail with a timeout, instead of failing immediately.


Version-Release number of selected component (if applicable):
Reproduced in: 
 - 4.2.21
 - 4.3.12


How reproducible:


Steps to Reproduce:
1. Create a new project and a service on it:
 $ oc new-project test-timeout
 $ cat <<EOF | oc create -f -
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: httpd
EOF

2. Run a test pod and curl the service from it (adjust the pod name to match the one actually created):
 $ oc run sleep --image=rhel7/rhel-tools --command -- sleep 3600
 $ oc exec sleep-1-wbpsj -- curl http://service:8080
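
For reference, a quick way to time the failure from the test pod (a sketch only; the pod name comes from the oc run output above and will differ in each cluster):
 $ time oc exec sleep-1-wbpsj -- curl -sS http://service:8080
With the bug present this returns after roughly two minutes with a connection timeout; with the expected behaviour it fails immediately with connection refused.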


Actual results:
The curl takes two minutes before timing out.


Expected results:
The curl should fail immediately with a "connection refused" error.
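
For completeness, a sketch of how to confirm from a node that traffic to this service is rejected rather than silently dropped (this assumes the usual iptables-mode kube-proxy rules; chain names and rule comments may differ between releases, and you may need to chroot /host in the debug shell first):
 $ oc debug node/<node name>
 # iptables-save | grep -i 'no endpoints'
If kube-proxy behaves as documented, this lists a REJECT rule for the service's cluster IP with --reject-with icmp-port-unreachable, which a client normally sees as an immediate connection refused.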

Comment 1 Sergio G. 2020-04-17 13:46:40 UTC
Created attachment 1679669 [details]
reproducer for 4.2 including information about the cluster and iptables

Comment 2 Sergio G. 2020-04-17 13:46:58 UTC
Created attachment 1679670 [details]
reproducer for 4.3 including information about the cluster and iptables

Comment 3 Sergio G. 2020-04-17 13:53:32 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1782857, but according to that bug this should already be fixed.

Comment 8 zhaozhanqi 2020-04-27 12:39:31 UTC
Hi Juan, this cluster will be destroyed automatically in 2 days, so if you find it can no longer be used you can ask weliang or anusaxen for help.

Comment 10 Juan Luis de Sousa-Valadas 2020-04-27 14:26:58 UTC
Andre, I think I found the issue: we're hitting the ICMP rate limit.

Can you ask the customer to try the following, to confirm whether this is *exactly* the same issue they are seeing?
1- oc debug node/<node name>
On that terminal:
2- Keep the output of: cat /proc/sys/net/ipv4/icmp_ratemask
3- Temporarily disable the ICMP rate limit: echo 0 > /proc/sys/net/ipv4/icmp_ratemask
4- On a different terminal: verify whether the issue still happens, or whether it at least now happens only intermittently.
5- Back on the terminal from step 1, unless keeping the limit disabled has a business impact for them, restore the original value: echo <value from step 2> > /proc/sys/net/ipv4/icmp_ratemask
(A consolidated sketch of this session is included after the next paragraph.)

The reason to restore the rate limit is that we currently don't know why there are so many ICMP packets, or how many there are. Therefore, if the limit is removed permanently on every node, I cannot guarantee it won't cause problems in the network; if it is removed only for a short period on a single node, it shouldn't be noticeable.
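
For convenience, the steps above as a single session sketch (the node name is a placeholder, and the value to restore is whatever the cat in step 2 printed; 6168 is the usual kernel default):
 $ oc debug node/<node name>
 # cat /proc/sys/net/ipv4/icmp_ratemask              (note this value, e.g. 6168)
 # echo 0 > /proc/sys/net/ipv4/icmp_ratemask         (temporarily disable ICMP rate limiting)
   ... re-run the curl test from another terminal ...
 # echo <value noted above> > /proc/sys/net/ipv4/icmp_ratemask   (restore the original mask)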

Comment 12 Ben Bennett 2020-05-08 20:33:58 UTC
@Juan -- can we see if we can either change the rate limit, or change the rate mask so that the ICMP replies reporting the refused connection are not rate limited?
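
For reference, a sketch of what narrowing the mask (rather than zeroing it) could look like: icmp_ratemask is a bitmask of the ICMP types that are rate limited, and "destination unreachable" is ICMP type 3, so clearing bit 3 would exempt exactly the replies involved here while leaving the other types limited. The 6168 below is only the documented kernel default, not a value confirmed on these nodes:
 $ cat /proc/sys/net/ipv4/icmp_ratemask              (prints 6168 on a default kernel)
 $ echo $(( 6168 & ~(1 << 3) )) > /proc/sys/net/ipv4/icmp_ratemask   (writes 6160)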

Comment 14 Juan Luis de Sousa-Valadas 2020-09-07 14:20:00 UTC
The case is closed, so this probably doesn't need a backport.
Newer releases ship newer kernels that don't have this issue. Reopen this if a customer actually wants a backport.

