Bug 1825255 - Long timeout when connecting to services without endpoints
Summary: Long timeout when connecting to services without endpoints
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Juan Luis de Sousa-Valadas
QA Contact: zhaozhanqi
URL:
Whiteboard: SDN-CUST-IMPACT SDN-CI-IMPACT SDN-BP
Depends On: 1781575 1832332 1834184
Blocks:
 
Reported: 2020-04-17 13:45 UTC by Sergio G.
Modified: 2021-01-07 14:51 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-07 14:20:00 UTC
Target Upstream Version:


Attachments
reproducer for 4.2 including information about the cluster and iptables (94.90 KB, text/plain)
2020-04-17 13:46 UTC, Sergio G.
reproducer for 4.3 including information about the cluster and iptables (103.18 KB, text/plain)
2020-04-17 13:46 UTC, Sergio G.

Description Sergio G. 2020-04-17 13:45:34 UTC
Description of problem:
Connecting to a service that has no endpoints takes two minutes to fail with a timeout, instead of failing immediately.


Version-Release number of selected component (if applicable):
Reproduced in: 
 - 4.2.21
 - 4.3.12


How reproducible:


Steps to Reproduce:
1. Create a new project and a service on it:
 $ oc new-project test-timeout
 $ cat <<EOF | oc create -f -
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: httpd
EOF

2. Run a test pod and curl the service from it (adjust the pod name to match the one actually created):
 $ oc run sleep --image=rhel7/rhel-tools --command -- sleep 3600
 $ oc exec sleep-1-wbpsj -- curl http://service:8080
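
For reference, a quick way to time the failure from the test pod (a sketch only; the pod name comes from the oc run output above and will differ in each cluster):
 $ time oc exec sleep-1-wbpsj -- curl -sS http://service:8080
With the bug present this returns after roughly two minutes with a connection timeout; with the expected behaviour it fails immediately with connection refused.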


Actual results:
The curl takes two minutes before timing out.


Expected results:
The curl should fail immediately with a "connection refused" error.
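
For completeness, a sketch of how to confirm from a node that traffic to this service is rejected rather than silently dropped (this assumes the usual iptables-mode kube-proxy rules; chain names and rule comments may differ between releases, and you may need to chroot /host in the debug shell first):
 $ oc debug node/<node name>
 # iptables-save | grep -i 'no endpoints'
If kube-proxy behaves as documented, this lists a REJECT rule for the service's cluster IP with --reject-with icmp-port-unreachable, which a client normally sees as an immediate connection refused.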

Comment 1 Sergio G. 2020-04-17 13:46:40 UTC
Created attachment 1679669 [details]
reproducer for 4.2 including information about the cluster and iptables

Comment 2 Sergio G. 2020-04-17 13:46:58 UTC
Created attachment 1679670 [details]
reproducer for 4.3 including information about the cluster and iptables

Comment 3 Sergio G. 2020-04-17 13:53:32 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1782857, but according to that bug this should already be fixed.

Comment 8 zhaozhanqi 2020-04-27 12:39:31 UTC
Hi Juan, this cluster will be destroyed automatically in 2 days, so if you find it can no longer be used you can ask weliang or anusaxen for help.

Comment 10 Juan Luis de Sousa-Valadas 2020-04-27 14:26:58 UTC
Andre, I think I found the issue: we're hitting the ICMP rate limit.

Can you ask the customer to try the following, to confirm whether this is *exactly* the same issue they are seeing?
1- oc debug node/<node name>
On that terminal:
2- Keep the output of: cat /proc/sys/net/ipv4/icmp_ratemask
3- Temporarily disable the ICMP rate limit: echo 0 > /proc/sys/net/ipv4/icmp_ratemask
4- On a different terminal: verify whether the issue still happens, or whether it at least now happens only intermittently.
5- Back on the terminal from step 1, unless keeping the limit disabled has a business impact for them, restore the original value: echo <value from step 2> > /proc/sys/net/ipv4/icmp_ratemask
(A consolidated sketch of this session is included after the next paragraph.)

The reason to restore the rate limit is that we currently don't know why there are so many ICMP packets, or how many there are. Therefore, if the limit is removed permanently on every node, I cannot guarantee it won't cause problems in the network; if it is removed only for a short period on a single node, it shouldn't be noticeable.
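
For convenience, the steps above as a single session sketch (the node name is a placeholder, and the value to restore is whatever the cat in step 2 printed; 6168 is the usual kernel default):
 $ oc debug node/<node name>
 # cat /proc/sys/net/ipv4/icmp_ratemask              (note this value, e.g. 6168)
 # echo 0 > /proc/sys/net/ipv4/icmp_ratemask         (temporarily disable ICMP rate limiting)
   ... re-run the curl test from another terminal ...
 # echo <value noted above> > /proc/sys/net/ipv4/icmp_ratemask   (restore the original mask)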

Comment 12 Ben Bennett 2020-05-08 20:33:58 UTC
@Juan -- can we see if we can either change the rate limit, or change the rate mask so that the ICMP replies reporting the refused connection are not rate limited?
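
For reference, a sketch of what narrowing the mask (rather than zeroing it) could look like: icmp_ratemask is a bitmask of the ICMP types that are rate limited, and "destination unreachable" is ICMP type 3, so clearing bit 3 would exempt exactly the replies involved here while leaving the other types limited. The 6168 below is only the documented kernel default, not a value confirmed on these nodes:
 $ cat /proc/sys/net/ipv4/icmp_ratemask              (prints 6168 on a default kernel)
 $ echo $(( 6168 & ~(1 << 3) )) > /proc/sys/net/ipv4/icmp_ratemask   (writes 6160)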

Comment 14 Juan Luis de Sousa-Valadas 2020-09-07 14:20:00 UTC
The case is closed, so this probably doesn't need a backport.
Newer releases ship newer kernels that don't have this issue. Reopen this if a customer actually wants a backport.

