Bug 1825255

Summary: Long timeout when connecting to services without endpoints
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Reporter: Sergio G. <sgarciam>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: zhaozhanqi <zzhao>
Docs Contact:
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
CC: andcosta, bbennett, jdesousa, zzhao
Version: 4.3.z
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CUST-IMPACT SDN-CI-IMPACT SDN-BP
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Last Closed: 2020-09-07 14:20:00 UTC
Type: Bug
Regression: ---
Bug Depends On: 1781575, 1832332, 1834184    
Bug Blocks:    
Attachments:
 - reproducer for 4.2 including information about the cluster and iptables (flags: none)
 - reproducer for 4.3 including information about the cluster and iptables (flags: none)

Description Sergio G. 2020-04-17 13:45:34 UTC
Description of problem:
Connecting to a service without endpoints takes two minutes to fail with a timeout.


Version-Release number of selected component (if applicable):
Reproduced in: 
 - 4.2.21
 - 4.3.12


How reproducible:


Steps to Reproduce:
1. Create a new project and a service in it:
 $ oc new-project test-timeout
 $ cat <<EOF | oc create -f -
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: httpd
EOF

2. Run a test pod and curl the service from it (an optional timing check is sketched below):
 $ oc run sleep --image=rhel7/rhel-tools --command -- sleep 3600
 $ oc exec sleep-1-wbpsj -- curl http://service:8080
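
 To confirm the setup and measure the hang, an optional check (the pod name is taken from the example above; yours will differ):
 # the service should have no endpoints
 $ oc get endpoints service
 # time the request to see the ~2 minute hang
 $ time oc exec sleep-1-wbpsj -- curl -sS http://service:8080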


Actual results:
The curl takes two minutes before timing out.


Expected results:
The curl should fail immediately with a connection refused error.

Comment 1 Sergio G. 2020-04-17 13:46:40 UTC
Created attachment 1679669 [details]
reproducer for 4.2 including information about the cluster and iptables

Comment 2 Sergio G. 2020-04-17 13:46:58 UTC
Created attachment 1679670 [details]
reproducer for 4.3 including information about the cluster and iptables

Comment 3 Sergio G. 2020-04-17 13:53:32 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1782857, but according to that bug this should already be fixed.

Comment 8 zhaozhanqi 2020-04-27 12:39:31 UTC
Hi Juan, this cluster will be destroyed automatically in 2 days, so if you find it can no longer be used, you can ask weliang or anusaxen for help.

Comment 10 Juan Luis de Sousa-Valadas 2020-04-27 14:26:58 UTC
Andre, I think I found the issue: we're hitting the ICMP rate limit.

Can you ask the customer to try the following and confirm whether it is *exactly* the same issue they are seeing?
1- oc debug node/<node name>
On that terminal:
2- Keep the output of: cat /proc/sys/net/ipv4/icmp_ratemask
3- Temporarily disable the ICMP rate limit: echo 0 > /proc/sys/net/ipv4/icmp_ratemask
4- On a different terminal, verify whether the issue still happens, or whether it now happens only intermittently.
5- Unless keeping the limit disabled has a business impact for them, restore the original value on the terminal from step 1: echo <value from step 2> > /proc/sys/net/ipv4/icmp_ratemask

The reason to restore the rate limit is that we currently don't know why there are so many ICMP packets or how many there are. Therefore, if the limit is removed permanently on every node, I cannot guarantee that it won't cause problems in the network; if it's only for a short period of time on just one node, it shouldn't be noticeable.
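
A minimal consolidated sketch of steps 1-5 (the node name is a placeholder, and OLD_MASK is just an illustrative shell variable holding the value kept in step 2):

 $ oc debug node/<node name>
 # inside the debug shell:
 OLD_MASK=$(cat /proc/sys/net/ipv4/icmp_ratemask)       # step 2: keep the current value
 echo 0 > /proc/sys/net/ipv4/icmp_ratemask              # step 3: temporarily disable the limit
 # step 4: re-run the curl reproducer from a different terminal
 echo "$OLD_MASK" > /proc/sys/net/ipv4/icmp_ratemask    # step 5: restore the saved value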

Comment 12 Ben Bennett 2020-05-08 20:33:58 UTC
@Juan -- can we see if we can either change the rate limit, or change the rate mask so that the ICMP messages reporting that there is no connection are allowed through?
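
For reference, a sketch of the two knobs being discussed (run on a node; the bit-3 arithmetic assumes the kernel's documented icmp_ratemask layout, where bit 3 is destination unreachable -- values are illustrative, not a recommendation):

 # show the current settings
 $ sysctl net.ipv4.icmp_ratelimit net.ipv4.icmp_ratemask
 # option A: set the per-type rate limit to 0, which disables limiting entirely
 $ sysctl -w net.ipv4.icmp_ratelimit=0
 # option B: clear bit 3 (destination unreachable) from the rate mask
 $ sysctl -w net.ipv4.icmp_ratemask=$(( $(sysctl -n net.ipv4.icmp_ratemask) & ~(1<<3) ))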

Comment 14 Juan Luis de Sousa-Valadas 2020-09-07 14:20:00 UTC
The case is closed, so this probably doesn't need a backport.
Newer releases with newer kernels don't have this issue; reopen this bug if a customer actually wants a backport.