Bug 1825255

Summary: Long timeout when connecting to services without endpoints
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Reporter: Sergio G. <sgarciam>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: zhaozhanqi <zzhao>
Docs Contact:
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
CC: andcosta, bbennett, jdesousa, zzhao
Version: 4.3.z
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CUST-IMPACT SDN-CI-IMPACT SDN-BP
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Last Closed: 2020-09-07 14:20:00 UTC
Type: Bug
Regression: ---
Bug Depends On: 1781575, 1832332, 1834184    
Bug Blocks:    
Attachments:
 - reproducer for 4.2 including information about the cluster and iptables (flags: none)
 - reproducer for 4.3 including information about the cluster and iptables (flags: none)

Description Sergio G. 2020-04-17 13:45:34 UTC
Description of problem:
Connecting to a service without endpoints takes two minutes to fail with a timeout.


Version-Release number of selected component (if applicable):
Reproduced in: 
 - 4.2.21
 - 4.3.12


How reproducible:


Steps to Reproduce:
1. Create a new project and a service in it:
 $ oc new-project test-timeout
 $ cat <<EOF | oc create -f -
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: httpd
EOF

2. Run a test pod and curl the service from it (an optional timing check is sketched below):
 $ oc run sleep --image=rhel7/rhel-tools --command -- sleep 3600
 $ oc exec sleep-1-wbpsj -- curl http://service:8080
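
 To confirm the setup and measure the hang, an optional check (the pod name is taken from the example above; yours will differ):
 # the service should have no endpoints
 $ oc get endpoints service
 # time the request to see the ~2 minute hang
 $ time oc exec sleep-1-wbpsj -- curl -sS http://service:8080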


Actual results:
The curl takes two minutes before timing out.


Expected results:
The curl should fail immediately with a connection refused error.

Comment 1 Sergio G. 2020-04-17 13:46:40 UTC
Created attachment 1679669 [details]
reproducer for 4.2 including information about the cluster and iptables

Comment 2 Sergio G. 2020-04-17 13:46:58 UTC
Created attachment 1679670 [details]
reproducer for 4.3 including information about the cluster and iptables

Comment 3 Sergio G. 2020-04-17 13:53:32 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1782857, but according to that bug this should already be fixed.

Comment 8 zhaozhanqi 2020-04-27 12:39:31 UTC
Hi Juan, this cluster will be destroyed automatically in 2 days, so if you find it can no longer be used, you can ask weliang or anusaxen for help.

Comment 10 Juan Luis de Sousa-Valadas 2020-04-27 14:26:58 UTC
Andre, I think I found the issue: we're hitting the ICMP rate limit.

Can you ask the customer to try the following and confirm whether it is *exactly* the same issue they are seeing?
1- oc debug node/<node name>
On that terminal:
2- Keep the output of: cat /proc/sys/net/ipv4/icmp_ratemask
3- Temporarily disable the ICMP rate limit: echo 0 > /proc/sys/net/ipv4/icmp_ratemask
4- On a different terminal, verify whether the issue still happens, or whether it now happens only intermittently.
5- Unless keeping the limit disabled has a business impact for them, restore the original value on the terminal from step 1: echo <value from step 2> > /proc/sys/net/ipv4/icmp_ratemask

The reason to restore the rate limit is that we currently don't know why there are so many ICMP packets or how many there are. Therefore, if the limit is removed permanently on every node, I cannot guarantee that it won't cause problems in the network; if it's only for a short period of time on just one node, it shouldn't be noticeable.
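
A minimal consolidated sketch of steps 1-5 (the node name is a placeholder, and OLD_MASK is just an illustrative shell variable holding the value kept in step 2):

 $ oc debug node/<node name>
 # inside the debug shell:
 OLD_MASK=$(cat /proc/sys/net/ipv4/icmp_ratemask)       # step 2: keep the current value
 echo 0 > /proc/sys/net/ipv4/icmp_ratemask              # step 3: temporarily disable the limit
 # step 4: re-run the curl reproducer from a different terminal
 echo "$OLD_MASK" > /proc/sys/net/ipv4/icmp_ratemask    # step 5: restore the saved value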

Comment 12 Ben Bennett 2020-05-08 20:33:58 UTC
@Juan -- can we see if we can either change the rate limit, or change the rate mask so that the ICMP messages reporting that there is no connection are allowed through?
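
For reference, a sketch of the two knobs being discussed (run on a node; the bit-3 arithmetic assumes the kernel's documented icmp_ratemask layout, where bit 3 is destination unreachable -- values are illustrative, not a recommendation):

 # show the current settings
 $ sysctl net.ipv4.icmp_ratelimit net.ipv4.icmp_ratemask
 # option A: set the per-type rate limit to 0, which disables limiting entirely
 $ sysctl -w net.ipv4.icmp_ratelimit=0
 # option B: clear bit 3 (destination unreachable) from the rate mask
 $ sysctl -w net.ipv4.icmp_ratemask=$(( $(sysctl -n net.ipv4.icmp_ratemask) & ~(1<<3) ))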

Comment 14 Juan Luis de Sousa-Valadas 2020-09-07 14:20:00 UTC
The case is closed, so this probably doesn't need a backport.
Newer releases with newer kernels don't have this issue; reopen this bug if a customer actually wants a backport.