Bug 1782857

Summary: Connections from pod to services without endpoints are not rejected immediately
Product: OpenShift Container Platform
Component: Networking
Sub component: openshift-sdn
Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX
Severity: low
Priority: unspecified
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: Linux
Type: Bug
Last Closed: 2019-12-23 11:49:47 UTC

Description Juan Luis de Sousa-Valadas 2019-12-12 14:01:16 UTC
Description of problem:
Kube-proxy creates a REJECT rule in the KUBE-SERVICES chain of the filter table, which is called from the OUTPUT chain.

The rule looks like:
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.239.59        /* telar/hello-openshift:8888-tcp has no endpoints */ tcp dpt:8888 reject-with icmp-port-unreachable
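
On the node, this rule can be listed with something like the following (the grep pattern is just a convenience, not required):
    # iptables -t filter -nvL KUBE-SERVICES | grep 'has no endpoints'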

However, because the packet is created in a different network namespace (the pod's, not the host's), it never traverses the OUTPUT path of the host's filter table, so iptables never matches it against the REJECT rule and the connection is not rejected immediately.

Instead, the connection only fails after ARP resolution for the service IP times out, which is much slower, and the effect is aggravated when there are many attempts; compare the actual results vs. the expected results below. This is a problem for clients that open a large number of connections.
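
The failing ARP resolution can be observed from inside the pod's net namespace. A minimal sketch, assuming the pod's interface is eth0 and <pid> is the pod's pid as used in the reproduction steps:
    # nsenter -n -t <pid> tcpdump -ni eth0 arp
With no endpoints behind the service, this should show repeated who-has requests for the service IP with no reply.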

Version-Release number of selected component (if applicable):
3.11
4.4 is almost certainly affected as well.

How reproducible:
Always

Steps to Reproduce:
1. oc new-project test-filter
2. oc new-app openshift/hello-openshift
3. oc scale dc/hello-openshift --replicas=0 (so the service has no endpoints)
4. oc new-app httpd
5. ssh to the node running the httpd pod
6. nsenter the httpd pod's net namespace (nsenter -n -t <pid>; one way to find the pid is shown after the steps)
7. echo "GET http://<svc ip>:<svc port>" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
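
For step 6, on a 3.11 node with the docker runtime the container pid can be found with something like the following (assuming docker; the commands differ for cri-o):
    # docker ps | grep httpd
    # docker inspect --format '{{.State.Pid}}' <container id>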

Actual results:
$ sudo nsenter -n -t 3978
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        21.046354763s, 4.979783809s, 16.066570954s
Latencies     [mean, 50, 90, 95, 99, max]  7.281432594s, 7.915426174s, 17.769387549s, 17.919384172s, 18.019570465s, 18.03946945s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: no route to host


Expected results:
I created an iptables rule in the pod's net namespace to compare timing.
$ sudo nsenter -n -t 3978
# iptables -A OUTPUT -d 172.30.239.59/32 -p tcp -m comment --comment "telar/hello-openshift:8080-tcp has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        5.981782675s, 4.979960498s, 1.001822177s
Latencies     [mean, 50, 90, 95, 99, max]  1.0017498s, 1.001660707s, 1.001843035s, 1.002594677s, 1.003528148s, 1.005008586s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: connection refused

With the rule in place, connections are refused immediately (connect: connection refused instead of connect: no route to host) and performance is dramatically better.

Additional info:
1- I'm using vegeta, which we don't support, because it's what the customer provided; we can probably use ab (Apache Bench) instead (see the example after this list).
2- We need to actually check the impact on /proc/sys/net/ipv4/icmp_msgs_burst, since the ICMP errors generated by REJECT are rate-limited (see the check after this list).
3- REJECT is only accepted in the filter table, so I'm not quite sure how to better implement this; perhaps doing a DNAT to the service IP fixes it, but I'd need to check.
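
For point 1, a roughly equivalent ab invocation might look like the following; the request count, concurrency, and timeout are assumptions chosen to approximate vegeta's 50 req/s for 5 s, not what the customer ran:
    # ab -n 250 -c 50 -s 20 http://172.30.239.59:8080/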
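
For point 2, the current ICMP rate-limit settings can be read on the node with standard Linux sysctls (not specific to OpenShift):
    # sysctl net.ipv4.icmp_msgs_per_sec net.ipv4.icmp_msgs_burst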

Comment 1 Casey Callendrello 2019-12-12 14:42:48 UTC
This is a known upstream bug; https://github.com/kubernetes/kubernetes/pull/72534 is the fix. That's in v1.14. We could backport that if there's demand.

Comment 2 Juan Luis de Sousa-Valadas 2019-12-23 11:49:47 UTC
Customer closed the case and has not requested a backport.

If a customer needs this we can backport it, but I'm closing it for now since nobody has requested it.