Bug 1782857

Summary: Connections from pod to services without endpoints are not rejected immediately
Product: OpenShift Container Platform
Component: Networking
Sub component: openshift-sdn
Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX
Severity: low
Priority: unspecified
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: Linux
Type: Bug
Last Closed: 2019-12-23 11:49:47 UTC

Description Juan Luis de Sousa-Valadas 2019-12-12 14:01:16 UTC
Description of problem:
Kube-proxy creates a REJECT rule in the KUBE-SERVICES chain of the filter table, which is called from the OUTPUT chain.

The rule looks like:
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.239.59        /* telar/hello-openshift:8888-tcp has no endpoints */ tcp dpt:8888 reject-with icmp-port-unreachable
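
On the node, this rule can be listed with something like the following (the grep pattern is just a convenience, not required):
    # iptables -t filter -nvL KUBE-SERVICES | grep 'has no endpoints'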

However, because the packet is created in a different network namespace (the pod's, not the host's), it never traverses the OUTPUT path of the host's filter table, so iptables never matches it against the REJECT rule and the connection is not rejected immediately.

Instead, the connection only fails after ARP resolution for the service IP times out, which is much slower, and the effect is aggravated when there are many attempts; compare the actual results vs. the expected results below. This is a problem for clients that open a large number of connections.
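
The failing ARP resolution can be observed from inside the pod's net namespace. A minimal sketch, assuming the pod's interface is eth0 and <pid> is the pod's pid as used in the reproduction steps:
    # nsenter -n -t <pid> tcpdump -ni eth0 arp
With no endpoints behind the service, this should show repeated who-has requests for the service IP with no reply.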

Version-Release number of selected component (if applicable):
3.11
4.4 is almost certainly affected as well.

How reproducible:
Always

Steps to Reproduce:
1. oc new-project test-filter
2. oc new-app openshift/hello-openshift
3. oc scale dc/hello-openshift --replicas=0 (so the service has no endpoints)
4. oc new-app httpd
5. ssh to the node running the httpd pod
6. nsenter the httpd pod's net namespace (nsenter -n -t <pid>; one way to find the pid is shown after the steps)
7. echo "GET http://<svc ip>:<svc port>" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
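
For step 6, on a 3.11 node with the docker runtime the container pid can be found with something like the following (assuming docker; the commands differ for cri-o):
    # docker ps | grep httpd
    # docker inspect --format '{{.State.Pid}}' <container id>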

Actual results:
$ sudo nsenter -n -t 3978
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        21.046354763s, 4.979783809s, 16.066570954s
Latencies     [mean, 50, 90, 95, 99, max]  7.281432594s, 7.915426174s, 17.769387549s, 17.919384172s, 18.019570465s, 18.03946945s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: no route to host


Expected results:
I created an iptables rule in the pod's net namespace to compare timing.
$ sudo nsenter -n -t 3978
# iptables -A OUTPUT -d 172.30.239.59/32 -p tcp -m comment --comment "telar/hello-openshift:8080-tcp has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        5.981782675s, 4.979960498s, 1.001822177s
Latencies     [mean, 50, 90, 95, 99, max]  1.0017498s, 1.001660707s, 1.001843035s, 1.002594677s, 1.003528148s, 1.005008586s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: connection refused

With the rule in place, connections are refused immediately (connect: connection refused instead of connect: no route to host) and performance is dramatically better.

Additional info:
1- I'm using vegeta, which we don't support, because it's what the customer provided; we can probably use ab (Apache Bench) instead (see the example after this list).
2- We need to actually check the impact on /proc/sys/net/ipv4/icmp_msgs_burst, since the ICMP errors generated by REJECT are rate-limited (see the check after this list).
3- REJECT is only accepted in the filter table, so I'm not quite sure how to better implement this; perhaps doing a DNAT to the service IP fixes it, but I'd need to check.
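
For point 1, a roughly equivalent ab invocation might look like the following; the request count, concurrency, and timeout are assumptions chosen to approximate vegeta's 50 req/s for 5 s, not what the customer ran:
    # ab -n 250 -c 50 -s 20 http://172.30.239.59:8080/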
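
For point 2, the current ICMP rate-limit settings can be read on the node with standard Linux sysctls (not specific to OpenShift):
    # sysctl net.ipv4.icmp_msgs_per_sec net.ipv4.icmp_msgs_burst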

Comment 1 Casey Callendrello 2019-12-12 14:42:48 UTC
This is a known upstream bug; https://github.com/kubernetes/kubernetes/pull/72534 is the fix. That's in v1.14. We could backport that if there's demand.

Comment 2 Juan Luis de Sousa-Valadas 2019-12-23 11:49:47 UTC
Customer closed the case and has not requested a backport.

If a customer needs this we can backport it, but I'm closing it for now since nobody has requested it.