Bug 1782857 - Connections from pod to services without endpoints are not rejected immediately
Summary: Connections from pod to services without endpoints are not rejected immediately
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.11.z
Assignee: Juan Luis de Sousa-Valadas
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-12 14:01 UTC by Juan Luis de Sousa-Valadas
Modified: 2023-03-24 16:25 UTC
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-23 11:49:47 UTC
Target Upstream Version:
Embargoed:



Description Juan Luis de Sousa-Valadas 2019-12-12 14:01:16 UTC
Description of problem:
Kube-proxy creates a REJECT rule in the KUBE-SERVICES chain of the filter table, which is reached from the OUTPUT chain.

The rule looks like:
    0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.239.59        /* telar/hello-openshift:8888-tcp has no endpoints */ tcp dpt:8888 reject-with icmp-port-unreachable
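For reference, the rule above can be listed on the node with something like (the grep is just to narrow the output):

    # iptables -t filter -nvL KUBE-SERVICES | grep 'has no endpoints'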

However, because the packet originates in a different net namespace, it never traverses the node's OUTPUT chain, so iptables never matches it against this filter rule and the connection is not rejected immediately.

What happens instead is that ARP resolution fails, which is much slower, and the effect is aggravated when there are many connection attempts; compare the actual results against the expected results below.
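To see where the filter-table KUBE-SERVICES chain is hooked in, the following can be run on the node (a diagnostic sketch, not from the original report):

    # iptables -t filter -S | grep -- '-j KUBE-SERVICES'

The chain is reached from OUTPUT, which only sees packets generated on the node itself; packets originating inside a pod's net namespace take the node's FORWARD path and therefore never hit the REJECT rule.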

Version-Release number of selected component (if applicable):
3.11
4.4 is almost certainly affected as well.

How reproducible:
Always

Steps to Reproduce:
1. oc new-project test-filter
2. oc new-app openshift/hello-openshift
3. oc scale dc/hello-openshift --replicas=0 (so the service has no endpoints)
4. oc new-app httpd
5. ssh to the node running the httpd pod
6. nsenter the httpd pod's net namespace (nsenter -n -t <pid>; see the note after these steps for one way to find the PID)
7. echo "GET http://<svc ip>:<svc port>" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
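For step 6, assuming the docker runtime used in 3.11, the container PID can be found with something like (the container name is hypothetical):

    $ docker ps | grep httpd
    $ docker inspect --format '{{.State.Pid}}' <container id>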

Actual results:
$ sudo nsenter -n -t 3978
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        21.046354763s, 4.979783809s, 16.066570954s
Latencies     [mean, 50, 90, 95, 99, max]  7.281432594s, 7.915426174s, 17.769387549s, 17.919384172s, 18.019570465s, 18.03946945s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: no route to host


Expected results:
I created an iptables rule in the pod's net namespace to compare timing.
$ sudo nsenter -n -t 3978
# iptables -A OUTPUT -d 172.30.239.59/32 -p tcp -m comment --comment "telar/hello-openshift:8080-tcp has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
# echo "GET http://172.30.239.59:8080" | /tmp/vegeta attack -duration=5s | tee results.bin | /tmp/vegeta report
Requests      [total, rate, throughput]    250, 50.20, 0.00
Duration      [total, attack, wait]        5.981782675s, 4.979960498s, 1.001822177s
Latencies     [mean, 50, 90, 95, 99, max]  1.0017498s, 1.001660707s, 1.001843035s, 1.002594677s, 1.003528148s, 1.005008586s
Bytes In      [total, mean]                0, 0.00
Bytes Out     [total, mean]                0, 0.00
Success       [ratio]                      0.00%
Status Codes  [code:count]                 0:250  
Error Set:
Get http://172.30.239.59:8080: dial tcp 0.0.0.0:0->172.30.239.59:8080: connect: connection refused

With the rule in place the connections are refused immediately: total duration drops from ~21s to ~6s and maximum latency from ~18s to ~1s.

Additional info:
1- I'm using vegeta, which we don't support, because it's what the customer provided; we could probably use ab (Apache Bench) instead.
2- We still need to check the impact on /proc/sys/net/ipv4/icmp_msgs_burst (see the note after this list).
3- REJECT is only accepted in the filter table, so I'm not quite sure how best to implement this; perhaps a DNAT to the service IP would fix it, but I'd need to check.
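As a starting point for item 2, the ICMP rate-limiting knobs can be inspected on the node like this (a diagnostic sketch; the values depend on the host):

    $ sysctl net.ipv4.icmp_msgs_per_sec net.ipv4.icmp_msgs_burst net.ipv4.icmp_ratelimit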

Comment 1 Casey Callendrello 2019-12-12 14:42:48 UTC
This is a known upstream bug; https://github.com/kubernetes/kubernetes/pull/72534 is the fix. That's in v1.14. We could backport that if there's demand.

Comment 2 Juan Luis de Sousa-Valadas 2019-12-23 11:49:47 UTC
The customer closed the case and has not requested a backport.

If a customer needs this we can backport it, but since nobody has requested it I'm closing this for now.

