Bug 1736002 - iptables loadbalancing is not balanced
Summary: iptables loadbalancing is not balanced
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
Depends On:
Reported: 2019-08-01 13:46 UTC by Dan Winship
Modified: 2019-08-20 14:02 UTC

Fixed In Version:
Doc Text:
Clone Of:
Last Closed: 2019-08-20 14:02:47 UTC
Target Upstream Version:


Description Dan Winship 2019-08-01 13:46:14 UTC
from bug 1734509:

    It looks like .165 is consistently busy, .184 is consistently
    somewhat less busy, and .216 is suspiciously slacking off almost
    the whole time. Counting the number of kube-apiserver log
    messages from each endpoint in each 10-minute period:

             .165   .184   .216

    15:0x    1924   1489   1635
    15:1x    1696    995    249
    15:2x    1368    654     62
    15:3x    1406    700     95
    15:4x    1053    534    103
    15:5x     440    184      8
    16:0x      92     40      8

    This corresponds to the ordering of the endpoints in the iptables
    rules, so it seems like iptables isn't actually balancing
    connections correctly.

This is not a new bug. It appears to have always been this way and we just didn't notice. (Or at least, it also shows up in the logs of a randomly-selected test run from January.)

The iptables rules:

-A KUBE-SERVICES -d -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y

-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-ZDFKTDCPOS2CD6PV
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-GKTS3YR2HYOAX2GL

-A KUBE-SEP-ZDFKTDCPOS2CD6PV -p tcp -m tcp -j DNAT --to-destination
-A KUBE-SEP-GKTS3YR2HYOAX2GL -p tcp -m tcp -j DNAT --to-destination
-A KUBE-SEP-DJPSN5YGTYNXPKQR -p tcp -m tcp -j DNAT --to-destination

Ignoring the rounding error, this *should* work: the first KUBE-SVC-NPX4... rule matches 1/3 of packets, the second matches 1/2 of the packets that didn't match the first rule, and the last matches all of the packets that didn't match either of the first two rules. That should give us 1/3 / 1/3 / 1/3. But apparently it doesn't.
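
The arithmetic above can be checked with a short sketch. It assumes each "statistic --mode random" rule matches independently with its listed probability and that an unconditional rule for the last endpoint follows the two shown. The function name effective_shares is hypothetical and not part of kube-proxy.

```python
def effective_shares(probabilities):
    """Return the fraction of new connections each endpoint receives.

    `probabilities` lists the --probability value of each rule in chain
    order. A final catch-all rule is assumed to follow them (assumption:
    it is not shown in the rules quoted in this bug).
    """
    shares = []
    remaining = 1.0  # fraction of connections that reach the current rule
    for p in probabilities:
        shares.append(remaining * p)   # matched by this rule
        remaining *= 1.0 - p           # falls through to the next rule
    shares.append(remaining)           # catch-all takes whatever is left
    return shares


shares = effective_shares([0.33332999982, 0.50000000000])
print(shares)
```

Each of the three shares comes out within about 1e-5 of 1/3, so the rule probabilities themselves do not explain the skew in the log counts.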

Comment 1 Casey Callendrello 2019-08-05 16:01:56 UTC
This is probably just a long-running connection problem. When an API server is disrupted, every client reconnects to the 1 or 2 endpoints that are still available and then never disconnects.

Perhaps client-go should have an exponentially-distributed random reconnection interval?
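
As a rough illustration of that idea (in Python, not client-go code, and not an existing client-go option), a client could draw the lifetime of each connection from an exponential distribution so that reconnects are spread out over time. The helper name next_reconnect_delay and the 600-second mean are assumptions made for this sketch.

```python
import random


def next_reconnect_delay(mean_seconds, rng=random):
    """Sample how long to keep a connection before reconnecting.

    The delay is exponentially distributed with the given mean.
    """
    # random.expovariate takes the rate (1 / mean), not the mean itself.
    return rng.expovariate(1.0 / mean_seconds)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
delays = [next_reconnect_delay(600.0, rng) for _ in range(10000)]
mean_delay = sum(delays) / len(delays)
print(f"mean delay: {mean_delay:.1f}s")
```

Because each reconnect goes through the iptables rules again, clients that piled onto one endpoint during a disruption would gradually drift back toward an even spread.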

Comment 2 Dan Winship 2019-08-12 09:51:15 UTC
Oh, interesting. That should be easy to prove if so. (See if a service with lots of short connections shows the same distribution.) I guess if that is what the problem is, then the next question is "does it really matter or are we fine with the fact that some apiservers work harder than others?"

Comment 3 Casey Callendrello 2019-08-12 11:59:43 UTC
I "tested" this some time ago, when we threw apachebench against a throw-away service. Well, more precisely, a colleague walked up to my desk, asking why load-balancing was broken. They had seen very uneven load-balancing despite issuing many homogeneous requests.

It turned out that apachebench uses keep-alive by default, and never ever reconnects. Random load-balancing is only effective for a large number of requests, of course. So, their 10-or-so load-generation connections were balanced unevenly since there just weren't enough coin-flips to regress to the mean.
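
The effect of too few coin flips can be reproduced with a small simulation. This is a sketch that assumes each connection is assigned to one of three endpoints uniformly at random and then kept open. The function name spread and the connection counts are chosen only for illustration.

```python
import random
from collections import Counter


def spread(num_connections, num_endpoints, rng):
    """Assign each connection to a uniformly random endpoint.

    Returns the number of connections that landed on each endpoint.
    """
    counts = Counter(rng.randrange(num_endpoints) for _ in range(num_connections))
    return [counts[i] for i in range(num_endpoints)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
few = spread(10, 3, rng)
many = spread(100000, 3, rng)
print("10 connections:    ", few)
print("100000 connections:", many)
```

With only 10 persistent connections the per-endpoint counts can easily be lopsided, while with 100000 connections each endpoint ends up very close to a third of the total.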

Comment 4 Casey Callendrello 2019-08-20 14:02:47 UTC
Marking this as NOTABUG - I think we're fine here.
