from bug 1734509:
It looks like 10.0.135.165 is consistently busy, 10.0.143.184
is consistently somewhat less busy, and 10.0.155.216 is
suspiciously slacking off almost the whole time. Counting the
number of kube-apiserver log messages in each 10-minute period:
        .165  .184  .216
15:0x   1924  1489  1635
15:1x   1696   995   249
15:2x   1368   654    62
15:3x   1406   700    95
15:4x   1053   534   103
15:5x    440   184     8
16:0x     92    40     8
This corresponds to the ordering of the endpoints in the iptables
rules, so it seems like iptables isn't actually balancing the
connections evenly.
This is not a new bug. It appears to have always been this way and we just didn't notice. (Or at least, it also shows up in the logs of a randomly-selected test run from January.)
The iptables rules:
-A KUBE-SERVICES -d 172.30.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-ZDFKTDCPOS2CD6PV
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-GKTS3YR2HYOAX2GL
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-DJPSN5YGTYNXPKQR
-A KUBE-SEP-ZDFKTDCPOS2CD6PV -p tcp -m tcp -j DNAT --to-destination 10.0.135.165:6443
-A KUBE-SEP-GKTS3YR2HYOAX2GL -p tcp -m tcp -j DNAT --to-destination 10.0.143.184:6443
-A KUBE-SEP-DJPSN5YGTYNXPKQR -p tcp -m tcp -j DNAT --to-destination 10.0.155.216:6443
Ignoring the rounding error, this *should* work: the first KUBE-SVC-NPX4... rule matches 1/3 of packets, the second matches 1/2 of the packets that didn't match the first rule, and the last matches all of the packets that didn't match either of the first two rules. That should give us 1/3 / 1/3 / 1/3. But apparently it doesn't.
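As a sanity check (not from the bug itself, just illustrative), the rule cascade can be simulated in a few lines of Python. Each rule is tried in order with its `--mode random --probability p` value, and the counts should come out close to 1/3 each:

```python
import random

def pick_endpoint():
    # Mimic the KUBE-SVC chain: rules are tried in order, each
    # matching with its --mode random --probability value.
    if random.random() < 1/3:      # first rule: 1/3 of all traffic
        return ".165"
    if random.random() < 1/2:      # second rule: 1/2 of the remaining 2/3
        return ".184"
    return ".216"                  # final rule: everything left over

trials = 300_000
counts = {".165": 0, ".184": 0, ".216": 0}
for _ in range(trials):
    counts[pick_endpoint()] += 1

for ep, n in counts.items():
    print(ep, round(n / trials, 3))  # each close to 0.333
```

So per *connection* the math works out; the skew in the logs has to come from somewhere else.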
This is probably just a long-lived-connection problem: the API server is disrupted, everyone reconnects to the 1 or 2 available endpoints, and then never disconnects.
Perhaps client-go should have an exponentially-distributed random reconnection interval?
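A minimal sketch of what that could look like (hypothetical, not existing client-go behavior): draw the wait before re-dialing from an exponential distribution, so a herd of clients disconnected at the same instant doesn't reconnect in lockstep:

```python
import random

def reconnect_delay(mean_seconds=30.0):
    # Exponentially-distributed jitter for the reconnect interval.
    # mean_seconds is an assumed tunable, not a real client-go knob.
    return random.expovariate(1.0 / mean_seconds)

samples = [reconnect_delay(30.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 30.0
```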
Oh, interesting. That should be easy to prove if so. (See if a service with lots of short connections shows the same distribution.) I guess if that is what the problem is, then the next question is "does it really matter or are we fine with the fact that some apiservers work harder than others?"
I "tested" this some time ago, when we threw apachebench against a throw-away service. Well, more precisely, a colleague walked up to my desk, asking why load-balancing was broken. They had seen very uneven load-balancing despite issuing many homogeneous requests.
It turned out that apachebench uses keep-alive by default, and never ever reconnects. Random load-balancing is only effective for a large number of requests, of course. So, their 10-or-so load-generation connections were balanced unevenly since there just weren't enough coin-flips to regress to the mean.
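The "not enough coin flips" effect is easy to see in a toy simulation (illustrative only): assign each connection to a uniformly random backend, the way the iptables statistic rules do per connection, and compare a handful of connections against many:

```python
import random

def balance(num_connections, num_backends=3):
    # Each new connection independently picks a random backend.
    counts = [0] * num_backends
    for _ in range(num_connections):
        counts[random.randrange(num_backends)] += 1
    return counts

# A handful of keep-alive connections: the split is often far from even.
print(balance(10))
# Many short connections: the law of large numbers takes over.
print(balance(10_000))
```

With 10 connections a 5/4/1 or worse split is routine; with 10,000 the per-backend counts land within a few percent of each other.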
Marking this as NOTABUG - I think we're fine here.