from bug 1734509: It looks like 10.0.135.165 is consistently busy, 10.0.143.184 is consistently somewhat less busy, and 10.0.155.216 is suspiciously slacking off almost the whole time. Counting the number of kube-apiserver log messages in each 10-minute period:

        .165   .184   .216
15:0x   1924   1489   1635
15:1x   1696    995    249
15:2x   1368    654     62
15:3x   1406    700     95
15:4x   1053    534    103
15:5x    440    184      8
16:0x     92     40      8

This corresponds to the ordering of the endpoints in the iptables rules, so it seems like iptables isn't actually balancing connections correctly. This is not a new bug. It appears to have always been this way and we just didn't notice. (Or at least, it also shows up in the logs of a randomly-selected test run from January.)

The iptables rules:

-A KUBE-SERVICES -d 172.30.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-ZDFKTDCPOS2CD6PV
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-GKTS3YR2HYOAX2GL
-A KUBE-SVC-NPX46M4PTMTKRN6Y -j KUBE-SEP-DJPSN5YGTYNXPKQR
-A KUBE-SEP-ZDFKTDCPOS2CD6PV -p tcp -m tcp -j DNAT --to-destination 10.0.135.165:6443
-A KUBE-SEP-GKTS3YR2HYOAX2GL -p tcp -m tcp -j DNAT --to-destination 10.0.143.184:6443
-A KUBE-SEP-DJPSN5YGTYNXPKQR -p tcp -m tcp -j DNAT --to-destination 10.0.155.216:6443

Ignoring the rounding error, this *should* work: the first KUBE-SVC-NPX4... rule matches 1/3 of packets, the second matches 1/2 of the packets that didn't match the first rule, and the last matches all of the packets that didn't match either of the first two rules. That should give us 1/3 / 1/3 / 1/3. But apparently it doesn't.
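For what it's worth, if each new connection really did roll those probabilities independently, the math does work out. A minimal standalone sketch (my own simulation, not kube-proxy code; the endpoint IPs are just copied from the rules above) of the probability cascade:

// Simulate the "-m statistic --mode random --probability" cascade for
// a large number of *independent* new connections.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	endpoints := []string{"10.0.135.165", "10.0.143.184", "10.0.155.216"}
	counts := make(map[string]int)

	const connections = 100000
	for i := 0; i < connections; i++ {
		switch {
		case rand.Float64() < 0.33332999982: // first KUBE-SVC rule: ~1/3 of everything
			counts[endpoints[0]]++
		case rand.Float64() < 0.5: // second rule: 1/2 of the remaining ~2/3
			counts[endpoints[1]]++
		default: // final rule catches whatever is left
			counts[endpoints[2]]++
		}
	}

	for _, ep := range endpoints {
		fmt.Printf("%s: %d (%.1f%%)\n", ep, counts[ep], 100*float64(counts[ep])/connections)
	}
}

Simulated this way, each endpoint comes out at roughly a third, so the rules themselves do the right thing for independent connections.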
This is probably just a long-lived-connection problem: an apiserver is disrupted, everyone reconnects to the 1 or 2 available endpoints, and then never disconnects. Perhaps client-go should have an exponentially-distributed random reconnection interval?
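A very rough sketch of what that could look like (hypothetical; nothing like this exists in client-go today, and the function name is made up): draw the interval from an exponential distribution so clients that all reconnected at the same moment drift apart again instead of sticking to the same endpoint forever.

// Hypothetical sketch only, not a client-go API: pick an
// exponentially-distributed delay before (re)connecting.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// reconnectAfter draws a delay with the given mean from an exponential
// distribution (rand.ExpFloat64 has mean 1, so we just scale it).
func reconnectAfter(mean time.Duration) time.Duration {
	return time.Duration(rand.ExpFloat64() * float64(mean))
}

func main() {
	// Pretend five clients all lost their connection at the same time;
	// with exponentially-distributed intervals they come back (and would
	// later re-balance) at scattered times instead of all at once.
	for i := 0; i < 5; i++ {
		fmt.Printf("client %d reconnects after %v\n", i, reconnectAfter(30*time.Second).Round(time.Second))
	}
}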
Oh, interesting. That should be easy to prove if so. (See if a service with lots of short connections shows the same distribution.) I guess if that is what the problem is, then the next question is "does it really matter or are we fine with the fact that some apiservers work harder than others?"
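Something like this would do as the short-connection test (hypothetical sketch with made-up names; it assumes a throw-away Service whose backend pods just reply with their own hostname):

// Fire many short-lived requests at a test Service and count how often
// each backend answers.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical throw-away Service; each backend pod is assumed to
	// answer with its own hostname in the response body.
	const serviceURL = "http://echo.test.svc.cluster.local/"

	// Disable keep-alives so every request is a brand-new connection and
	// therefore a fresh trip through the KUBE-SVC probability rules.
	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		resp, err := client.Get(serviceURL)
		if err != nil {
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		counts[strings.TrimSpace(string(body))]++
	}

	for pod, n := range counts {
		fmt.Printf("%-30s %d\n", pod, n)
	}
}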
I "tested" this some time ago, when we threw apachebench against a throw-away service. Well, more precisely, a colleague walked up to my desk, asking why load-balancing was broken. They had seen very uneven load-balancing despite issuing many homogeneous requests. It turned out that apachebench uses keep-alive by default, and never ever reconnects. Random load-balancing is only effective for a large number of requests, of course. So, their 10-or-so load-generation connections were balanced unevenly since there just weren't enough coin-flips to regress to the mean.
Marking this as NOTABUG - I think we're fine here.