Belated update from Friday: it looks like the problem is:
1. They have so many iptables rules that even things like
"iptables -C" take a long time because of iptables API
awfulness, and:
2. Because of oddities in RHEL backporting and k8s iptables
feature detection, OCP on RHEL 7 decides that
"iptables-restore" supports "--wait=2", but "iptables" only
supports "--wait" (forever).
3. So the random periodic /sbin/iptables resync calls end up causing
kube-proxy's iptables-restore calls to time out and fail.
4. Fix: bump iptablesSyncPeriod up to something ridiculously high.
This should hopefully get the customer's cluster stable enough that they can progress with their upgrade plans. We should look into having this work better out of the box in 3.11 at least.
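The contention in points 1-3 can be sketched as a toy model. Every iptables binary serializes on the xtables lock. A caller using plain "--wait" blocks for as long as it takes. A caller using "--wait=2" gives up after its timeout. The sketch below uses a threading.Lock as a stand-in for that lock. The function names and the scaled-down timings are hypothetical and are not taken from kube-proxy or OCP.

```python
import threading
import time


def slow_iptables_check(lock, hold_seconds, acquired):
    """Hypothetical stand-in for a slow `iptables -C` run with plain --wait:
    it blocks until it gets the lock, then holds it for a long time."""
    with lock:  # plain --wait: block with no timeout
        acquired.set()  # signal that the lock is now held
        time.sleep(hold_seconds)  # huge rule set => long-running check


def iptables_restore(lock, wait_seconds):
    """Hypothetical stand-in for `iptables-restore --wait=N`: give up if the
    lock is not acquired within wait_seconds. Returns True on success."""
    if not lock.acquire(timeout=wait_seconds):
        return False  # kube-proxy would see this as a failed sync
    try:
        return True  # the real tool would apply the rules here
    finally:
        lock.release()


def simulate(hold_seconds, wait_seconds):
    """Run one slow check and one restore against the same lock (timings scaled down)."""
    lock = threading.Lock()
    acquired = threading.Event()
    checker = threading.Thread(
        target=slow_iptables_check, args=(lock, hold_seconds, acquired)
    )
    checker.start()
    acquired.wait()  # make sure the slow check holds the lock first
    ok = iptables_restore(lock, wait_seconds)
    checker.join()
    return ok
```

In this model the restore fails only when the lock is held longer than the restore's timeout. A very high iptablesSyncPeriod helps because it makes those long hold windows much rarer.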
Sorry, yeah, this didn't merge until after 4.3 split off, so the bug should have been moved to 4.4. Ignore the comments from the errata system; it's lying.