Belated update from Friday: it looks like the problem is:

1. They have so many iptables rules that even read-only operations like "iptables -C" take a long time, because of iptables API awfulness.
2. Because of oddities in RHEL backporting and k8s iptables feature detection, OCP on RHEL 7 decides that "iptables-restore" supports "--wait=2", but that "iptables" only supports "--wait" (i.e., wait forever).
3. So the random periodic /sbin/iptables resync calls end up holding the xtables lock long enough that kube-proxy's iptables-restore calls time out and fail (see the sketch after this comment).
4. Fix: bump iptablesSyncPeriod up to something ridiculously high.

This should hopefully get the customer's cluster stable enough that they can proceed with their upgrade plans. We should look into having this work better out of the box, in 3.11 at least.
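To make the failure mode concrete, here's a hypothetical standalone sketch (not the actual kube-proxy or resync code; the rule, the restore input, and the timing are made up, and it needs root plus a real iptables install to actually run) of how a block-forever iptables call starves a bounded-wait iptables-restore:

package main

// Hypothetical illustration of the race described above: a long-running
// /sbin/iptables call queues on the xtables lock with a plain --wait
// (block indefinitely), while a concurrent iptables-restore invoked with
// --wait=2 gives up after two seconds and fails.

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Stand-in for one of the periodic resync calls: with a huge rule
	// set even a read-only check like "iptables -C" is slow, and the
	// bare --wait means it waits on the xtables lock forever.
	check := exec.Command("/sbin/iptables", "--wait", "-C", "INPUT", "-j", "ACCEPT")
	go check.Run() // let it grab the lock first

	// Stand-in for kube-proxy's bulk sync: --wait=2 bounds the wait, so
	// it times out while the slow call above still holds the lock.
	restore := exec.Command("iptables-restore", "--wait=2")
	restore.Stdin = strings.NewReader("*filter\nCOMMIT\n")
	if out, err := restore.CombinedOutput(); err != nil {
		fmt.Printf("iptables-restore failed (lock contention): %v\n%s", err, out)
	}
}

With enough rules, the plain-iptables side effectively never releases the lock in time, so every restore attempt loses within its 2-second window.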
So in the ose enterprise code we should set WaitSecondsMinVersion = "1.4.21" instead of the current "1.4.22", so that we make use of the --wait=seconds feature, which was backported to RHEL 7's iptables via bug 1438597: https://github.com/openshift/ose/blob/enterprise-3.11/vendor/k8s.io/kubernetes/pkg/util/iptables/iptables.go#L127
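For reference, here's a loose sketch of the version gate involved (the real logic lives in the iptables.go linked above and uses the Kubernetes version utilities; parseVer/atLeast below are self-contained stand-ins), showing why lowering WaitSecondsMinVersion to "1.4.21" makes RHEL 7's iptables 1.4.21 take the bounded --wait=seconds path:

package main

// Loose sketch of the wait-flag selection in pkg/util/iptables.
// Lowering WaitSecondsMinVersion to "1.4.21" makes RHEL 7's backported
// binary take the bounded --wait=seconds path instead of the
// block-forever --wait path.

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	WaitMinVersion        = "1.4.20" // plain --wait supported from here on
	WaitSecondsMinVersion = "1.4.21" // lowered from "1.4.22" for the RHEL 7 backport (bug 1438597)
)

// parseVer splits a version like "1.4.21" into comparable integer fields.
func parseVer(s string) (v [3]int) {
	for i, p := range strings.SplitN(s, ".", 3) {
		v[i], _ = strconv.Atoi(p)
	}
	return
}

// atLeast reports whether version "have" is >= version "want".
func atLeast(have, want string) bool {
	h, w := parseVer(have), parseVer(want)
	for i := range h {
		if h[i] != w[i] {
			return h[i] > w[i]
		}
	}
	return true
}

// waitFlag picks the wait arguments for a detected iptables version.
func waitFlag(detected string) []string {
	switch {
	case atLeast(detected, WaitSecondsMinVersion):
		return []string{"-w", "2"} // bounded: give up after 2 seconds
	case atLeast(detected, WaitMinVersion):
		return []string{"-w"} // unbounded: block on the xtables lock forever
	default:
		return nil // no wait support at all
	}
}

func main() {
	// RHEL 7 ships iptables 1.4.21; with the lowered constant it now
	// gets the bounded form, matching what iptables-restore already uses.
	fmt.Println(waitFlag("1.4.21")) // [-w 2]
}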
Although this is ON_QA, it still needs to be backported to 4.3, as confirmed with Dan Winship.
Sorry, yeah, this didn't merge until after 4.3 split off, so the bug should have been moved to 4.4. Ignore the comments from the errata system; it's lying.
And actually, there were two parts to the fix, one in origin and one in sdn, and only the origin half merged, so this isn't fully fixed even in 4.4.
This was merged in the 1.17 rebase.

*** This bug has been marked as a duplicate of bug 1803149 ***