iptables-restore has a 2s wait timeout. Data collected today shows that even with a much faster kernel we can reasonably expect iptables-restore to take upwards of 2.4 seconds (with an unpatched/released RHEL kernel this can easily take 7-8 seconds). The longest runs I saw over about 30 minutes were: 2.267244, 2.284707, 2.291535, 2.376457. If two iptables-restores run at the same time, with a 2s timeout it is very likely the second will fail. I'd like to suggest a 5s timeout. It should still bound how long threads may be waiting, and it increases the likelihood that a common situation will be automatically resolved without failing up the stack. Thoughts?
@eparis: Is that what we want? https://github.com/openshift/origin/pull/17062
dcbw had indicated that some of the time that iptables-restore takes is just parsing the very large number of rules. Unfortunately, it looks like it grabs the lock *before* parsing, rather than *after*, so it's staying locked longer than it needs to. We should fix that.
We can file a RHEL BZ for that, I guess; I'll do so. But given the 2.4s runs (not waiting for the lock), I think a 5s timeout makes sense.
Actually, it looks like fixing it would be pretty hard so maybe don't bother
There is a hardcoded string in iptables.go: https://github.com/rajatchopra/kubernetes/blob/c5740a37379aa4905c9505082212610a1ac022c6/pkg/util/iptables/iptables.go#L595 which causes the OpenShift node log to always show: Nov 07 15:01:18 ose-node1.bmeng.local atomic-openshift-node[97540]: I1107 15:01:18.899238 97540 iptables.go:371] running iptables-restore [--wait=2 --noflush --counters]
Thanks Meng Bo. Kube PR to correct that: https://github.com/kubernetes/kubernetes/pull/55248. Will backport shortly.
The Origin PR is https://github.com/openshift/origin/pull/17222
Please test it on build 3.7.4-1 or a newer version.
Verified on OCP v3.7.4-1: Nov 09 18:58:53 ose-node2.bmeng.local atomic-openshift-node[25845]: I1109 18:58:52.989607 25845 iptables.go:371] running iptables-restore [-w5 --noflush --counters]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188