Bug 1506396 - Increase iptables-restore timeout
Summary: Increase iptables-restore timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assignee: Rajat Chopra
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-25 21:45 UTC by Eric Paris
Modified: 2017-11-28 22:19 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: This is tuning rather than a new feature. Reason: The previous iptables-restore wait value was not suitable when two operations ran at the same time. Result: A longer wait time for iptables operations, so that they finish cleanly, in order, and without failures.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:19:38 UTC
Target Upstream Version:
Embargoed:




Links
Origin (GitHub) PR 17062 - 2017-10-30 12:55:30 UTC
Origin (GitHub) PR 17222 - 2017-11-07 17:44:46 UTC
Red Hat Product Errata RHSA-2017:3188 (SHIPPED_LIVE) - Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update - 2017-11-29 02:34:54 UTC

Description Eric Paris 2017-10-25 21:45:14 UTC
iptables-restore has a 2s wait timeout. Data collected today shows that even with a much faster kernel we can reasonably expect iptables-restore to take upwards of 2.4 seconds. (With the unpatched/released RHEL kernel this can easily take 7-8 seconds.)

The longest runs I saw over about 30 minutes were (in seconds):
2.267244
2.284707
2.291535
2.376457

If we get two iptables-restore operations going at the same time, a 2s timeout makes it very likely that the second will fail.

I'd like to suggest a 5s timeout. It still bounds the number of threads we may be waiting on, and it increases the likelihood that this common situation resolves automatically instead of failing up the stack.
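
For illustration, here is a minimal Go sketch of the shape of the change (hypothetical helper names, not the actual kube-proxy source; the flag spelling follows the node logs quoted later in this bug):

    // Hypothetical sketch: pass a configurable --wait to iptables-restore
    // instead of the hardcoded 2 seconds.
    package main

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // restoreWaitSeconds is the proposed lock timeout: iptables-restore
    // fails if it cannot take the xtables lock within this many seconds.
    const restoreWaitSeconds = 5

    func runRestore(rules []byte) error {
        args := []string{fmt.Sprintf("--wait=%d", restoreWaitSeconds), "--noflush", "--counters"}
        cmd := exec.Command("iptables-restore", args...)
        cmd.Stdin = bytes.NewReader(rules)
        if out, err := cmd.CombinedOutput(); err != nil {
            return fmt.Errorf("iptables-restore %v failed: %v: %s", args, err, out)
        }
        return nil
    }

    func main() {
        // A no-op rule set; the real caller feeds the full generated table.
        if err := runRestore([]byte("*filter\nCOMMIT\n")); err != nil {
            fmt.Println(err)
        }
    }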

Thoughts?

Comment 1 Rajat Chopra 2017-10-26 22:18:06 UTC
@eparis: Is that what we want? https://github.com/openshift/origin/pull/17062

Comment 2 Dan Winship 2017-10-27 17:58:20 UTC
dcbw had indicated that some of the time that iptables-restore takes is just parsing the very large number of rules. Unfortunately, it looks like it grabs the lock *before* parsing, rather than *after*, so it's staying locked longer than it needs to. We should fix that.
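
For intuition, the ordering difference looks roughly like this Go sketch (an illustration only; iptables itself is C, and parse/commit are hypothetical stand-ins):

    // Illustration of the lock-ordering issue described above.
    package main

    import (
        "strings"
        "sync"
    )

    var xtablesLock sync.Mutex // stands in for the xtables run lock

    func parse(input []byte) []string { return strings.Split(string(input), "\n") }
    func commit(rules []string)       {} // stands in for pushing rules to the kernel

    // What iptables-restore apparently does: a very large rule set is
    // parsed while the lock is held, so every other caller waits on it.
    func restoreLockFirst(input []byte) {
        xtablesLock.Lock()
        defer xtablesLock.Unlock()
        commit(parse(input))
    }

    // The cheaper ordering: parse outside the lock, hold it only to commit.
    func restoreParseFirst(input []byte) {
        rules := parse(input)
        xtablesLock.Lock()
        defer xtablesLock.Unlock()
        commit(rules)
    }

    func main() {
        restoreLockFirst([]byte("*filter\nCOMMIT\n"))
        restoreParseFirst([]byte("*filter\nCOMMIT\n"))
    }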

Comment 3 Eric Paris 2017-10-27 18:15:22 UTC
We can file a RHEL BZ for that, I guess; I'll do so. But seeing 2.4s (not waiting for the lock), I think a 5s timeout makes sense.

Comment 4 Dan Winship 2017-10-27 18:34:36 UTC
Actually, it looks like fixing it would be pretty hard, so maybe don't bother.

Comment 6 Meng Bo 2017-11-07 07:02:03 UTC
There is a hardcoded string in iptables.go:
https://github.com/rajatchopra/kubernetes/blob/c5740a37379aa4905c9505082212610a1ac022c6/pkg/util/iptables/iptables.go#L595

which causes the OpenShift node log to always show:
Nov 07 15:01:18 ose-node1.bmeng.local atomic-openshift-node[97540]: I1107 15:01:18.899238   97540 iptables.go:371] running iptables-restore [--wait=2 --noflush --counters]
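
For what it's worth, a hypothetical Go reduction of what the hardcoded string means here (illustrative names, not the actual kube code; the two outputs match the log lines in this bug):

    // Hypothetical reduction of the hardcoded-string bug: the wait value
    // was baked into a literal instead of coming from configuration.
    package main

    import "fmt"

    func buggyArgs() []string {
        // Hardcoded: always 2 seconds, no matter what was configured.
        return []string{"--wait=2", "--noflush", "--counters"}
    }

    func fixedArgs(waitFlag string) []string {
        // The fix: thread the configured flag through (see the Kube PR below).
        return append([]string{waitFlag}, "--noflush", "--counters")
    }

    func main() {
        fmt.Println(buggyArgs())      // [--wait=2 --noflush --counters]
        fmt.Println(fixedArgs("-w5")) // [-w5 --noflush --counters], as verified below
    }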

Comment 7 Ben Bennett 2017-11-07 16:03:43 UTC
Thanks Meng Bo.

Kube PR to correct that -- https://github.com/kubernetes/kubernetes/pull/55248

Will backport shortly.

Comment 8 Ben Bennett 2017-11-07 17:44:46 UTC
The Origin PR is https://github.com/openshift/origin/pull/17222

Comment 9 Xiaoli Tian 2017-11-09 03:33:33 UTC
Please test it on build 3.7.4-1 or a newer version.

Comment 10 Meng Bo 2017-11-09 10:59:49 UTC
Nov 09 18:58:53 ose-node2.bmeng.local atomic-openshift-node[25845]: I1109 18:58:52.989607   25845 iptables.go:371] running iptables-restore [-w5 --noflush --counters]

Verified on OCP v3.7.4-1.

Comment 13 errata-xmlrpc 2017-11-28 22:19:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

