Bug 1506396 - Increase iptables-restore timeout
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Rajat Chopra
QA Contact: Meng Bo
Reported: 2017-10-25 17:45 EDT by Eric Paris
Modified: 2017-11-28 17:19 EST
CC: 5 users

Doc Type: Enhancement
Doc Text:
Feature: This is 'tuning' rather than an enhancement. Reason: The previous wait value was not suitable if two iptables operations ran at the same time. Result: A longer wait time for iptables operations so that they finish neatly, in order, without failures.
Last Closed: 2017-11-28 17:19:38 EST
Type: Bug




External Trackers
Tracker ID Priority Status Summary Last Updated
Origin (Github) 17062 None None None 2017-10-30 08:55 EDT
Origin (Github) 17222 None None None 2017-11-07 12:44 EST
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-28 21:34:54 EST

Description Eric Paris 2017-10-25 17:45:14 EDT
iptables-restore has a 2s wait timeout. Data collected today shows that even with a much faster kernel we can reasonably expect iptables-restore to take upwards of 2.4 seconds (with an unpatched/released RHEL kernel this can easily take 7-8 seconds).

The longest runs I saw over about 30 minutes were:
2.267244
2.284707
2.291535
2.376457

If we get 2 iptables restores going at the same time, with a 2s timeout it is very likely the second will fail.

I'd like to suggest a 5s timeout. It should still bound the number of threads we may be waiting on, and it increases the likelihood that a common situation will be automatically resolved without failing up the stack.

Thoughts?
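
For illustration only, here is a minimal Go sketch (not the actual kube util/iptables code; the function name and sample ruleset are made up) of shelling out to iptables-restore with a configurable --wait value, which is the knob this bug proposes raising from 2 to 5:

package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"time"
)

// restoreRules pipes a rule dump into iptables-restore and waits up to
// waitSeconds for the xtables lock instead of the old 2-second value.
func restoreRules(rules []byte, waitSeconds int) error {
	cmd := exec.Command("iptables-restore",
		fmt.Sprintf("--wait=%d", waitSeconds), "--noflush", "--counters")
	cmd.Stdin = bytes.NewReader(rules)
	start := time.Now()
	err := cmd.Run()
	fmt.Printf("iptables-restore took %v\n", time.Since(start))
	return err
}

func main() {
	// Trivial no-op ruleset so the example runs end to end (needs root).
	rules := []byte("*filter\nCOMMIT\n")
	if err := restoreRules(rules, 5); err != nil {
		fmt.Println("restore failed:", err)
	}
}

With --wait=2 a second concurrent restore gives up after 2 seconds of waiting for the xtables lock; with --wait=5 it can ride out the ~2.4s runs listed above.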
Comment 1 Rajat Chopra 2017-10-26 18:18:06 EDT
@eparis: Is that what we want? https://github.com/openshift/origin/pull/17062
Comment 2 Dan Winship 2017-10-27 13:58:20 EDT
dcbw had indicated that some of the time that iptables-restore takes is just parsing the very large number of rules. Unfortunately, it looks like it grabs the lock *before* parsing, rather than *after*, so it's staying locked longer than it needs to. We should fix that.
Comment 3 Eric Paris 2017-10-27 14:15:22 EDT
We can file a RHEL BZ for that, I guess. I'll do so. But seeing 2.4s (not waiting for the lock), I think a 5s timeout makes sense.
Comment 4 Dan Winship 2017-10-27 14:34:36 EDT
Actually, it looks like fixing it would be pretty hard, so maybe don't bother.
Comment 6 Meng Bo 2017-11-07 02:02:03 EST
There is a hardcoded string in iptables.go:
https://github.com/rajatchopra/kubernetes/blob/c5740a37379aa4905c9505082212610a1ac022c6/pkg/util/iptables/iptables.go#L595

This causes the OpenShift node log to always show:
Nov 07 15:01:18 ose-node1.bmeng.local atomic-openshift-node[97540]: I1107 15:01:18.899238   97540 iptables.go:371] running iptables-restore [--wait=2 --noflush --counters]
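
For context, a hypothetical sketch (illustrative names, not the actual kube code) of building the log line from the same argument slice that gets executed, so the logged command cannot drift from the real wait value the way the hardcoded "--wait=2" string does:

package main

import (
	"fmt"
	"os/exec"
)

// restoreWaitSeconds is an assumed value; the point is that it is read in
// one place rather than baked into a logging string.
const restoreWaitSeconds = 5

// restoreArgs is the single source of truth for the flags; logging and exec
// both use it, so the logged command always matches what actually runs.
func restoreArgs() []string {
	return []string{
		fmt.Sprintf("--wait=%d", restoreWaitSeconds),
		"--noflush",
		"--counters",
	}
}

func main() {
	args := restoreArgs()
	fmt.Printf("running iptables-restore %v\n", args)
	out, err := exec.Command("iptables-restore", args...).CombinedOutput()
	if err != nil {
		fmt.Printf("iptables-restore failed: %v\n%s", err, out)
	}
}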
Comment 7 Ben Bennett 2017-11-07 11:03:43 EST
Thanks Meng Bo.

Kube PR to correct that -- https://github.com/kubernetes/kubernetes/pull/55248

Will backport shortly.
Comment 8 Ben Bennett 2017-11-07 12:44:46 EST
The Origin PR is https://github.com/openshift/origin/pull/17222
Comment 9 Xiaoli Tian 2017-11-08 22:33:33 EST
Please test it on build 3.7.4-1 or a newer version.
Comment 10 Meng Bo 2017-11-09 05:59:49 EST
Nov 09 18:58:53 ose-node2.bmeng.local atomic-openshift-node[25845]: I1109 18:58:52.989607   25845 iptables.go:371] running iptables-restore [-w5 --noflush --counters]

Verified on OCP v3.7.4-1.
Comment 13 errata-xmlrpc 2017-11-28 17:19:38 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
