Bug 1506396 - Increase iptables-restore timeout
Summary: Increase iptables-restore timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assignee: Rajat Chopra
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-25 21:45 UTC by Eric Paris
Modified: 2017-11-28 22:19 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: This is tuning rather than a new feature. Reason: The previous iptables-restore wait value was not suitable when two operations ran at the same time. Result: A longer wait time for iptables operations, so that they finish cleanly, in order, and without failures.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:19:38 UTC
Target Upstream Version:
Embargoed:




Links
Origin (GitHub) PR 17062 - 2017-10-30 12:55:30 UTC
Origin (GitHub) PR 17222 - 2017-11-07 17:44:46 UTC
Red Hat Product Errata RHSA-2017:3188 (SHIPPED_LIVE) - Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update - 2017-11-29 02:34:54 UTC

Description Eric Paris 2017-10-25 21:45:14 UTC
iptables-restore has a 2s wait timeout. Data collected today shows that even with a much faster kernel we can reasonably expect iptables-restore to take upwards of 2.4 seconds. (With the unpatched/released RHEL kernel this can easily take 7-8 seconds.)

The longest runs I saw over about 30 minutes were (in seconds):
2.267244
2.284707
2.291535
2.376457

If we get two iptables-restore operations going at the same time, a 2s timeout makes it very likely that the second will fail.

I'd like to suggest a 5s timeout. It still bounds the number of threads we may be waiting on, and it increases the likelihood that this common situation resolves automatically instead of failing up the stack.
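
For illustration, here is a minimal Go sketch of the shape of the change (hypothetical helper names, not the actual kube-proxy source; the flag spelling follows the node logs quoted later in this bug):

    // Hypothetical sketch: pass a configurable --wait to iptables-restore
    // instead of the hardcoded 2 seconds.
    package main

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // restoreWaitSeconds is the proposed lock timeout: iptables-restore
    // fails if it cannot take the xtables lock within this many seconds.
    const restoreWaitSeconds = 5

    func runRestore(rules []byte) error {
        args := []string{fmt.Sprintf("--wait=%d", restoreWaitSeconds), "--noflush", "--counters"}
        cmd := exec.Command("iptables-restore", args...)
        cmd.Stdin = bytes.NewReader(rules)
        if out, err := cmd.CombinedOutput(); err != nil {
            return fmt.Errorf("iptables-restore %v failed: %v: %s", args, err, out)
        }
        return nil
    }

    func main() {
        // A no-op rule set; the real caller feeds the full generated table.
        if err := runRestore([]byte("*filter\nCOMMIT\n")); err != nil {
            fmt.Println(err)
        }
    }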

Thoughts?

Comment 1 Rajat Chopra 2017-10-26 22:18:06 UTC
@eparis: Is that what we want? https://github.com/openshift/origin/pull/17062

Comment 2 Dan Winship 2017-10-27 17:58:20 UTC
dcbw had indicated that some of the time that iptables-restore takes is just parsing the very large number of rules. Unfortunately, it looks like it grabs the lock *before* parsing, rather than *after*, so it's staying locked longer than it needs to. We should fix that.
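
For intuition, the ordering difference looks roughly like this Go sketch (an illustration only; iptables itself is C, and parse/commit are hypothetical stand-ins):

    // Illustration of the lock-ordering issue described above.
    package main

    import (
        "strings"
        "sync"
    )

    var xtablesLock sync.Mutex // stands in for the xtables run lock

    func parse(input []byte) []string { return strings.Split(string(input), "\n") }
    func commit(rules []string)       {} // stands in for pushing rules to the kernel

    // What iptables-restore apparently does: a very large rule set is
    // parsed while the lock is held, so every other caller waits on it.
    func restoreLockFirst(input []byte) {
        xtablesLock.Lock()
        defer xtablesLock.Unlock()
        commit(parse(input))
    }

    // The cheaper ordering: parse outside the lock, hold it only to commit.
    func restoreParseFirst(input []byte) {
        rules := parse(input)
        xtablesLock.Lock()
        defer xtablesLock.Unlock()
        commit(rules)
    }

    func main() {
        restoreLockFirst([]byte("*filter\nCOMMIT\n"))
        restoreParseFirst([]byte("*filter\nCOMMIT\n"))
    }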

Comment 3 Eric Paris 2017-10-27 18:15:22 UTC
We can file a RHEL BZ for that, I guess; I'll do so. But seeing 2.4s (not waiting for the lock), I think a 5s timeout makes sense.

Comment 4 Dan Winship 2017-10-27 18:34:36 UTC
Actually, it looks like fixing it would be pretty hard, so maybe don't bother.

Comment 6 Meng Bo 2017-11-07 07:02:03 UTC
There is a hardcoded string in iptables.go:
https://github.com/rajatchopra/kubernetes/blob/c5740a37379aa4905c9505082212610a1ac022c6/pkg/util/iptables/iptables.go#L595

which causes the OpenShift node log to always show:
Nov 07 15:01:18 ose-node1.bmeng.local atomic-openshift-node[97540]: I1107 15:01:18.899238   97540 iptables.go:371] running iptables-restore [--wait=2 --noflush --counters]
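
For what it's worth, a hypothetical Go reduction of what the hardcoded string means here (illustrative names, not the actual kube code; the two outputs match the log lines in this bug):

    // Hypothetical reduction of the hardcoded-string bug: the wait value
    // was baked into a literal instead of coming from configuration.
    package main

    import "fmt"

    func buggyArgs() []string {
        // Hardcoded: always 2 seconds, no matter what was configured.
        return []string{"--wait=2", "--noflush", "--counters"}
    }

    func fixedArgs(waitFlag string) []string {
        // The fix: thread the configured flag through (see the Kube PR below).
        return append([]string{waitFlag}, "--noflush", "--counters")
    }

    func main() {
        fmt.Println(buggyArgs())      // [--wait=2 --noflush --counters]
        fmt.Println(fixedArgs("-w5")) // [-w5 --noflush --counters], as verified below
    }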

Comment 7 Ben Bennett 2017-11-07 16:03:43 UTC
Thanks Meng Bo.

Kube PR to correct that -- https://github.com/kubernetes/kubernetes/pull/55248

Will backport shortly.

Comment 8 Ben Bennett 2017-11-07 17:44:46 UTC
The Origin PR is https://github.com/openshift/origin/pull/17222

Comment 9 Xiaoli Tian 2017-11-09 03:33:33 UTC
Please test it on build 3.7.4-1 or a newer version.

Comment 10 Meng Bo 2017-11-09 10:59:49 UTC
Nov 09 18:58:53 ose-node2.bmeng.local atomic-openshift-node[25845]: I1109 18:58:52.989607   25845 iptables.go:371] running iptables-restore [-w5 --noflush --counters]

Verified on OCP v3.7.4-1.

Comment 13 errata-xmlrpc 2017-11-28 22:19:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

