Belated update from Friday: it looks like the problem is:

1. They have so many iptables rules that even read-only operations like "iptables -C" take a long time, because of iptables API awfulness.
2. Because of oddities in RHEL backporting and k8s iptables feature detection, OCP on RHEL 7 decides that "iptables-restore" supports "--wait=2", but that "iptables" only supports "--wait" (i.e., wait forever).
3. So the random periodic /sbin/iptables resync calls end up holding the xtables lock long enough that kube-proxy's iptables-restore calls time out and fail (see the sketch after this comment).
4. Fix: bump iptablesSyncPeriod up to something ridiculously high.

This should hopefully get the customer's cluster stable enough that they can proceed with their upgrade plans. We should look into having this work better out of the box, in 3.11 at least.
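To make the failure mode concrete, here's a hypothetical standalone sketch (not the actual kube-proxy or resync code; the rule, the restore input, and the timing are made up, and it needs root plus a real iptables install to actually run) of how a block-forever iptables call starves a bounded-wait iptables-restore:

package main

// Hypothetical illustration of the race described above: a long-running
// /sbin/iptables call queues on the xtables lock with a plain --wait
// (block indefinitely), while a concurrent iptables-restore invoked with
// --wait=2 gives up after two seconds and fails.

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Stand-in for one of the periodic resync calls: with a huge rule
	// set even a read-only check like "iptables -C" is slow, and the
	// bare --wait means it waits on the xtables lock forever.
	check := exec.Command("/sbin/iptables", "--wait", "-C", "INPUT", "-j", "ACCEPT")
	go check.Run() // let it grab the lock first

	// Stand-in for kube-proxy's bulk sync: --wait=2 bounds the wait, so
	// it times out while the slow call above still holds the lock.
	restore := exec.Command("iptables-restore", "--wait=2")
	restore.Stdin = strings.NewReader("*filter\nCOMMIT\n")
	if out, err := restore.CombinedOutput(); err != nil {
		fmt.Printf("iptables-restore failed (lock contention): %v\n%s", err, out)
	}
}

With enough rules, the plain-iptables side effectively never releases the lock in time, so every restore attempt loses within its 2-second window.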
So in the ose enterprise code we should set WaitSecondsMinVersion = "1.4.21" instead of the current "1.4.22", so that we make use of the --wait=seconds feature, which was backported to RHEL 7's iptables via bug 1438597: https://github.com/openshift/ose/blob/enterprise-3.11/vendor/k8s.io/kubernetes/pkg/util/iptables/iptables.go#L127
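For reference, here's a loose sketch of the version gate involved (the real logic lives in the iptables.go linked above and uses the Kubernetes version utilities; parseVer/atLeast below are self-contained stand-ins), showing why lowering WaitSecondsMinVersion to "1.4.21" makes RHEL 7's iptables 1.4.21 take the bounded --wait=seconds path:

package main

// Loose sketch of the wait-flag selection in pkg/util/iptables.
// Lowering WaitSecondsMinVersion to "1.4.21" makes RHEL 7's backported
// binary take the bounded --wait=seconds path instead of the
// block-forever --wait path.

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	WaitMinVersion        = "1.4.20" // plain --wait supported from here on
	WaitSecondsMinVersion = "1.4.21" // lowered from "1.4.22" for the RHEL 7 backport (bug 1438597)
)

// parseVer splits a version like "1.4.21" into comparable integer fields.
func parseVer(s string) (v [3]int) {
	for i, p := range strings.SplitN(s, ".", 3) {
		v[i], _ = strconv.Atoi(p)
	}
	return
}

// atLeast reports whether version "have" is >= version "want".
func atLeast(have, want string) bool {
	h, w := parseVer(have), parseVer(want)
	for i := range h {
		if h[i] != w[i] {
			return h[i] > w[i]
		}
	}
	return true
}

// waitFlag picks the wait arguments for a detected iptables version.
func waitFlag(detected string) []string {
	switch {
	case atLeast(detected, WaitSecondsMinVersion):
		return []string{"-w", "2"} // bounded: give up after 2 seconds
	case atLeast(detected, WaitMinVersion):
		return []string{"-w"} // unbounded: block on the xtables lock forever
	default:
		return nil // no wait support at all
	}
}

func main() {
	// RHEL 7 ships iptables 1.4.21; with the lowered constant it now
	// gets the bounded form, matching what iptables-restore already uses.
	fmt.Println(waitFlag("1.4.21")) // [-w 2]
}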
Although this is ON_QA, it still needs to be backported to 4.3, as confirmed with Dan Winship.
Sorry, yeah, this didn't merge until after 4.3 split off, so the bug should have been moved to 4.4. Ignore the comments from the errata system; it's lying.
And actually, there were two parts to the fix, one in origin and one in sdn, and only the origin half merged, so this isn't fully fixed even in 4.4.
This was merged in the 1.17 rebase.

*** This bug has been marked as a duplicate of bug 1803149 ***