Bug 1727441

Summary: kube-proxy periodic iptables reloads are extremely disruptive in large clusters
Product: OpenShift Container Platform Reporter: Miheer Salunke <misalunk>
Component: Networking Assignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: anusaxen, aos-bugs, cdc, danw, dcbw, dmoessne, erich, gferrazs, jcrumple, jnordell, msweiker, openshift-bugs-escalate, rbost, rhowe, ricarril, zhigwang
Version: 3.6.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1801737 1801742 1801743 1801744 Environment:
Last Closed: 2020-02-20 11:52:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1801743, 1801744    

Comment 17 Dan Winship 2019-07-15 14:55:43 UTC
Belated update from Friday: it looks like the problem is:

      1. They have so many iptables rules that even things like
         "iptables -C" take a long time because of iptables API
         awfulness, and:

      2. Because of oddities in RHEL backporting and k8s iptables
         feature detection, OCP on RHEL 7 decides that
         "iptables-restore" supports "--wait=2", but "iptables" only
         supports "--wait" (forever).

      3. So the random periodic /sbin/iptables resync calls end up causing
         kube-proxy's iptables-restore calls to time out and fail.

      4. Fix: bump iptablesSyncPeriod up to something ridiculously high

This should hopefully get the customer's cluster stable enough that they can progress with their upgrade plans. We should look into having this work better out-of-the-box in 3.11 at least.
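
For illustration only, here is a toy Go model of points 2 and 3 above. It is a sketch, not kube-proxy or iptables code: the lock type, the function names, and the millisecond timings are all hypothetical stand-ins. One caller takes a shared lock and holds it for a long time (like a slow "iptables" call), while a second caller will only wait a bounded time for it (like "iptables-restore" with "--wait=2") and so gives up.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// xtablesLock is a hypothetical stand-in for the shared lock that
// iptables commands contend on: a one-slot channel that is full
// while some caller holds the lock.
type xtablesLock chan struct{}

var errLockTimeout = errors.New("timed out waiting for lock")

// acquire waits up to `wait` for the lock. A negative wait means
// "wait forever", modelling a plain --wait with no timeout.
func (l xtablesLock) acquire(wait time.Duration) error {
	if wait < 0 {
		l <- struct{}{}
		return nil
	}
	timer := time.NewTimer(wait)
	defer timer.Stop()
	select {
	case l <- struct{}{}:
		return nil
	case <-timer.C:
		return errLockTimeout
	}
}

// release frees the lock for the next waiter.
func (l xtablesLock) release() { <-l }

// simulateMillis models one slow command that already holds the lock
// for holdMs milliseconds, and a second command that is willing to
// wait at most waitMs milliseconds for it. It returns the second
// command's result.
func simulateMillis(holdMs, waitMs int) error {
	lock := make(xtablesLock, 1)
	if err := lock.acquire(-1); err != nil {
		return err
	}
	go func() {
		time.Sleep(time.Duration(holdMs) * time.Millisecond)
		lock.release()
	}()

	err := lock.acquire(time.Duration(waitMs) * time.Millisecond)
	if err == nil {
		lock.release()
	}
	return err
}

func main() {
	// Slow holder, short bounded wait: the second command gives up.
	fmt.Println(simulateMillis(200, 20))
	// Fast holder: the same kind of bounded wait is enough.
	fmt.Println(simulateMillis(20, 200))
}
```

In this model, running the slow lock holder less often is the analogue of bumping iptablesSyncPeriod: the bounded waiter hits the timeout less often.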

Comment 18 Ryan Howe 2019-07-15 20:48:44 UTC
So in the ose enterprise code we should set the variable WaitSecondsMinVersion = "1.4.21" instead of the current value, "1.4.22", so that we make use of the --wait=seconds feature, since this was backported via bug 1438597.

https://github.com/openshift/ose/blob/enterprise-3.11/vendor/k8s.io/kubernetes/pkg/util/iptables/iptables.go#L127
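
To make the effect of that threshold concrete, here is a minimal Go sketch of version-gated flag selection. It is not the code at the link above: parseVersion, atLeast, and waitFlag are hypothetical helpers, and the returned strings simply mirror the "--wait" and "--wait=2" spellings used earlier in this bug. The only values taken from the comment are the two candidate thresholds, "1.4.22" and "1.4.21".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a string such as "1.4.21" into [1 4 21].
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version v is greater than or equal to min.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

// waitFlag is a hypothetical helper: it picks a bounded wait when the
// installed version meets the threshold, and an unbounded wait
// otherwise.
func waitFlag(installed, waitSecondsMin string) (string, error) {
	iv, err := parseVersion(installed)
	if err != nil {
		return "", err
	}
	mv, err := parseVersion(waitSecondsMin)
	if err != nil {
		return "", err
	}
	if atLeast(iv, mv) {
		return "--wait=2", nil
	}
	return "--wait", nil
}

func main() {
	// The same installed version, checked against each threshold.
	for _, min := range []string{"1.4.22", "1.4.21"} {
		flag, err := waitFlag("1.4.21", min)
		fmt.Println(min, flag, err)
	}
}
```

A check based purely on the version number cannot see a backported feature, which is why the threshold itself has to move in this sketch.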

Comment 30 Anurag saxena 2019-12-10 00:30:36 UTC
Although it's ON_QA, this needs to be backported to 4.3, as confirmed with Dan Winship.

Comment 31 Dan Winship 2019-12-10 13:11:27 UTC
Sorry, yeah, this didn't merge until after 4.3 split off so the bug should have been moved to 4.4. Ignore the comments from the errata system; it's lying.

Comment 32 Dan Winship 2019-12-10 13:13:27 UTC
And actually, there were two parts, one in origin and one in sdn, and only the origin half merged, so this isn't fully fixed even in 4.4.

Comment 34 Juan Luis de Sousa-Valadas 2020-02-20 11:52:55 UTC
This was merged in the 1.17 rebase.

*** This bug has been marked as a duplicate of bug 1803149 ***