Bug 1795416

Summary: iptables sync sometimes taking too long
Product: OpenShift Container Platform
Reporter: Hugo Cisneiros (Eitch) <hcisneir>
Component: Networking
Networking sub component: openshift-sdn
Assignee: Jacob Tanenbaum <jtanenba>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aconstan, aivaras.laimikis, alchan, anbhat, andbartl, apurty, bbennett, bfurtado, ckoep, danw, dyocum, erich, fpan, gbravi, jcrumple, jdesousa, jnordell, jtanenba, lstanton, openshift-bugs-escalate, osousa, rkhan, rsandu, sandeep.agarwal2, sdodson
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-09-16 07:46:49 UTC
Type: Bug

Description Hugo Cisneiros (Eitch) 2020-01-27 22:03:41 UTC
Description of problem:

The customer has hundreds of services running in a cluster, and the iptables rule sync sometimes takes an extremely long time. Example from the logs using image v3.11.59:

I0122 20:26:31.085850  108392 proxier.go:625] syncProxyRules took 3.8148758s
I0122 20:27:38.678897  108392 proxier.go:625] syncProxyRules took 7.582426266s
I0122 20:28:23.699460  108392 proxier.go:341] syncProxyRules took 14m14.428487423s  ------------------> Long time above normal
I0122 20:28:31.727444  108392 proxier.go:625] syncProxyRules took 8.027682494s
I0122 20:28:47.932200  108392 proxier.go:625] syncProxyRules took 6.532942321s
I0122 20:29:54.420936  108392 proxier.go:625] syncProxyRules took 6.481457251s

We tried the latest SDN image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

I0127 17:43:26.625409   46399 proxier.go:631] syncProxyRules took 29.056µs
I0127 17:43:26.625449   46399 proxier.go:348] userspace syncProxyRules took 23.439µs
I0127 17:43:26.736930   46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 17:43:29.409150   46399 proxier.go:631] syncProxyRules took 2.681231793s
I0127 17:43:30.814736   46399 proxier.go:631] syncProxyRules took 1.405132858s
I0127 18:24:57.995657   46399 proxier.go:348] userspace syncProxyRules took 41m31.267520332s
I0127 18:35:36.311966   46399 proxier.go:348] userspace syncProxyRules took 52m5.4970637s
I0127 18:35:36.312017   46399 proxy.go:336] hybrid proxy: syncProxyRules finished
I0127 18:35:36.312048   46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 18:35:39.285465   46399 proxier.go:631] syncProxyRules took 2.973370888s

This sometimes leads to errors like the following from the atomic-openshift-node service:

Error adding network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

Current relevant configuration:

iptablesSyncPeriod: 1m
iptables-min-sync-period: 90s
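
For context, in OCP 3.11 these settings typically live in /etc/origin/node/node-config.yaml; the sketch below shows a common layout and is an assumption, not a copy of the customer's file:

iptablesSyncPeriod: "1m"
proxyArguments:
  iptables-min-sync-period:
  - "90s"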

Version-Release number of selected component (if applicable):

OCP 3.11.98
SDN Container Image: ose-node:v3.11.161

How reproducible:

Intermittent; we couldn't reliably reproduce it.

Comment 4 Dan Winship 2020-02-10 17:07:02 UTC
> Tried to use the latest sdn image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

You can't just mix and match pieces (here, an OCP 3.11.98 cluster running the v3.11.161 ose-node image).

> iptablesSyncPeriod: 1m
> iptables-min-sync-period: 90s

Contrary to what some docs say, you really shouldn't change min-sync-period, and you can freely raise iptablesSyncPeriod arbitrarily high. Try bumping it to "1h" and see if that helps.
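
For reference, a hedged sketch of what that might look like in /etc/origin/node/node-config.yaml (the exact file layout here is an assumption, and "1h" is simply the value suggested above, not a tested setting):

iptablesSyncPeriod: "1h"
# Remove the iptables-min-sync-period entry under proxyArguments so the
# proxy falls back to its default minimum sync period.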


v3.11.59 is quite old. There have been many performance fixes since then. If a much higher iptablesSyncPeriod doesn't fix things, then I would recommend that the customer upgrade to a more recent release.

Comment 10 Juan Luis de Sousa-Valadas 2020-03-04 09:42:38 UTC
*** Bug 1801744 has been marked as a duplicate of this bug. ***

Comment 13 zhaozhanqi 2020-04-15 02:14:40 UTC
@mike, could you help reproduce this issue and verify this bug?

Comment 15 Ben Bennett 2020-05-20 13:19:35 UTC
*** Bug 1835440 has been marked as a duplicate of this bug. ***

Comment 19 Sandeep 2020-06-29 12:17:27 UTC
We have also been observing this issue with OSE v3.11.88.

Comment 21 Sandeep 2020-07-14 16:33:17 UTC
We even updated the nodes to 3.11.200, but the issue is still the same.

Comment 26 Mike Fiedler 2020-08-13 12:27:32 UTC
@zhanqi  I don't think this bug requires a scalability environment to verify, just a cluster with hundreds of services. We will not have any large-scale 3.11 clusters again.

@anbhat Advice on how to verify this bug?
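
One hedged sketch of setting up such a cluster (namespace, names, and counts below are illustrative, not taken from this bug):

# Create a few hundred ClusterIP services; even without backing pods they
# exercise syncProxyRules, since the proxy programs rules for every service.
oc new-project svc-scale-test
for i in $(seq 1 300); do
  oc -n svc-scale-test create service clusterip scale-svc-$i --tcp=80:8080
done

# Then watch the sync timings in an SDN pod's logs.
oc -n openshift-sdn logs <sdn-pod-name> | grep 'syncProxyRules took'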

Comment 33 zhaozhanqi 2020-09-14 07:24:09 UTC
@jtanenba
Could you give some advice on verifying this performance issue? Do we need a cluster with many nodes or hundreds of services?

Comment 34 Dan Winship 2020-09-14 11:28:04 UTC
I tried to come up with a way to test the 4.3 backport, but it didn't work: https://bugzilla.redhat.com/show_bug.cgi?id=1801737#c4. It's possible that that test _would_ cause problems with unpatched 3.11, though, because the 3.11 code had more performance problems to begin with than the original 4.3 code.

It's less important to test that the performance problems are fixed though, and more important just to make sure that the PR didn't break any other services / kube-proxy / iptables functionality. I assume plenty of tests related to that will get run as part of the 3.11.z release process.

Comment 35 zhaozhanqi 2020-09-15 02:24:42 UTC
Thanks, Dan. We ran service-related regression testing on version 3.11.286 and found no issues.
Moving this to VERIFIED.

Comment 37 errata-xmlrpc 2020-09-16 07:46:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.286 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3695

Comment 38 Ben Bennett 2021-02-25 17:22:06 UTC
*** Bug 1932651 has been marked as a duplicate of this bug. ***