Description of problem:

Customer has hundreds of services running in a cluster, and iptables rule syncing sometimes incurs huge delays. Example from the logs using image v3.11.59:

I0122 20:26:31.085850 108392 proxier.go:625] syncProxyRules took 3.8148758s
I0122 20:27:38.678897 108392 proxier.go:625] syncProxyRules took 7.582426266s
I0122 20:28:23.699460 108392 proxier.go:341] syncProxyRules took 14m14.428487423s ------------------> Long time above normal
I0122 20:28:31.727444 108392 proxier.go:625] syncProxyRules took 8.027682494s
I0122 20:28:47.932200 108392 proxier.go:625] syncProxyRules took 6.532942321s
I0122 20:29:54.420936 108392 proxier.go:625] syncProxyRules took 6.481457251s

Tried the latest sdn image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

I0127 17:43:26.625409 46399 proxier.go:631] syncProxyRules took 29.056µs
I0127 17:43:26.625449 46399 proxier.go:348] userspace syncProxyRules took 23.439µs
I0127 17:43:26.736930 46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 17:43:29.409150 46399 proxier.go:631] syncProxyRules took 2.681231793s
I0127 17:43:30.814736 46399 proxier.go:631] syncProxyRules took 1.405132858s
I0127 18:24:57.995657 46399 proxier.go:348] userspace syncProxyRules took 41m31.267520332s
I0127 18:35:36.311966 46399 proxier.go:348] userspace syncProxyRules took 52m5.4970637s
I0127 18:35:36.312017 46399 proxy.go:336] hybrid proxy: syncProxyRules finished
I0127 18:35:36.312048 46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 18:35:39.285465 46399 proxier.go:631] syncProxyRules took 2.973370888s

This sometimes leads to errors like the following from the atomic-openshift-node service:

Error adding network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
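For anyone triaging similar clusters, a quick way to spot the outlier sync times is to filter the durations out of the node logs. This is only an illustrative sketch (the sample lines below are copied from the report); in practice you would pipe in `journalctl -u atomic-openshift-node` instead of a temp file.

```shell
# Write a few sample log lines (taken from this report) to a file.
cat <<'EOF' > /tmp/sync.log
I0122 20:26:31.085850 108392 proxier.go:625] syncProxyRules took 3.8148758s
I0122 20:28:23.699460 108392 proxier.go:341] syncProxyRules took 14m14.428487423s
I0127 18:35:36.311966 46399 proxier.go:348] userspace syncProxyRules took 52m5.4970637s
EOF

# Pull out each duration and flag any sync whose duration contains a
# minutes ("m" between digits) or hours ("h") component as SLOW.
grep -o 'syncProxyRules took .*' /tmp/sync.log \
  | awk '$3 ~ /[0-9](m[0-9]|h)/ { print "SLOW:", $3; next }
         { print "ok:", $3 }'
```

With the sample lines above this prints one "ok" entry and flags the two multi-minute syncs as SLOW.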
Current relevant configuration:
iptablesSyncPeriod: 1m
iptables-min-sync-period: 90s

Version-Release number of selected component (if applicable):
OCP 3.11.98
SDN Container Image: ose-node:v3.11.161

How reproducible:
Intermittent; couldn't reliably reproduce.
> Tried to use the latest sdn image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

You can't just mix and match pieces.

> iptablesSyncPeriod: 1m
> iptables-min-sync-period: 90s

Contrary to what some docs say, you really shouldn't change min-sync-period, and you can freely raise iptablesSyncPeriod arbitrarily high. Try bumping it to "1h" and see if that helps.

v3.11.59 is quite old. There have been many performance fixes since then. If a much higher iptablesSyncPeriod doesn't fix things, then I would recommend that the customer upgrade to a more recent release.
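For reference, the advice above would look roughly like this in the node configuration. This is an illustrative fragment assuming the standard 3.11 node-config.yaml layout (the exact file path and surrounding keys may differ per deployment):

```yaml
# /etc/origin/node/node-config.yaml (fragment, illustrative)
# Raise the periodic full-resync interval as suggested above.
iptablesSyncPeriod: "1h"
proxyArguments:
  # The min-sync-period override is the setting recommended above to
  # leave unset rather than tune; shown here commented out.
  # iptables-min-sync-period: ["90s"]
```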
*** Bug 1801744 has been marked as a duplicate of this bug. ***
@mike, could you help reproduce this issue and verify this bug?
*** Bug 1835440 has been marked as a duplicate of this bug. ***
We have also been observing this issue with OSE v3.11.88.
Even after updating the nodes to 3.11.200, the issue is still the same.
@zhanqi I don't think this bug requires a scalability environment to verify, just a cluster with hundreds of services. We will not have any large-scale 3.11 cluster again. @anbhat Any advice on how to verify this bug?
@jtanenba Could you give some advice on verifying this performance issue? Do we need a cluster with many nodes or hundreds of services?
I tried to come up with a way to test the 4.3 backport, but it didn't work: https://bugzilla.redhat.com/show_bug.cgi?id=1801737#c4. It's possible that that test _would_ cause problems with un-patched 3.11 though, because the 3.11 code had more performance problems to begin with than the original 4.3 code. It's less important to test that the performance problems are fixed though, and more important just to make sure that the PR didn't break any other services / kube-proxy / iptables functionality. I assume plenty of tests related to that will get run as part of the 3.11.z release process.
Thanks Dan. Ran regression testing of service-related functionality on version 3.11.286; no issues found. Moving this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.286 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3695
*** Bug 1932651 has been marked as a duplicate of this bug. ***