Description of problem:

Customer has hundreds of services running in a cluster, and iptables rule syncing sometimes incurs huge delays. Example from the logs using image v3.11.59:

I0122 20:26:31.085850 108392 proxier.go:625] syncProxyRules took 3.8148758s
I0122 20:27:38.678897 108392 proxier.go:625] syncProxyRules took 7.582426266s
I0122 20:28:23.699460 108392 proxier.go:341] syncProxyRules took 14m14.428487423s ------------------> Long time above normal
I0122 20:28:31.727444 108392 proxier.go:625] syncProxyRules took 8.027682494s
I0122 20:28:47.932200 108392 proxier.go:625] syncProxyRules took 6.532942321s
I0122 20:29:54.420936 108392 proxier.go:625] syncProxyRules took 6.481457251s

Tried the latest sdn image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

I0127 17:43:26.625409 46399 proxier.go:631] syncProxyRules took 29.056µs
I0127 17:43:26.625449 46399 proxier.go:348] userspace syncProxyRules took 23.439µs
I0127 17:43:26.736930 46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 17:43:29.409150 46399 proxier.go:631] syncProxyRules took 2.681231793s
I0127 17:43:30.814736 46399 proxier.go:631] syncProxyRules took 1.405132858s
I0127 18:24:57.995657 46399 proxier.go:348] userspace syncProxyRules took 41m31.267520332s
I0127 18:35:36.311966 46399 proxier.go:348] userspace syncProxyRules took 52m5.4970637s
I0127 18:35:36.312017 46399 proxy.go:336] hybrid proxy: syncProxyRules finished
I0127 18:35:36.312048 46399 proxy.go:331] hybrid proxy: syncProxyRules start
I0127 18:35:39.285465 46399 proxier.go:631] syncProxyRules took 2.973370888s

This sometimes leads to errors like the following from the atomic-openshift-node service:

Error adding network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
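For anyone triaging similar clusters, a quick way to spot the outlier sync times is to filter the durations out of the node logs. This is only an illustrative sketch (the sample lines below are copied from the report); in practice you would pipe in `journalctl -u atomic-openshift-node` instead of a temp file.

```shell
# Write a few sample log lines (taken from this report) to a file.
cat <<'EOF' > /tmp/sync.log
I0122 20:26:31.085850 108392 proxier.go:625] syncProxyRules took 3.8148758s
I0122 20:28:23.699460 108392 proxier.go:341] syncProxyRules took 14m14.428487423s
I0127 18:35:36.311966 46399 proxier.go:348] userspace syncProxyRules took 52m5.4970637s
EOF

# Pull out each duration and flag any sync whose duration contains a
# minutes ("m" between digits) or hours ("h") component as SLOW.
grep -o 'syncProxyRules took .*' /tmp/sync.log \
  | awk '$3 ~ /[0-9](m[0-9]|h)/ { print "SLOW:", $3; next }
         { print "ok:", $3 }'
```

With the sample lines above this prints one "ok" entry and flags the two multi-minute syncs as SLOW.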
Current relevant configuration:
iptablesSyncPeriod: 1m
iptables-min-sync-period: 90s

Version-Release number of selected component (if applicable):
OCP 3.11.98
SDN Container Image: ose-node:v3.11.161

How reproducible:
Intermittent; couldn't reliably reproduce.
> Tried to use the latest sdn image version (registry.redhat.io/openshift3/ose-node:v3.11.161) and it got worse:

You can't just mix and match pieces.

> iptablesSyncPeriod: 1m
> iptables-min-sync-period: 90s

Contrary to what some docs say, you really shouldn't change min-sync-period, and you can freely raise iptablesSyncPeriod arbitrarily high. Try bumping it to "1h" and see if that helps.

v3.11.59 is quite old. There have been many performance fixes since then. If a much higher iptablesSyncPeriod doesn't fix things, then I would recommend that the customer upgrade to a more recent release.
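For reference, the advice above would look roughly like this in the node configuration. This is an illustrative fragment assuming the standard 3.11 node-config.yaml layout (the exact file path and surrounding keys may differ per deployment):

```yaml
# /etc/origin/node/node-config.yaml (fragment, illustrative)
# Raise the periodic full-resync interval as suggested above.
iptablesSyncPeriod: "1h"
proxyArguments:
  # The min-sync-period override is the setting recommended above to
  # leave unset rather than tune; shown here commented out.
  # iptables-min-sync-period: ["90s"]
```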
*** Bug 1801744 has been marked as a duplicate of this bug. ***
@mike, could you help reproduce this issue and verify this bug?
*** Bug 1835440 has been marked as a duplicate of this bug. ***
We have also been observing this issue with OSE v3.11.88.
Even after updating the nodes to 3.11.200, the issue is still the same.
@zhanqi I don't think this bug requires a scalability environment to verify, just a cluster with hundreds of services. We will not have any large-scale 3.11 cluster again. @anbhat Any advice on how to verify this bug?
@jtanenba Could you give some advice on verifying this performance issue? Do we need a cluster with many nodes or hundreds of services?
I tried to come up with a way to test the 4.3 backport, but it didn't work: https://bugzilla.redhat.com/show_bug.cgi?id=1801737#c4. It's possible that that test _would_ cause problems with un-patched 3.11 though, because the 3.11 code had more performance problems to begin with than the original 4.3 code. It's less important to test that the performance problems are fixed though, and more important just to make sure that the PR didn't break any other services / kube-proxy / iptables functionality. I assume plenty of tests related to that will get run as part of the 3.11.z release process.
Thanks Dan. Ran regression testing of service-related functionality on version 3.11.286; no issues found. Moving this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.286 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3695
*** Bug 1932651 has been marked as a duplicate of this bug. ***