Created attachment 1212392 [details]
pidstat-stdout

Description of problem:

On a 210-node OpenShift cluster with 19,000 pods scheduled across the nodes, iptables and iptables-restore are the biggest CPU consumers on the OpenShift nodes (and on the masters too).

Version-Release number of selected component (if applicable):

kernel-3.10.0-512.el7.x86_64

OpenShift v3.3 packages:
atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

How reproducible:

Create the biggest OpenShift cluster you can and load it with the maximum number of pods, e.g. 200 nodes and ~20k pods.
The test environment consisted of:

212 OpenShift nodes
3 OpenShift masters
3 OpenShift etcd servers
1 router node

for a total of 219 OpenShift machines and 19k OpenShift pods.

Once the cluster is up and running and the pods are loaded, observe the iptables/iptables-restore processes.

This cluster also had:

1) 819 services

# oc get svc --all-namespaces | wc -l
819

2) 277 projects/namespaces

# oc get ns | wc -l
277

Steps to Reproduce:

See above.

Actual results:

On one of the OpenShift nodes (the results below are from a node where 100 pods were scheduled):

# oadm manage-node dhcp8-234.example.net --list-pods | wc -l
Listing matched pods on node: dhcp8-234.example.net
100
# ssh dhcp8-234.example.net
# iptables-save > iptables-save.txt
# du -h iptables-save.txt
25M     iptables-save.txt
# cat iptables-save.txt | wc -l
226259

The iptables-save output from the above-mentioned node is attached. Also attached is the output of the command below, run on an OpenShift node for 10 minutes:

# /usr/local/bin/pidstat -l -w -u -h -d -r -p ALL 3 > pidstat-stdout.txt

Expected results:

Additional info:

Attached: pidstat-stdout.txt, iptables-save
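Some back-of-envelope arithmetic on the figures above puts the per-service cost in perspective. This is only an estimate (rules are not spread evenly across services), using the numbers reported in this comment:

```python
# Rough per-service ruleset size, from the figures reported above.
# The average is an estimate; actual rule counts vary per service.
total_lines = 226_259   # lines in iptables-save on one node
services = 819          # `oc get svc --all-namespaces | wc -l`

lines_per_service = total_lines / services
print(f"~{lines_per_service:.0f} iptables-save lines per service")  # → ~276
```

At that scale, every full reload of the ruleset has to parse and write a 25 MB dump, which matches the sustained iptables/iptables-restore CPU usage seen in pidstat.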
Created attachment 1212393 [details] iptables-save from openshift node
Created attachment 1212394 [details] cpu_usage-average
There is ongoing work to clean up the way the service proxy uses iptables so that it scales better. Right now the implementation of the iptables changes is pretty naive: it wipes and reloads all the rules to make even the tiniest change. We are tracking: - https://github.com/kubernetes/kubernetes/issues/14099 - https://github.com/kubernetes/kubernetes/issues/33693 - https://github.com/kubernetes/kubernetes/issues/26637 And Dan Williams was investigating an iptables performance discrepancy between our kernels and what other people were seeing on their kernels.
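The wipe-and-reload behavior described above can be sketched with a toy cost model (this is not kube-proxy code; the churn and per-change rule counts are hypothetical, while the total ruleset size is the one observed on the node in this report):

```python
# Toy cost model for full-table reload vs. incremental iptables updates.
# Not kube-proxy code; illustrates why reloading everything scales badly.

def full_reload_cost(total_rules: int, changes: int) -> int:
    """Each change flushes and rewrites the entire ruleset."""
    return changes * total_rules

def incremental_cost(rules_per_change: int, changes: int) -> int:
    """Each change touches only the rules it actually affects."""
    return changes * rules_per_change

total_rules = 226_259    # ruleset size observed on one node above
changes = 100            # hypothetical pod/endpoint churn events
rules_per_change = 10    # hypothetical rules affected per event

print(full_reload_cost(total_rules, changes))       # → 22625900
print(incremental_cost(rules_per_change, changes))  # → 1000
```

The gap between the two numbers is why the upstream issues above focus on making the proxy's sync loop incremental rather than rewriting the whole table on every change.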
I don't think we're going to get resolution on these items until 1.5
*** Bug 1371971 has been marked as a duplicate of this bug. ***
Upstream PR - https://github.com/kubernetes/kubernetes/pull/35334 addresses part of this.
Follow up PR is here: https://github.com/kubernetes/kubernetes/pull/37726
*** Bug 1442676 has been marked as a duplicate of this bug. ***
https://github.com/kubernetes/kubernetes/pull/40868 has been closed upstream in favor of a different approach in https://github.com/kubernetes/kubernetes/pull/41022, which was already merged for Kubernetes 1.6. There have also been a number of other optimizations in the iptables proxy which will help things, but most of those landed in kube 1.7. Some could be backported to OpenShift 3.6 or 3.5.
Moving to MODIFIED since https://github.com/kubernetes/kubernetes/pull/41022 is in OpenShift master now. In 3.5 you can also set iptablesMinSyncPeriod in the node config to limit how often iptables is refreshed.
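As a sketch of how that might look in the node configuration (the exact stanza and key placement depend on the OpenShift 3.5 node-config schema, so treat this fragment as an assumption to verify against the product documentation):

```yaml
# /etc/origin/node/node-config.yaml (fragment; placement is an assumption)
proxyArguments:
  iptables-min-sync-period:
  - "30s"    # don't resync iptables more often than every 30s
```

A larger value reduces iptables CPU usage but delays how quickly service endpoint changes are reflected in the rules, so it trades CPU for reaction time.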
Hi Ben, is there a recommended value for iptablesMinSyncPeriod that would be ideal for the nodes?
It really depends on the cluster: how many nodes, pods, and services there are; how often pods come and go; and how tolerant the applications are of a service connection failure.
Marking this as verified for 3.6 for the improvements to date. Follow-up studies of detailed iptables behavior will be queued up. iptables-min-sync-period and other improvements have reduced iptables CPU usage. Further improvements may be required based on further investigation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days