Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1387149

Summary: Iptables service produced high load on openshift node/masters with many pods
Product: OpenShift Container Platform
Component: Networking
Version: 3.3.1
Hardware: x86_64
OS: Linux
Reporter: Elvir Kuric <ekuric>
Assignee: Dan Williams <dcbw>
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: OpsBlocker
Whiteboard: aos-scalability-34
Type: Bug
Last Closed: 2017-08-10 05:15:47 UTC
Bug Blocks: 1303130
CC: aloughla, aos-bugs, atragler, bbennett, bmeng, clichybi, dcbw, eparis, gsharma, jeder, jkaur, jokerman, mchappel, mifiedle, mleitner, mmccomas, rbost, smunilla, sukulkar, trankin, tstclair

Doc Type: Enhancement
Doc Text: Minor enhancements have been made to the iptables proxier to reduce node CPU usage when many pods and services exist.
Attachments:
  pidstat-stdout
  iptables-save from openshift node
  cpu_usage-average

Description Elvir Kuric 2016-10-20 09:17:44 UTC
Created attachment 1212392 [details]
pidstat-stdout

Description of problem:

On a 210-node OpenShift cluster with 19,000 pods scheduled across the nodes, it is noticeable on OpenShift nodes (and masters too) that iptables and iptables-restore are the biggest CPU consumers.


Version-Release number of selected component (if applicable):

kernel-3.10.0-512.el7.x86_64

OpenShift v3.3 packages 

atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

How reproducible:

Create the largest OpenShift cluster you can and load it with the maximum number of pods, e.g. 200 nodes and 20k pods or so.

This test environment consisted of:

212 openshift nodes 
3 openshift masters 
3 openshift etcd servers 
1 router node 

In total: 219 OpenShift machines and 19k OpenShift pods.

Once the cluster is up and running and the pods are loaded, observe the iptables/iptables-restore processes (a sample command is shown after the counts below).

On this cluster, there were also 

1) 819 services 

# oc get svc --all-namespaces  | wc -l
819 

2) 277 projects / namespaces 

# oc get ns | wc -l
277
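
As a quick way to confirm what is consuming CPU on a node, sampling per-process usage and filtering for the iptables binaries works; the command below is only an illustration (pidstat comes from the sysstat package, and the grep pattern just keeps the header line plus any iptables/iptables-restore entries):

# pidstat -u 3 10 | grep -E 'Command|iptables'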

Steps to Reproduce:
See above 

Actual results:

Example from one of the OpenShift nodes (these results are from an OpenShift node where 100 pods were scheduled):

# oadm manage-node dhcp8-234.example.net --list-pods | wc -l
Listing matched pods on node: dhcp8-234.example.net
100
# ssh dhcp8-234.example.net

# iptables-save > iptables-save.txt
#  du -h iptables-save.txt 
25M	iptables-save.txt

# cat iptables-save.txt | wc -l
226259
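
To get a feel for which chains account for the bulk of those ~226k lines, a quick breakdown of the dump by chain name can be done with standard text tools (illustrative only):

# awk '/^-A/ {print $2}' iptables-save.txt | sort | uniq -c | sort -rn | head

With 819 services in the cluster, the per-service KUBE-* chains generated by the iptables proxier would be expected to dominate this breakdown.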

The iptables-save.txt from the above-mentioned node is attached.

Also attached is the output of the command below, run on an OpenShift node for 10 minutes:

# /usr/local/bin/pidstat -l -w -u -h -d -r -p ALL 3

-> pidstat-stdout.txt 


Expected results:


Additional info:
attached pidstat-stdout.txt 
iptables-save

Comment 1 Elvir Kuric 2016-10-20 09:18:34 UTC
Created attachment 1212393 [details]
iptables-save from openshift node

Comment 2 Elvir Kuric 2016-10-20 09:20:04 UTC
Created attachment 1212394 [details]
cpu_usage-average

Comment 5 Ben Bennett 2016-10-28 15:50:43 UTC
There is work on cleaning up the way that the service proxy uses iptables to make it scale better.  Right now the implementation of the iptables changes is pretty stupid... it wipes and reloads all the rules to make the tiniest change.
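
As a simplified illustration of that "wipe and reload" pattern (hand-written example commands, not the actual proxier code; the chain name KUBE-SVC-EXAMPLE and the address are made up):

Full reload (even a one-rule change rewrites the whole NAT table):
# iptables-save -t nat > rules.txt
  ... edit one service rule in rules.txt ...
# iptables-restore -T nat < rules.txt

Incremental update (only the affected rule is touched):
# iptables -t nat -A KUBE-SERVICES -d 172.30.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE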

We are tracking:
- https://github.com/kubernetes/kubernetes/issues/14099
- https://github.com/kubernetes/kubernetes/issues/33693
- https://github.com/kubernetes/kubernetes/issues/26637

And Dan Williams was investigating an iptables performance discrepancy between our kernels and what other people were seeing on their kernels.

Comment 6 Timothy St. Clair 2016-10-28 15:57:34 UTC
I don't think we're going to get resolution on these items until 1.5

Comment 7 Timothy St. Clair 2016-10-28 16:02:29 UTC
*** Bug 1371971 has been marked as a duplicate of this bug. ***

Comment 8 Timothy St. Clair 2016-10-28 16:04:07 UTC
Upstream PR - https://github.com/kubernetes/kubernetes/pull/35334 addresses part of this.

Comment 9 Timothy St. Clair 2016-12-07 02:34:26 UTC
Follow up PR is here: https://github.com/kubernetes/kubernetes/pull/37726

Comment 19 Ben Bennett 2017-04-24 15:18:09 UTC
*** Bug 1442676 has been marked as a duplicate of this bug. ***

Comment 21 Dan Williams 2017-04-25 18:13:26 UTC
https://github.com/kubernetes/kubernetes/pull/40868 has been closed upstream in favor of a different approach in https://github.com/kubernetes/kubernetes/pull/41022, which was merged to Kubernetes 1.6 already.

There have also been a number of other optimizations in the iptables proxy which will help things, but most of those landed into kube 1.7.  Some could be backported to OpenShift 3.6 or 3.5.

Comment 23 Ben Bennett 2017-05-16 14:08:19 UTC
Moving to MODIFIED since https://github.com/kubernetes/kubernetes/pull/41022 is in OpenShift master now.

In 3.5 you can also set iptablesMinSyncPeriod in the node config to limit how often iptables is refreshed.
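
For reference, a node-config.yaml fragment along those lines might look like the following; the 30s value is only an example, and the exact key placement should be verified against the node configuration schema for the release in use (the equivalent upstream kube-proxy flag is --iptables-min-sync-period):

  iptablesMinSyncPeriod: 30s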

Comment 24 Jaspreet Kaur 2017-05-25 09:17:17 UTC
Hi Ben,

Is there a recommended value for iptablesMinSyncPeriod that would be ideal for the nodes?

Comment 25 Ben Bennett 2017-05-25 12:36:08 UTC
It really depends on the cluster.  How many nodes are there, how many pods, and how many services?  How often do pods come and go?  And how tolerant of a service connection failure are they?

Comment 36 Mike Fiedler 2017-07-21 19:29:05 UTC
Marking this as verified for 3.6 for the improvements to date.  Follow-up studies of detailed iptables behavior will be queued up.

iptables-min-sync-period and the other improvements have reduced iptables CPU usage.  Additional improvements may be required based on further investigation.

Comment 39 errata-xmlrpc 2017-08-10 05:15:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Comment 41 Red Hat Bugzilla 2023-09-14 03:33:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days