Created attachment 1212392 [details]
pidstat-stdout

Description of problem:

On a 210-node OpenShift cluster with 19,000 pods scheduled across the nodes, iptables and iptables-restore are the biggest CPU consumers on the OpenShift nodes (and on the masters too).

Version-Release number of selected component (if applicable):

kernel-3.10.0-512.el7.x86_64

OpenShift v3.3 packages:
atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

How reproducible:

Create the biggest OpenShift cluster you can and load it with the maximum number of pods, e.g. 200 nodes and ~20k pods.
The test environment consisted of:

212 OpenShift nodes
3 OpenShift masters
3 OpenShift etcd servers
1 router node

for a total of 219 OpenShift machines and 19k OpenShift pods.

Once the cluster is up and running and the pods are loaded, observe the iptables/iptables-restore processes.

This cluster also had:

1) 819 services

# oc get svc --all-namespaces | wc -l
819

2) 277 projects/namespaces

# oc get ns | wc -l
277

Steps to Reproduce:

See above.

Actual results:

On one of the OpenShift nodes (the results below are from a node where 100 pods were scheduled):

# oadm manage-node dhcp8-234.example.net --list-pods | wc -l
Listing matched pods on node: dhcp8-234.example.net
100
# ssh dhcp8-234.example.net
# iptables-save > iptables-save.txt
# du -h iptables-save.txt
25M     iptables-save.txt
# cat iptables-save.txt | wc -l
226259

The iptables-save output from the above-mentioned node is attached. Also attached is the output of the command below, run on an OpenShift node for 10 minutes:

# /usr/local/bin/pidstat -l -w -u -h -d -r -p ALL 3 > pidstat-stdout.txt

Expected results:

Additional info:

Attached: pidstat-stdout.txt, iptables-save
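Some back-of-envelope arithmetic on the figures above puts the per-service cost in perspective. This is only an estimate (rules are not spread evenly across services), using the numbers reported in this comment:

```python
# Rough per-service ruleset size, from the figures reported above.
# The average is an estimate; actual rule counts vary per service.
total_lines = 226_259   # lines in iptables-save on one node
services = 819          # `oc get svc --all-namespaces | wc -l`

lines_per_service = total_lines / services
print(f"~{lines_per_service:.0f} iptables-save lines per service")  # → ~276
```

At that scale, every full reload of the ruleset has to parse and write a 25 MB dump, which matches the sustained iptables/iptables-restore CPU usage seen in pidstat.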
Created attachment 1212393 [details] iptables-save from openshift node
Created attachment 1212394 [details] cpu_usage-average
There is ongoing work to clean up the way the service proxy uses iptables so that it scales better. Right now the implementation of the iptables changes is pretty naive: it wipes and reloads all the rules to make even the tiniest change. We are tracking: - https://github.com/kubernetes/kubernetes/issues/14099 - https://github.com/kubernetes/kubernetes/issues/33693 - https://github.com/kubernetes/kubernetes/issues/26637 And Dan Williams was investigating an iptables performance discrepancy between our kernels and what other people were seeing on their kernels.
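The wipe-and-reload behavior described above can be sketched with a toy cost model (this is not kube-proxy code; the churn and per-change rule counts are hypothetical, while the total ruleset size is the one observed on the node in this report):

```python
# Toy cost model for full-table reload vs. incremental iptables updates.
# Not kube-proxy code; illustrates why reloading everything scales badly.

def full_reload_cost(total_rules: int, changes: int) -> int:
    """Each change flushes and rewrites the entire ruleset."""
    return changes * total_rules

def incremental_cost(rules_per_change: int, changes: int) -> int:
    """Each change touches only the rules it actually affects."""
    return changes * rules_per_change

total_rules = 226_259    # ruleset size observed on one node above
changes = 100            # hypothetical pod/endpoint churn events
rules_per_change = 10    # hypothetical rules affected per event

print(full_reload_cost(total_rules, changes))       # → 22625900
print(incremental_cost(rules_per_change, changes))  # → 1000
```

The gap between the two numbers is why the upstream issues above focus on making the proxy's sync loop incremental rather than rewriting the whole table on every change.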
I don't think we're going to get resolution on these items until 1.5
*** Bug 1371971 has been marked as a duplicate of this bug. ***
Upstream PR - https://github.com/kubernetes/kubernetes/pull/35334 addresses part of this.
Follow up PR is here: https://github.com/kubernetes/kubernetes/pull/37726
*** Bug 1442676 has been marked as a duplicate of this bug. ***
https://github.com/kubernetes/kubernetes/pull/40868 has been closed upstream in favor of a different approach in https://github.com/kubernetes/kubernetes/pull/41022, which was already merged for Kubernetes 1.6. There have also been a number of other optimizations in the iptables proxy which will help things, but most of those landed in kube 1.7. Some could be backported to OpenShift 3.6 or 3.5.
Moving to MODIFIED since https://github.com/kubernetes/kubernetes/pull/41022 is in OpenShift master now. In 3.5 you can also set iptablesMinSyncPeriod in the node config to limit how often iptables is refreshed.
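As a sketch of how that might look in the node configuration (the exact stanza and key placement depend on the OpenShift 3.5 node-config schema, so treat this fragment as an assumption to verify against the product documentation):

```yaml
# /etc/origin/node/node-config.yaml (fragment; placement is an assumption)
proxyArguments:
  iptables-min-sync-period:
  - "30s"    # don't resync iptables more often than every 30s
```

A larger value reduces iptables CPU usage but delays how quickly service endpoint changes are reflected in the rules, so it trades CPU for reaction time.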
Hi Ben, is there a recommended value for iptablesMinSyncPeriod that would be ideal for the nodes?
It really depends on the cluster: how many nodes, pods, and services there are; how often pods come and go; and how tolerant the applications are of a service connection failure.
Marking this as verified for 3.6 for the improvements to date. Follow-up studies of detailed iptables behavior will be queued up. iptables-min-sync-period and other improvements have reduced iptables CPU usage. Further improvements may be required based on further investigation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days