Description of problem:
After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher. The time cadvisor spends in regexp is too high.

Version-Release number of selected component (if applicable):
3.1.1.6

Steps to Reproduce:
1. Run top and see the openshift process running high on CPU
2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. Review cpu.prof

Actual results:
Higher than wanted CPU load and time spent in regexp

Expected results:
Less time spent in regexp
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.
We believe the performance improvements in 3.2 should significantly reduce the CPU load. Ryan, is it ok to close this?
That is fine with me but do we have any tracker that will confirm 3.2 reduces CPU load?
Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.
Reopening as this is a definite bug in 3.1.1.6. Performing root cause analysis now.
At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second. I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.
xref:
https://github.com/google/cadvisor/pull/1051
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.
I'll re-evaluate master to see if there is leakage.
https://github.com/kubernetes/kubernetes/pull/18178: in particular, this change, https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.
Upstream issue: https://github.com/google/cadvisor/issues/1156
Potential fix: https://github.com/google/cadvisor/pull/1158
Verified this bug on atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos.x86_64:
[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
The openshift process CPU usage is not high.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0510