Red Hat Bugzilla – Bug 1314495
Updating to latest RHEL 7.2 kernel causes higher load with OpenShift node process, time spent in regexp too high.
Last modified: 2016-03-29 16:31:43 EDT
Description of problem:
After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher.
cadvisor time spent in regexp is too high.
Version-Release number of selected component (if applicable): 22.214.171.124
Steps to Reproduce:
1. run top and see openshift process running high on cpu
2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. review cpu.prof
Actual results: higher than wanted CPU load and time spent in regexp
Expected results: less time spent in regexp
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.
We believe the performance improvements in 3.2 should significantly reduce the CPU load.
Ryan, is it ok to close this?
That is fine with me but do we have any tracker that will confirm 3.2 reduces CPU load?
Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.
Reopening as this is a definite bug in 126.96.36.199. Performing root cause analysis now.
At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second.
I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.
I'll re-eval master to see if there is leakage.
https://github.com/kubernetes/kubernetes/pull/18178 - in particular, this change, https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.
Upstream issue: https://github.com/google/cadvisor/issues/1156
Potential fix: https://github.com/google/cadvisor/pull/1158
Verified this bug on atomic-openshift-188.8.131.52-4.git.28.0d526e5.el7aos.x86_64
[root@openshift-127 ~]# openshift version
openshift process cpu usage is not high.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.