Bug 1314495 - Updating to latest RHEL 7.2 Kernel causes higher load with Openshift Node process, time spent in regexp too high.
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.1.1
Assigned To: Andy Goldstein
QA Contact: DeShuai Ma
Keywords: Reopened
Depends On:
Blocks: OSOPS_V3
Reported: 2016-03-03 13:46 EST by Ryan Howe
Modified: 2016-03-29 16:31 EDT (History)
CC List: 11 users

See Also:
Fixed In Version: atomic-openshift-
Doc Type: Bug Fix
Doc Text:
Cause: cadvisor was collecting network stats for processes it did not need to monitor. Consequence: increased CPU utilization could occur. Fix: cadvisor now collects network stats only for relevant processes. Result: significantly decreased CPU utilization.
Story Points: ---
Clone Of:
Last Closed: 2016-03-24 11:53:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

None
Description Ryan Howe 2016-03-03 13:46:15 EST
Description of problem: 

After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher.

cadvisor: time spent in regexp too high

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Run top and see the openshift process consuming high CPU
2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. review cpu.prof

Actual results:

Higher than expected CPU load, with too much time spent in regexp

Expected results:

  less time spent in regexp
Comment 3 Andy Goldstein 2016-03-04 09:58:10 EST
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.
Comment 4 Andy Goldstein 2016-03-07 10:52:00 EST
We believe the performance improvements in 3.2 should significantly reduce the CPU load.

Ryan, is it ok to close this?
Comment 5 Ryan Howe 2016-03-08 15:40:10 EST
That is fine with me but do we have any tracker that will confirm 3.2 reduces CPU load?
Comment 6 Andy Goldstein 2016-03-08 15:44:35 EST
Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.
Comment 7 Andy Goldstein 2016-03-14 12:32:36 EDT
Reopening as this is a definite bug. Performing root cause analysis now.
Comment 9 Andy Goldstein 2016-03-14 14:26:32 EDT
At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second.

I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.
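To illustrate the hot path described above, here is a simplified sketch (not cadvisor's actual code) of regex-parsing the contents of /proc/&lt;pid&gt;/net/dev; multiply this work by one goroutine per cgroup, once a second, to see why it dominates the CPU profile:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// IfaceStats holds the two counters we pull per interface.
type IfaceStats struct {
	RxBytes, TxBytes uint64
}

// netDevLine matches one interface line of /proc/net/dev, e.g.
// "  eth0: 3250026 12345 0 0 0 0 0 0 185040 1525 0 0 0 0 0 0".
var netDevLine = regexp.MustCompile(`^\s*([^:\s]+):\s*(.*)$`)

// parseNetDev extracts per-interface rx/tx byte counters from the
// text of /proc/<pid>/net/dev.
func parseNetDev(contents string) map[string]IfaceStats {
	stats := make(map[string]IfaceStats)
	for _, line := range strings.Split(contents, "\n") {
		m := netDevLine.FindStringSubmatch(line)
		if m == nil {
			continue // the two header lines have no colon and don't match
		}
		fields := strings.Fields(m[2])
		if len(fields) < 16 {
			continue
		}
		rx, _ := strconv.ParseUint(fields[0], 10, 64) // receive bytes
		tx, _ := strconv.ParseUint(fields[8], 10, 64) // transmit bytes
		stats[m[1]] = IfaceStats{RxBytes: rx, TxBytes: tx}
	}
	return stats
}

const sampleNetDev = `Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 3250026   12345    0    0    0     0          0         0   185040    1525    0    0    0     0       0          0
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0`

func main() {
	// The buggy pattern: every cgroup's housekeeping goroutine re-reads
	// and re-parses the same file each tick. Simulate one tick for 200
	// cgroups.
	for i := 0; i < 200; i++ {
		_ = parseNetDev(sampleNetDev)
	}
	stats := parseNetDev(sampleNetDev)
	fmt.Printf("eth0 rx=%d tx=%d\n", stats["eth0"].RxBytes, stats["eth0"].TxBytes)
}
```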
Comment 10 Timothy St. Clair 2016-03-14 14:33:50 EDT
xref: https://github.com/google/cadvisor/pull/1051
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.

I'll re-eval master to see if there is leakage.
Comment 11 Scott Dodson 2016-03-14 15:52:42 EDT
https://github.com/kubernetes/kubernetes/pull/18178, in particular this change https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.
Comment 12 Andy Goldstein 2016-03-14 15:54:23 EDT
Upstream issue: https://github.com/google/cadvisor/issues/1156

Potential fix: https://github.com/google/cadvisor/pull/1158
Comment 15 DeShuai Ma 2016-03-18 06:21:41 EDT
Verified this bug on atomic-openshift-
[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

openshift process cpu usage is not high.
Comment 17 errata-xmlrpc 2016-03-24 11:53:54 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

