Bug 1314495 - Updating to latest RHEL 7.2 Kernel causes higher load with OpenShift Node process; time spent in regexp too high
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.1.1
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3
Reported: 2016-03-03 18:46 UTC by Ryan Howe
Modified: 2019-11-14 07:32 UTC (History)

Fixed In Version: atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos
Doc Type: Bug Fix
Doc Text:
Cause: cadvisor was improperly collecting network stats for processes it did not need to gather stats for. Consequence: Increased CPU utilization may have occurred. Fix: cadvisor now only collects network stats for relevant processes. Result: Significantly decreased CPU utilization.
Clone Of:
Environment:
Last Closed: 2016-03-24 15:53:54 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2016:0510
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise bug fix update
Last Updated: 2016-03-24 19:53:32 UTC

Description Ryan Howe 2016-03-03 18:46:15 UTC
Description of problem: 

After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher.

cadvisor spends too much time in regexp.

Version-Release number of selected component (if applicable): 3.1.1.6

Steps to Reproduce:
1. Run top and observe the openshift process using a high share of CPU.
2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. Review cpu.prof.

Actual results:

 Higher CPU load than expected, with much of the time spent in regexp.

Expected results:

  less time spent in regexp

Comment 3 Andy Goldstein 2016-03-04 14:58:10 UTC
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.

Comment 4 Andy Goldstein 2016-03-07 15:52:00 UTC
We believe the performance improvements in 3.2 should significantly reduce the CPU load.

Ryan, is it ok to close this?

Comment 5 Ryan Howe 2016-03-08 20:40:10 UTC
That is fine with me but do we have any tracker that will confirm 3.2 reduces CPU load?

Comment 6 Andy Goldstein 2016-03-08 20:44:35 UTC
Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.

Comment 7 Andy Goldstein 2016-03-14 16:32:36 UTC
Reopening as this is a definite bug in 3.1.1.6. Performing root cause analysis now.

Comment 9 Andy Goldstein 2016-03-14 18:26:32 UTC
At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second.
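
To make the cost of that pattern concrete, here is a minimal Go sketch that parses the contents of a /proc/<pid>/net/dev file using plain string splitting rather than regexes. It is not cadvisor's code and does not represent any upstream fix. The type ifaceStats, the function parseNetDev, and the sample data are hypothetical and exist only for this example. The file layout assumed here is the usual one: two header lines, then one line per interface with 8 receive counters followed by 8 transmit counters.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// ifaceStats holds the two counters this sketch extracts per interface.
type ifaceStats struct {
	Name    string
	RxBytes uint64
	TxBytes uint64
}

// parseNetDev parses the text of a /proc/<pid>/net/dev file without
// regular expressions. The first two lines are column headers; each
// later line is "iface: <8 receive counters> <8 transmit counters>".
func parseNetDev(content string) ([]ifaceStats, error) {
	var stats []ifaceStats
	scanner := bufio.NewScanner(strings.NewReader(content))
	for lineNo := 0; scanner.Scan(); lineNo++ {
		if lineNo < 2 {
			continue // skip the two header lines
		}
		line := scanner.Text()
		colon := strings.Index(line, ":")
		if colon < 0 {
			continue // not an interface line
		}
		name := strings.TrimSpace(line[:colon])
		fields := strings.Fields(line[colon+1:])
		if len(fields) < 16 {
			return nil, fmt.Errorf("unexpected field count %d for %q", len(fields), name)
		}
		rx, err := strconv.ParseUint(fields[0], 10, 64) // receive bytes
		if err != nil {
			return nil, err
		}
		tx, err := strconv.ParseUint(fields[8], 10, 64) // transmit bytes
		if err != nil {
			return nil, err
		}
		stats = append(stats, ifaceStats{Name: name, RxBytes: rx, TxBytes: tx})
	}
	return stats, scanner.Err()
}

// sample is made-up data in the /proc/net/dev layout.
const sample = `Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  123456     789    0    0    0     0          0         0    65432     321    0    0    0     0       0          0
`

func main() {
	stats, err := parseNetDev(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range stats {
		fmt.Printf("%s rx=%d tx=%d\n", s.Name, s.RxBytes, s.TxBytes)
	}
}
```

Because every cgroup in the scenario above reads the same file for the same PID, a result like this could in principle be computed once per interval and shared, instead of being recomputed once per cgroup.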

I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.
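
One general way to investigate a suspected goroutine leak in a Go process is to compare goroutine counts over time and dump the stacks to see where the goroutines were created. The sketch below uses only standard library calls (runtime.NumGoroutine and pprof.Lookup). The functions startHousekeepers and growth are hypothetical stand-ins invented for this example, and the count of 24 simply mirrors the number of pods mentioned above. This is not how the node in question was inspected.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// startHousekeepers launches n goroutines that block until stop is
// closed, standing in for per-container housekeeping loops.
func startHousekeepers(n int, stop <-chan struct{}) {
	for i := 0; i < n; i++ {
		go func() {
			<-stop
		}()
	}
}

// growth reports how many goroutines exist after start runs,
// relative to the count before it ran.
func growth(start func()) int {
	before := runtime.NumGoroutine()
	start()
	return runtime.NumGoroutine() - before
}

func main() {
	stop := make(chan struct{})
	defer close(stop)

	added := growth(func() { startHousekeepers(24, stop) })
	fmt.Printf("goroutines added: %d\n", added)

	// Dump goroutine stacks so their creators can be identified.
	// debug=1 produces a human-readable listing that groups
	// goroutines with identical stacks.
	if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 1); err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
	}
}
```

If the count keeps growing while the number of containers stays constant, the grouped stack listing shows which call site is responsible.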

Comment 10 Timothy St. Clair 2016-03-14 18:33:50 UTC
xref: https://github.com/google/cadvisor/pull/1051
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.

I'll re-eval master to see if there is leakage.

Comment 11 Scott Dodson 2016-03-14 19:52:42 UTC
https://github.com/kubernetes/kubernetes/pull/18178, and in particular the change in https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.

Comment 12 Andy Goldstein 2016-03-14 19:54:23 UTC
Upstream issue: https://github.com/google/cadvisor/issues/1156

Potential fix: https://github.com/google/cadvisor/pull/1158

Comment 15 DeShuai Ma 2016-03-18 10:21:41 UTC
Verified this bug on atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos.x86_64
[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

The openshift process CPU usage is no longer high.

Comment 17 errata-xmlrpc 2016-03-24 15:53:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0510

