Bug 1314495

Summary: Updating to latest RHEL 7.2 kernel causes higher load with OpenShift node process; time spent in regexp too high.
Product: OpenShift Container Platform
Reporter: Ryan Howe <rhowe>
Component: Node
Assignee: Andy Goldstein <agoldste>
Status: CLOSED ERRATA
QA Contact: DeShuai Ma <dma>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: agrimm, aos-bugs, haowang, jeder, jokerman, mmccomas, rhowe, sdodson, sten, tstclair, vrutkovs
Target Milestone: ---
Keywords: Reopened
Target Release: 3.1.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos
Doc Type: Bug Fix
Doc Text:
  Cause: cadvisor was improperly collecting network stats for processes it did not need to gather stats for.
  Consequence: Increased CPU utilization may have occurred.
  Fix: cadvisor now only collects network stats for relevant processes.
  Result: Significantly decreased CPU utilization.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-24 15:53:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1303130

Description Ryan Howe 2016-03-03 18:46:15 UTC
Description of problem: 

After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher.

cadvisor time spent in regexp is too high.

Version-Release number of selected component (if applicable): 3.1.1.6

Steps to Reproduce:
1. Run top and see the openshift process running high on CPU.
   
2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. review cpu.prof

Actual results:

Higher than wanted CPU load and too much time spent in regexp.

Expected results:

  less time spent in regexp

Comment 3 Andy Goldstein 2016-03-04 14:58:10 UTC
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.

Comment 4 Andy Goldstein 2016-03-07 15:52:00 UTC
We believe the performance improvements in 3.2 should significantly reduce the CPU load.

Ryan, is it ok to close this?

Comment 5 Ryan Howe 2016-03-08 20:40:10 UTC
That is fine with me but do we have any tracker that will confirm 3.2 reduces CPU load?

Comment 6 Andy Goldstein 2016-03-08 20:44:35 UTC
Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.

Comment 7 Andy Goldstein 2016-03-14 16:32:36 UTC
Reopening as this is a definite bug in 3.1.1.6. Performing root cause analysis now.

Comment 9 Andy Goldstein 2016-03-14 18:26:32 UTC
At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second.
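
To make the per-pass cost concrete, here is a minimal Go sketch of reading content in the /proc/<pid>/net/dev layout. It is not cadvisor's code and not the change from the upstream pull request; `parseNetDev`, `ifaceStats`, and `sampleNetDev` are hypothetical names, and the counter values are made up. It uses plain field splitting rather than regexes, purely as an illustration of the work involved in one pass. Whatever the parsing strategy, repeating such a pass once per cgroup every second multiplies the cost by the number of cgroups.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleNetDev mimics the layout of /proc/<pid>/net/dev.
// The counter values are invented for this example.
const sampleNetDev = `Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  123456     789    0    0    0     0          0         0   654321     987    0    0    0     0       0          0
`

// ifaceStats holds a subset of the per-interface counters (hypothetical type).
type ifaceStats struct {
	RxBytes, RxPackets, TxBytes, TxPackets uint64
}

// parseNetDev extracts per-interface counters from net/dev-style content
// using string splitting only. Hypothetical helper, not cadvisor's parser.
func parseNetDev(content string) (map[string]ifaceStats, error) {
	stats := make(map[string]ifaceStats)
	for _, line := range strings.Split(content, "\n") {
		idx := strings.Index(line, ":")
		if idx < 0 {
			continue // the two header lines and blank lines have no "iface:" prefix
		}
		name := strings.TrimSpace(line[:idx])
		fields := strings.Fields(line[idx+1:])
		if len(fields) < 16 {
			return nil, fmt.Errorf("interface %q: want 16 counters, got %d", name, len(fields))
		}
		counters := make([]uint64, 16)
		for i := range counters {
			v, err := strconv.ParseUint(fields[i], 10, 64)
			if err != nil {
				return nil, fmt.Errorf("interface %q, column %d: %v", name, i, err)
			}
			counters[i] = v
		}
		// Columns 0-7 are receive counters, columns 8-15 are transmit counters.
		stats[name] = ifaceStats{
			RxBytes:   counters[0],
			RxPackets: counters[1],
			TxBytes:   counters[8],
			TxPackets: counters[9],
		}
	}
	return stats, nil
}

func main() {
	stats, err := parseNetDev(sampleNetDev)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("eth0: %+v\n", stats["eth0"])
}
```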

I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.
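
One rough way to reason about a suspected goroutine leak is to compare the live goroutine count against the number of collectors that should exist. The Go sketch below is an assumption-laden illustration, not cadvisor's housekeeping code: `startHousekeeping`, `waitForGoroutines`, and `measureHousekeeping` are hypothetical names, and the ticker loop only stands in for real stats collection. It starts one goroutine per container, checks how many were added via `runtime.NumGoroutine`, then stops them and waits for the count to return to its baseline. A count that stays above the baseline after the stop would point at a leak.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startHousekeeping launches n goroutines that wake on a ticker until stop
// is closed. Hypothetical stand-in for per-container stats collection.
func startHousekeeping(n int, interval time.Duration, stop <-chan struct{}) {
	for i := 0; i < n; i++ {
		go func() {
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for {
				select {
				case <-ticker.C:
					// A real collector would gather stats here.
				case <-stop:
					return
				}
			}
		}()
	}
}

// waitForGoroutines polls until the live goroutine count is at most limit
// or the timeout expires, and reports whether the limit was reached.
func waitForGoroutines(limit int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= limit {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() <= limit
}

// measureHousekeeping reports how many goroutines n collectors add and
// whether the count returns to the baseline after they are stopped.
func measureHousekeeping(n int) (added int, cleanedUp bool) {
	before := runtime.NumGoroutine()
	stop := make(chan struct{})
	startHousekeeping(n, time.Second, stop)
	added = runtime.NumGoroutine() - before
	close(stop)
	cleanedUp = waitForGoroutines(before, 2*time.Second)
	return added, cleanedUp
}

func main() {
	// 24 matches the pod count mentioned above, one container each.
	added, cleanedUp := measureHousekeeping(24)
	fmt.Println("goroutines added:", added)
	fmt.Println("back to baseline after stop:", cleanedUp)
}
```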

Comment 10 Timothy St. Clair 2016-03-14 18:33:50 UTC
xref: https://github.com/google/cadvisor/pull/1051
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.

I'll re-eval master to see if there is leakage.

Comment 11 Scott Dodson 2016-03-14 19:52:42 UTC
https://github.com/kubernetes/kubernetes/pull/18178, and in particular this change, https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.

Comment 12 Andy Goldstein 2016-03-14 19:54:23 UTC
Upstream issue: https://github.com/google/cadvisor/issues/1156

Potential fix: https://github.com/google/cadvisor/pull/1158

Comment 15 DeShuai Ma 2016-03-18 10:21:41 UTC
Verified this bug on atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos.x86_64
[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

openshift process cpu usage is not high.

Comment 17 errata-xmlrpc 2016-03-24 15:53:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0510