Bug 1314495 - Updating to latest RHEL 7.2 kernel causes higher load with OpenShift Node process; time spent in regexp too high.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.1.1
Assignee:	Andy Goldstein
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	OSOPS_V3

Reported:	2016-03-03 18:46 UTC by Ryan Howe
Modified:	2019-11-14 07:32 UTC
CC List:	11 users
Fixed In Version:	atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos
Doc Type:	Bug Fix
Doc Text:	Cause: cadvisor was improperly collecting network stats for processes it did not need to gather stats for. Consequence: Increased CPU utilization may have occurred. Fix: cadvisor now only collects network stats for relevant processes. Result: Significantly decreased CPU utilization.
Clone Of:
Environment:
Last Closed:	2016-03-24 15:53:54 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:0510	0	normal	SHIPPED_LIVE	Red Hat OpenShift Enterprise bug fix update	2016-03-24 19:53:32 UTC

Description Ryan Howe 2016-03-03 18:46:15 UTC

Description of problem: 

After upgrading to the latest RHEL 7.2 kernel, 3.10.0-327.10.1.el7.x86_64, the load on the OSE3 nodes is much higher.

cadvisor spends too much time in regexp.

Version-Release number of selected component (if applicable): 3.1.1.6

Steps to Reproduce:
1. Run top and observe the openshift process using a high share of CPU.

2. echo "OPENSHIFT_PROFILE=cpu" >> /etc/sysconfig/atomic-openshift-node
3. systemctl restart atomic-openshift-node
4. Review cpu.prof.

Actual results:

Higher CPU load than desired, with too much time spent in regexp.

Expected results:

Less time spent in regexp.

Comment 3 Andy Goldstein 2016-03-04 14:58:10 UTC

I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar cpu profiling results.

Comment 4 Andy Goldstein 2016-03-07 15:52:00 UTC

We believe the performance improvements in 3.2 should significantly reduce the CPU load.

Ryan, is it ok to close this?

Comment 5 Ryan Howe 2016-03-08 20:40:10 UTC

That is fine with me, but do we have any tracker that will confirm 3.2 reduces CPU load?

Comment 6 Andy Goldstein 2016-03-08 20:44:35 UTC

Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.

Comment 7 Andy Goldstein 2016-03-14 16:32:36 UTC

Reopening as this is a definite bug in 3.1.1.6. Performing root cause analysis now.

Comment 9 Andy Goldstein 2016-03-14 18:26:32 UTC

At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second.

I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.

Comment 10 Timothy St. Clair 2016-03-14 18:33:50 UTC

xref: https://github.com/google/cadvisor/pull/1051
https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed.

I'll re-eval master to see if there is leakage.

Comment 11 Scott Dodson 2016-03-14 19:52:42 UTC

https://github.com/kubernetes/kubernetes/pull/18178, and in particular this change, https://github.com/google/cadvisor/pull/942, updated the regex bits that seem pretty hot.

Comment 12 Andy Goldstein 2016-03-14 19:54:23 UTC

Upstream issue: https://github.com/google/cadvisor/issues/1156

Potential fix: https://github.com/google/cadvisor/pull/1158

Comment 15 DeShuai Ma 2016-03-18 10:21:41 UTC

Verified this bug on atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos.x86_64
[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

CPU usage of the openshift process is not high.

Comment 17 errata-xmlrpc 2016-03-24 15:53:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0510
