| Summary: | Updating to latest RHEL 7.2 Kernel causes higher load with OpenShift Node process, time spent in regexp too high. | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
| Component: | Node | Assignee: | Andy Goldstein <agoldste> |
| Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.1.0 | CC: | agrimm, aos-bugs, haowang, jeder, jokerman, mmccomas, rhowe, sdodson, sten, tstclair, vrutkovs |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 3.1.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos | Doc Type: | Bug Fix |
| Doc Text: |
Cause: cadvisor was improperly collecting network stats for processes it did not need to gather stats for.
Consequence: Increased CPU utilization may have occurred.
Fix: cadvisor now only collects network stats for relevant processes.
Result: Significantly decreased CPU utilization.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-03-24 15:53:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1303130 | ||
|
Description
Ryan Howe
2016-03-03 18:46:15 UTC
I don't believe the kernel is the culprit. I downgraded to 3.10.0-327.4.5 and am still seeing similar CPU profiling results.

We believe the performance improvements in 3.2 should significantly reduce the CPU load. Ryan, is it ok to close this?

That is fine with me, but do we have any tracker that will confirm 3.2 reduces CPU load?

Jeremy & Tim & team have been performance testing 3.1 vs 3.2 and they have results that show a significant reduction in CPU load.

Reopening as this is a definite bug in 3.1.1.6. Performing root cause analysis now.

At least 1 issue is that cadvisor is spawning multiple goroutines (1 per cgroup) that are all opening /proc/$openshift_pid/net/dev, reading it, scanning through it, and running regexes against it. This happens once per cgroup, once a second. I have also seen a node that had ~1600 goroutines for cadvisor statistic collection (housekeeping). I'm not sure if there's also a goroutine leak, as that number seems way too high, especially when the node only had 24 pods each with 1 container.

xref: https://github.com/google/cadvisor/pull/1051

https://github.com/kubernetes/kubernetes/issues/19633 - but this one was fixed. I'll re-eval master to see if there is leakage.

https://github.com/kubernetes/kubernetes/pull/18178 in particular this change https://github.com/google/cadvisor/pull/942 updated the regex bits that seem pretty hot

Upstream issue: https://github.com/google/cadvisor/issues/1156

Potential fix: https://github.com/google/cadvisor/pull/1158

Verify this bug on atomic-openshift-3.1.1.6-4.git.28.0d526e5.el7aos.x86_64

[root@openshift-127 ~]# openshift version
openshift v3.1.1.6-29-g9a3b53e
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

openshift process cpu usage is not high.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0510 |
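
For illustration only: the root cause analysis above points at per-cgroup goroutines that repeatedly read /proc/$openshift_pid/net/dev and run regexes against it. The sketch below shows one way text in that layout can be parsed with plain field splitting instead of regexes. It is not the cadvisor code and not the change made in the linked pull requests. `parseNetDev`, `ifaceStats`, and `sampleNetDev` are hypothetical names, and the sample counters are invented for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// ifaceStats holds the two counters this sketch extracts per interface.
// Hypothetical type, not part of cadvisor.
type ifaceStats struct {
	Name    string
	RxBytes uint64
	TxBytes uint64
}

// parseNetDev parses text in the /proc/<pid>/net/dev layout using
// strings.Fields rather than regular expressions.
func parseNetDev(content string) ([]ifaceStats, error) {
	var stats []ifaceStats
	scanner := bufio.NewScanner(strings.NewReader(content))
	for lineNo := 0; scanner.Scan(); lineNo++ {
		if lineNo < 2 {
			continue // the first two lines are column headers
		}
		line := scanner.Text()
		colon := strings.Index(line, ":")
		if colon < 0 {
			continue // not an interface line
		}
		name := strings.TrimSpace(line[:colon])
		// 8 receive counters followed by 8 transmit counters.
		fields := strings.Fields(line[colon+1:])
		if len(fields) < 16 {
			return nil, fmt.Errorf("interface %q: expected 16 counters, got %d", name, len(fields))
		}
		rx, err := strconv.ParseUint(fields[0], 10, 64) // receive bytes
		if err != nil {
			return nil, err
		}
		tx, err := strconv.ParseUint(fields[8], 10, 64) // transmit bytes
		if err != nil {
			return nil, err
		}
		stats = append(stats, ifaceStats{Name: name, RxBytes: rx, TxBytes: tx})
	}
	return stats, scanner.Err()
}

// Hypothetical sample data, not taken from the bug report.
const sampleNetDev = `Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1200      10    0    0    0     0          0         0     1200      10    0    0    0     0       0          0
  eth0:  987654     321    0    0    0     0          0         0   123456     300    0    0    0     0       0          0
`

func main() {
	stats, err := parseNetDev(sampleNetDev)
	if err != nil {
		panic(err)
	}
	for _, s := range stats {
		fmt.Printf("%s rx=%d tx=%d\n", s.Name, s.RxBytes, s.TxBytes)
	}
}
```

Field splitting only lowers the per-read cost. Per the Doc Text, the shipped fix was to collect network stats only for relevant processes, which reduces how often the file is read in the first place.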