Description of problem:

- After updating from OCP 3.6 to OCP 3.7, the OpenShift master controllers process constantly consumes high CPU.
- At any given time, only one of the masters shows high CPU utilization.
- The master holding the active controller role always has the high CPU usage. When another master takes over the active controller role (after a restart of the atomic-openshift-master-controllers service), CPU usage spikes on that node and drops on the former active master.
- "mvcc: required revision has been compacted" messages appear in the logs; however, they do not appear to be related to the high CPU usage:

[...]
atomic-openshift-master-controllers[121072]: W0202 13:10:33.934019  121072 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72: watch of *rbac.Role ended with: etcdserver: mvcc: required revision has been compacted
atomic-openshift-master-controllers[121072]: W0202 13:11:15.941157  121072 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/externalversions/factory.go:72: watch of *v1.StorageClass ended with: etcdserver: mvcc: required revision has been compacted
[...]

- Captured perf and pprof data: the openshift master controllers process uses around 52% CPU, while around 45% of CPU is idle.
- Most of the time is spent in wait.JitterUntil and futex-related calls.
Kernel version: 3.10.0-693.11.6.el7
OpenShift version: 3.7.23-1

From perf data:
---------------

Samples: 238K of event 'cpu-clock', Event count (approx.): 59744250000
  Children      Self  Command
+   51.73%    51.73%  openshift
+   45.06%    45.06%  swapper
+    1.39%     1.39%  etcd
     0.49%     0.49%  fluentd

Expanding openshift process:

Samples: 238K of event 'cpu-clock', Event count (approx.): 59744250000
-   51.73%    51.73%  openshift
   - 46.20% runtime.goexit
      - 24.61% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
         - 24.48% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
            - 10.96% runtime.selectgo
               - 10.32% runtime.selectgoImpl
                  - 8.67% runtime.mcall
                     - 8.61% runtime.park_m
                        - 8.40% runtime.schedule
                           - 7.85% runtime.findrunnable
                              - 5.95% runtime.stopm
                                 - 5.49% runtime.notesleep
                                    - 5.21% runtime.futex
                                       - 4.51% system_call_fastpath
                                          - 4.48% sys_futex
                                             - 4.42% do_futex
                                                - 4.36% futex_wait
                                                   - 3.19% futex_wait_queue_me
                                                      - 3.04% schedule
                                                         - 3.03% __schedule
                                                              2.71% finish_task_switch
                                                   + 0.84% futex_wait_setup
                              + 0.89% runtime.runqsteal
            - 8.25% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.resetOrReuseTimer
               - 7.95% time.(*Timer).Reset
                  - 7.67% time.startTimer
                     + 7.63% runtime.addtimer
            + 4.61% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
      - 18.49% runtime.timerproc
         - 11.85% runtime.notetsleepg
            - 5.28% runtime.notetsleep_internal
               - 4.79% runtime.futex
                  - 4.14% system_call_fastpath
                     - 4.09% sys_futex
                        - 3.91% do_futex
                           - 3.82% futex_wait
                              + 2.56% futex_wait_queue_me
                              + 0.68% futex_wait_setup
            + 4.98% runtime.systemstack
            + 1.03% runtime.exitsyscall
         + 5.41% time.sendTime
   + 2.24% 0x7f8f7b193eb0
   + 1.68% runtime.futex
        0.67% 0x5acacf0

From pprof data:
----------------

(pprof) top
15.08s of 26.54s total (56.82%)
Dropped 232 nodes (cum <= 0.13s)
Showing top 10 nodes out of 80 (cum >= 5.30s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     21.08s 79.43%  runtime.goexit
    14.53s 54.75% 54.75%     14.53s 54.75%  runtime.futex
     0.10s  0.38% 55.12%     11.08s 41.75%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
         0     0% 55.12%     11.08s 41.75%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
     0.18s  0.68% 55.80%      9.83s 37.04%  runtime.timerproc
     0.07s  0.26% 56.07%      9.48s 35.72%  runtime.notewakeup
     0.02s 0.075% 56.14%      9.45s 35.61%  runtime.futexwakeup
     0.11s  0.41% 56.56%      8.19s 30.86%  runtime.systemstack
     0.02s 0.075% 56.63%      5.33s 20.08%  runtime.notetsleepg
     0.05s  0.19% 56.82%      5.30s 19.97%  runtime.startm

(pprof) peek
23.55s of 26.54s total (88.73%)
Dropped 232 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
         0     0%     0%     21.08s 79.43%                | runtime.goexit
                                            11.05s 52.92% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
                                             9.83s 47.08% |   runtime.timerproc
----------------------------------------------------------+-------------
                                             9.43s 64.90% |   runtime.futexwakeup
                                             5.10s 35.10% |   runtime.futexsleep
    14.53s 54.75% 54.75%     14.53s 54.75%                | runtime.futex
----------------------------------------------------------+-------------
                                            11.08s   100% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
     0.10s  0.38% 55.12%     11.08s 41.75%                | github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
                                             4.71s 43.21% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.resetOrReuseTimer
                                             3.97s 36.42% |   runtime.selectgo
                                             2.22s 20.37% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
----------------------------------------------------------+-------------
                                            11.05s   100% |   runtime.goexit
         0     0% 55.12%     11.08s 41.75%                | github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
                                            11.08s   100% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
----------------------------------------------------------+-------------
                                             9.83s   100% |   runtime.goexit
     0.18s  0.68% 55.80%      9.83s 37.04%                | runtime.timerproc
                                             5.33s 55.29% |   runtime.notetsleepg
                                             3.90s 40.46% |   time.sendTime
                                             0.22s  2.28% |   runtime.unlock
                                             0.14s  1.45% |   runtime.siftdownTimer
                                             0.05s  0.52% |   runtime.lock
----------------------------------------------------------+-------------
                                             5.17s 54.54% |   runtime.startm
                                             4.26s 44.94% |   runtime.addtimerLocked
                                             0.05s  0.53% |   runtime.exitsyscallfast_pidle
     0.07s  0.26% 56.07%      9.48s 35.72%                | runtime.notewakeup
                                             9.41s   100% |   runtime.futexwakeup
----------------------------------------------------------+-------------
                                             9.41s 99.58% |   runtime.notewakeup
                                             0.04s  0.42% |   runtime.unlock
     0.02s 0.075% 56.14%      9.45s 35.61%                | runtime.futexwakeup
                                             9.43s   100% |   runtime.futex
----------------------------------------------------------+-------------
From checking the logs, it looks like the problem has been addressed in the latest version of OpenShift (3.9). See the attached BZs for the exact details of what has been fixed. Moving this bug to ON_QA.
Moving to VERIFIED according to the attached fixed bugs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2013