Bug 1546097

Summary: Master controllers are using high amount of CPU after upgrade to 3.7
Product: OpenShift Container Platform Reporter: Sunil Choudhary <schoudha>
Component: MasterAssignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA QA Contact: Wang Haoran <haowang>
Severity: high Docs Contact:
Priority: high    
Version: 3.7.0CC: andcosta, aos-bugs, bmchugh, jokerman, maszulik, mmccomas, per.carlson
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 18:01:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sunil Choudhary 2018-02-16 10:09:01 UTC
Description of problem:

- After updating from OCP 3.6 to OCP 3.7, OpenShift Master Controllers 
  processes is constantly consuming high CPU.
- At a time only one of the master have high CPU utilization.
- The master with active controller role always have high CPU usage. 
  And when any other master take active controller role (after restart of 
  masters-controllers service), the CPU usage on that master node spikes 
  and CPU usage drops on former master which had active controller role.
- I see "mvcc: required revision has been compacted" messages in logs. 
  However it does not appear to be related to the high CPU usage?

[...]
atomic-openshift-master-controllers[121072]: W0202 13:10:33.934019  121072 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/p
    kg/client/informers/informers_generated/internalversion/factory.go:72: watch of *rbac.Role ended with: etcdserver: mvcc: required revision has been compacted
atomic-openshift-master-controllers[121072]: W0202 13:11:15.941157  121072 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/p
    kg/client/informers/informers_generated/externalversions/factory.go:72: watch of *v1.StorageClass ended with: etcdserver: mvcc: required revision has been compacted
[...]

- Captured perf, pprof data and see openshift master controller using 
  around 52% CPU and around 45% CPU is idle.
- I see most of the time is being spent in 'wait.JitterUntil' and futex related calls.

Kernel version 3.10.0-693.11.6.el7
OpenShift version 3.7.23-1


From perf data:
--------------
Samples: 238K of event 'cpu-clock', Event count (approx.): 59744250000
  Children      Self  Command
+   51.73%    51.73%  openshift
+   45.06%    45.06%  swapper
+    1.39%     1.39%  etcd
     0.49%     0.49%  fluentd

Expanding openshift process:

Samples: 238K of event 'cpu-clock', Event count (approx.): 59744250000
  Children      Self  Command                                                                                   
-   51.73%    51.73%  openshift                                                                                 
   - 46.20% runtime.goexit                                         
      - 24.61% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until                        
         - 24.48% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil               
            - 10.96% runtime.selectgo                                                                            
               - 10.32% runtime.selectgoImpl                                                                     
                  - 8.67% runtime.mcall                                                                           
                     - 8.61% runtime.park_m                                                                       
                        - 8.40% runtime.schedule                                                             
                           - 7.85% runtime.findrunnable                                                       
                              - 5.95% runtime.stopm                                                            
                                 - 5.49% runtime.notesleep                                                     
                                    - 5.21% runtime.futex                                                     
                                       - 4.51% system_call_fastpath                                           
                                          - 4.48% sys_futex                                                   
                                             - 4.42% do_futex                                                 
                                                - 4.36% futex_wait                                           
                                                   - 3.19% futex_wait_queue_me                                
                                                      - 3.04% schedule                                      
                                                         - 3.03% __schedule                                   
                                                              2.71% finish_task_switch                       
                                                   + 0.84% futex_wait_setup                                  
                              + 0.89% runtime.runqsteal                                                      
            - 8.25% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.resetOrReuseTimer   
               - 7.95% time.(*Timer).Reset                                                                   
                  - 7.67% time.startTimer                                                                    
                     + 7.63% runtime.addtimer                                                                
            + 4.61% github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1   
      - 18.49% runtime.timerproc                                                           
         - 11.85% runtime.notetsleepg                                                                        
            - 5.28% runtime.notetsleep_internal                                                              
               - 4.79% runtime.futex                                                                          
                  - 4.14% system_call_fastpath                                                                
                     - 4.09% sys_futex                                                                       
                        - 3.91% do_futex                                                                       
                           - 3.82% futex_wait                                                                   
                              + 2.56% futex_wait_queue_me                                                        
                              + 0.68% futex_wait_setup                                                            
            + 4.98% runtime.systemstack                                                                           
            + 1.03% runtime.exitsyscall                                                                           
         + 5.41% time.sendTime                                                                                    
   + 2.24% 0x7f8f7b193eb0                                                                                         
   + 1.68% runtime.futex                                                                                           
     0.67% 0x5acacf0

From pprof data:
---------------
(pprof) top
15.08s of 26.54s total (56.82%)
Dropped 232 nodes (cum <= 0.13s)
Showing top 10 nodes out of 80 (cum >= 5.30s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     21.08s 79.43%  runtime.goexit
    14.53s 54.75% 54.75%     14.53s 54.75%  runtime.futex
     0.10s  0.38% 55.12%     11.08s 41.75%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
         0     0% 55.12%     11.08s 41.75%  github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
     0.18s  0.68% 55.80%      9.83s 37.04%  runtime.timerproc
     0.07s  0.26% 56.07%      9.48s 35.72%  runtime.notewakeup
     0.02s 0.075% 56.14%      9.45s 35.61%  runtime.futexwakeup
     0.11s  0.41% 56.56%      8.19s 30.86%  runtime.systemstack
     0.02s 0.075% 56.63%      5.33s 20.08%  runtime.notetsleepg
     0.05s  0.19% 56.82%      5.30s 19.97%  runtime.startm

(pprof) peek
23.55s of 26.54s total (88.73%)
Dropped 232 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
         0     0%     0%     21.08s 79.43%                | runtime.goexit
                                            11.05s 52.92% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
                                             9.83s 47.08% |   runtime.timerproc
----------------------------------------------------------+-------------
                                             9.43s 64.90% |   runtime.futexwakeup
                                             5.10s 35.10% |   runtime.futexsleep
    14.53s 54.75% 54.75%     14.53s 54.75%                | runtime.futex
----------------------------------------------------------+-------------
                                            11.08s   100% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
     0.10s  0.38% 55.12%     11.08s 41.75%                | github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
                                             4.71s 43.21% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.resetOrReuseTimer
                                             3.97s 36.42% |   runtime.selectgo
                                             2.22s 20.37% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
----------------------------------------------------------+-------------
                                            11.05s   100% |   runtime.goexit
         0     0% 55.12%     11.08s 41.75%                | github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
                                            11.08s   100% |   github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
----------------------------------------------------------+-------------
                                             9.83s   100% |   runtime.goexit
     0.18s  0.68% 55.80%      9.83s 37.04%                | runtime.timerproc
                                             5.33s 55.29% |   runtime.notetsleepg
                                             3.90s 40.46% |   time.sendTime
                                             0.22s  2.28% |   runtime.unlock
                                             0.14s  1.45% |   runtime.siftdownTimer
                                             0.05s  0.52% |   runtime.lock
----------------------------------------------------------+-------------
                                             5.17s 54.54% |   runtime.startm
                                             4.26s 44.94% |   runtime.addtimerLocked
                                             0.05s  0.53% |   runtime.exitsyscallfast_pidle
     0.07s  0.26% 56.07%      9.48s 35.72%                | runtime.notewakeup
                                             9.41s   100% |   runtime.futexwakeup
----------------------------------------------------------+-------------
                                             9.41s 99.58% |   runtime.notewakeup
                                             0.04s  0.42% |   runtime.unlock
     0.02s 0.075% 56.14%      9.45s 35.61%                | runtime.futexwakeup
                                             9.43s   100% |   runtime.futex

Comment 19 Maciej Szulik 2018-04-05 11:22:59 UTC
From checking the logs it looks like the problem has been addressed in the latest version of openshift (3.9). See the attached BZs for the exact details what has been fixed. Moving this bug to on-qa.

Comment 20 Wang Haoran 2018-04-08 10:43:02 UTC
Move to verify acorrding to the attached fixed bugs.

Comment 22 errata-xmlrpc 2018-06-27 18:01:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2013