Bug 1396031 - SuperVDSM gets 200% cpu [NEEDINFO]
Summary: SuperVDSM gets 200% cpu
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: vdsm
Classification: oVirt
Component: SuperVDSM
Version: 4.18.15.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Martin Sivák
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-17 09:59 UTC by Badalyan Vyacheslav
Modified: 2016-12-14 11:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 11:46:42 UTC
oVirt Team: SLA
msivak: needinfo? (v.badalyan)


Attachments
Logs (3.64 MB, application/zip)
2016-11-17 09:59 UTC, Badalyan Vyacheslav

Description Badalyan Vyacheslav 2016-11-17 09:59:28 UTC
Created attachment 1221519
Logs

SuperVDSM eats 200% CPU. A restart helps for only 15-30 seconds.

170861 root      15  -5  855416  37828  15808 S 200,0  0,0   6:41.95 supervdsmServer

Comment 1 Badalyan Vyacheslav 2016-11-17 10:02:28 UTC
Samples: 3K of event 'cycles:pp', Event count (approx.): 116210851800, Thread: supervdsmServer
Overhead  Command          Shared Object      Symbol
  86,04%  supervdsmServer  [kernel.kallsyms]  [k] gather_pte_stats
   6,51%  supervdsmServer  [kernel.kallsyms]  [k] gather_stats
   2,46%  supervdsmServer  [kernel.kallsyms]  [k] vm_normal_page
   0,06%  supervdsmServer  [kernel.kallsyms]  [k] native_write_msr
   0,03%  supervdsmServer  [kernel.kallsyms]  [k] sched_avg_update
   0,03%  supervdsmServer  [kernel.kallsyms]  [k] resched_curr
   0,01%  supervdsmServer  [kernel.kallsyms]  [k] __wake_up_bit
   0,00%  supervdsmServer  [kernel.kallsyms]  [k] ghes_proc_in_irq
   0,00%  supervdsmServer  [kernel.kallsyms]  [k] intel_bts_enable_local
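
For anyone comparing profiles between hosts, the rows above can be totalled with a short script. This is only a sketch: the helper names `parse_perf_rows` and `top_symbols_share` are made up for illustration, and it assumes the report uses decimal commas as in the paste above.

```python
import re

# Rows copied from the perf report above (this locale prints decimal commas).
PERF_ROWS = """\
  86,04%  supervdsmServer  [kernel.kallsyms]  [k] gather_pte_stats
   6,51%  supervdsmServer  [kernel.kallsyms]  [k] gather_stats
   2,46%  supervdsmServer  [kernel.kallsyms]  [k] vm_normal_page
   0,06%  supervdsmServer  [kernel.kallsyms]  [k] native_write_msr
"""

# overhead%, command, shared object, [k]/[.] marker, symbol
ROW_RE = re.compile(r"^\s*(\d+[.,]\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)")


def parse_perf_rows(text):
    """Return (overhead, command, shared_object, kind, symbol) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # skip headers and anything that is not a data row
        overhead, command, shared_object, kind, symbol = match.groups()
        # Normalise the decimal comma before converting to float.
        rows.append((float(overhead.replace(",", ".")),
                     command, shared_object, kind, symbol))
    return rows


def top_symbols_share(rows, n=3):
    """Sum the overhead of the n most expensive symbols."""
    ordered = sorted(rows, key=lambda row: row[0], reverse=True)
    return round(sum(row[0] for row in ordered[:n]), 2)


if __name__ == "__main__":
    print(top_symbols_share(parse_perf_rows(PERF_ROWS)))
```

With the rows pasted here, the three hottest kernel symbols account for roughly 95% of the samples in the supervdsmServer thread.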

Comment 2 Badalyan Vyacheslav 2016-11-17 10:06:14 UTC
[root@intel3 ~]# uname -a
Linux intel3.hosts.bigdc.vir.open-bs.com 4.8.6-1.el7.elrepo.x86_64 #1 SMP Mon Oct 31 12:56:11 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

S2600GZ 
1 cpu E5-2690 v.1 @ 2.90GHz
128 GB mem.
Infiniband CX-2

Comment 3 Michal Skrivanek 2016-11-18 07:55:31 UTC
I see way too many network operations in supervdsm log, let's start there

Also, MOM is still suffering (or rather, others are suffering due to MOM) from bug 1374988.

Comment 4 Badalyan Vyacheslav 2016-11-18 12:56:22 UTC
The Hosts tab always shows 100% network load during migration because we use a 40 Gbit InfiniBand setup. I do not think it is related to the problem. Immediately after the restart, SuperVDSM begins to eat 200% CPU; the load does not depend on the network interface.

Comment 5 Dan Kenigsberg 2016-11-20 10:28:49 UTC
There are plenty of network_caps calls because Vdsm is bombarded with getCapabilities calls every 10 seconds.
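
A cadence like this can be checked from the log with a few lines of Python. This is a rough sketch only: the sample lines, the timestamp layout (`YYYY-MM-DD HH:MM:SS,mmm`) and the helper names are assumptions for illustration, not the actual supervdsm log format.

```python
from datetime import datetime
from statistics import median

# Hypothetical log lines; the real supervdsm.log layout may differ.
SAMPLE_LOG = [
    "2016-11-17 09:58:01,120 call network_caps",
    "2016-11-17 09:58:05,000 call something_else",
    "2016-11-17 09:58:11,130 call network_caps",
    "2016-11-17 09:58:21,110 call network_caps",
]


def call_intervals(lines, marker):
    """Return the gaps, in seconds, between consecutive lines containing marker."""
    stamps = []
    for line in lines:
        if marker not in line:
            continue
        # Assumed layout: the first 23 characters hold the timestamp.
        stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


def typical_interval(lines, marker):
    """Median gap between matching calls, or None with fewer than two matches."""
    gaps = call_intervals(lines, marker)
    return median(gaps) if gaps else None


if __name__ == "__main__":
    print(typical_interval(SAMPLE_LOG, "network_caps"))
```

A median gap near 10 seconds on the real log would be consistent with a single periodic caller rather than a burst of unrelated requests.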

Michal, can you tell when this can happen?

Badalyan, can you attach engine logs, and provide its version info?

Can you tell more about the state of the host in Engine? What is it doing? Is it different from other hosts in your datacenter? Does the load disappear when you move the host to maintenance? (It should.)

Comment 6 Michal Skrivanek 2016-11-23 18:34:34 UTC
Ah. All those are originating from localhost over xmlrpc - hence it is MOM. 
Moving to SLA, thanks Dan. 

Martin, can you check/confirm?

Comment 7 Martin Sivák 2016-11-23 19:25:49 UTC
MOM does not use getCapabilities at all. However the hosted engine services do in the bridge monitoring code. The 10 sec period would indicate that as well.

But one request per 10 seconds should not be that hard to process (definitely not hard enough to eat 200% of cpu). The other MOM issue Michal is referencing is already fixed in 4.0.6 (and it does not use supervdsm anyway).

Can you please test what happens when you disable mom-vdsm and hosted engine services (ovirt-ha-agent, ovirt-ha-broker)?

Comment 8 Doron Fediuck 2016-11-28 12:29:23 UTC
Meital,
can you try and reproduce?

Comment 9 Martin Sivák 2016-12-07 11:20:01 UTC
Can you please test what happens when you disable mom-vdsm and hosted engine services (ovirt-ha-agent, ovirt-ha-broker)?

Comment 10 Doron Fediuck 2016-12-14 11:46:42 UTC
Closing as we're missing the information.
If you can provide the relevant log files, please reopen and attach the files.

