Created attachment 1221519
Logs

SuperVDSM eats 200% CPU. A restart helps for only 15-30 seconds. top shows:

170861 root 15 -5 855416 37828 15808 S 200,0 0,0 6:41.95 supervdsmServer
Samples: 3K of event 'cycles:pp', Event count (approx.): 116210851800, Thread: supervdsmServer
Overhead  Command          Shared Object      Symbol
  86,04%  supervdsmServer  [kernel.kallsyms]  [k] gather_pte_stats
   6,51%  supervdsmServer  [kernel.kallsyms]  [k] gather_stats
   2,46%  supervdsmServer  [kernel.kallsyms]  [k] vm_normal_page
   0,06%  supervdsmServer  [kernel.kallsyms]  [k] native_write_msr
   0,03%  supervdsmServer  [kernel.kallsyms]  [k] sched_avg_update
   0,03%  supervdsmServer  [kernel.kallsyms]  [k] resched_curr
   0,01%  supervdsmServer  [kernel.kallsyms]  [k] __wake_up_bit
   0,00%  supervdsmServer  [kernel.kallsyms]  [k] ghes_proc_in_irq
   0,00%  supervdsmServer  [kernel.kallsyms]  [k] intel_bts_enable_local
[root@intel3 ~]# uname -a
Linux intel3.hosts.bigdc.vir.open-bs.com 4.8.6-1.el7.elrepo.x86_64 #1 SMP Mon Oct 31 12:56:11 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Hardware: S2600GZ, 1 CPU E5-2690 v.1 @ 2.90GHz, 128 GB memory, InfiniBand CX-2.
I see way too many network operations in the supervdsm log; let's start there. Also, mom is still suffering (or rather, others are suffering due to mom) from bug 1374988.
The Hosts tab always shows 100% network load during migration because we use a 40 Gigabit InfiniBand setup. I do not think it is related to the problem. Immediately after the restart, SuperVDSM begins to eat 200% CPU again; the load does not depend on the network interface.
There are plenty of network_caps calls because Vdsm is bombarded with a getCapabilities call every 10 seconds. Michal, can you tell when this can happen? Badalyan, can you attach engine logs and provide its version info? Can you tell more about the state of the host in Engine? What is it doing? Is it different from other hosts in your datacenter? Does the load disappear when you move the host to maintenance (it should)?
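For anyone who wants to verify the 10-second cadence from a log, here is a minimal Python sketch. It assumes each log line carries a timestamp of the form `YYYY-MM-DD HH:MM:SS,mmm` and mentions the call name; the sample lines are hypothetical and are not taken from the attached logs, and the helper name `call_gaps` is made up for illustration.

```python
import re
from datetime import datetime

# Assumed timestamp layout: "2016-11-16 10:00:00,100" (milliseconds after the comma).
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def call_gaps(lines, needle="network_caps"):
    """Return the seconds between consecutive log lines that mention `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        match = TS.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f"))
    # Pair each timestamp with the next one and take the difference.
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Hypothetical sample lines, only to show the expected input shape.
sample = [
    "DEBUG::2016-11-16 10:00:00,100::call network_caps with () {}",
    "DEBUG::2016-11-16 10:00:03,000::call getVmPid with () {}",
    "DEBUG::2016-11-16 10:00:10,100::call network_caps with () {}",
    "DEBUG::2016-11-16 10:00:20,600::call network_caps with () {}",
]
gaps = call_gaps(sample)
```

Gaps that cluster around 10 seconds would be consistent with a periodic poller rather than a burst of calls.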
Ah. All those are originating from localhost over xmlrpc - hence it is MOM. Moving to SLA, thanks Dan. Martin, can you check/confirm?
MOM does not use getCapabilities at all. However, the hosted engine services do, in the bridge monitoring code. The 10-second period would indicate that as well. But one request per 10 seconds should not be that hard to process (definitely not hard enough to eat 200% of CPU). The other MOM issue Michal is referencing is already fixed in 4.0.6 (and it does not use supervdsm anyway). Can you please test what happens when you disable mom-vdsm and hosted engine services (ovirt-ha-agent, ovirt-ha-broker)?
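To compare the load before and after stopping those services, one option is to sample the process CPU time from /proc directly. This is a rough sketch, assuming a Linux host; the function names are made up for illustration, and the PID has to be looked up on the host. It relies only on the documented layout of /proc/<pid>/stat, where utime and stime are fields 14 and 15, in clock ticks.

```python
import os
import time


def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the contents of /proc/<pid>/stat."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split only the part that follows the last ')'.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(pid, interval=5.0):
    """Average CPU usage of `pid` over `interval` seconds; 100.0 means one full core."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")

    def read_ticks():
        with open(f"/proc/{pid}/stat") as stat_file:
            return parse_cpu_ticks(stat_file.read())

    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    return 100.0 * (after - before) / ticks_per_second / interval
```

Calling `cpu_percent(pid)` once with the services running and once after stopping them gives two numbers that can be compared directly. The value covers all threads of the process, so it can exceed 100.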
Meital, can you try and reproduce?
Can you please test what happens when you disable mom-vdsm and hosted engine services (ovirt-ha-agent, ovirt-ha-broker)?
Closing as we're missing the information. If you can provide the relevant log files, please reopen and attach the files.