Description of problem:
VMs intermittently go into "not responding" status and then come back within around 20 seconds. While a VM is in the unresponsive state, I'm unable to perform any action on it from the ovirt-engine. I'm running about 40 VMs across 3 hosts.

Version-Release number of selected component (if applicable):
ovirt 4.1.1

How reproducible:
It seems to happen intermittently throughout the day, with no real pattern.

Steps to Reproduce:
1. Wait for it to happen on its own, or put one of the hosts into maintenance mode; this triggers VM migration, and sometimes a lot of VMs go into the unresponsive state.
2.
3.

Actual results:

Expected results:
Neither CPU, memory nor network is heavily loaded (or loaded at all) when this happens, so I expect the reporting of VM status to be stable.

Additional info:
The vdsm logs on the hosts show many messages like this:

2017-05-11 09:31:06,112+0100 WARN (jsonrpc/2) [virt.vm] (vmId='c9618969-891b-46b3-b0dc-cf3a29c27bb1') monitor became unresponsive (command timeout, age=64.8899999997) (vm:5013)

engine logs:

2017-05-11 09:30:14,344+01 INFO [org.ovirt.engine.core.bll.aaa.SessionDataContainer] (DefaultQuartzScheduler10) [877a5ba1-3836-4e0f-a1bd-b43c5b40ad5d] Not removing session 'la6mBsKLPeLesZjR9ZxonPw7TIYxqXePB9Tw/q2y7UdB56RXIacI04VeCSZ1tdFjss+COlNaxTtDmG8uymwFQQ==', session has running commands for user 'admin@internal-authz'.
2017-05-11 09:30:26,438+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '9091cea4-3b82-4a02-8b08-cd220abd1321'(Win7_x64test13) moved from 'NotResponding' --> 'Up'
2017-05-11 09:30:26,440+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM 'b761677d-e7d3-4849-88c1-196a39b398f7'(Win7_x64test17) moved from 'NotResponding' --> 'Up'
2017-05-11 09:30:26,441+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '2194ebf1-7ae9-4423-9952-4cb1de36eef0'(Win7_x64test15) moved from 'Up' --> 'NotResponding'
2017-05-11 09:30:26,495+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM Win7_x64test15 is not responding.
2017-05-11 09:30:26,496+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '883e1b7f-41b8-47cf-a381-c1de2c4b6e72'(Win7_x64test11) moved from 'Up' --> 'NotResponding'
2017-05-11 09:30:26,512+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM Win7_x64test11 is not responding.
2017-05-11 09:30:53,216+01 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (DefaultQuartzScheduler7) [825539cf-8b47-4033-8ed2-9b4297c6d626] There is no host with less than 9 running guests
2017-05-11 09:30:53,216+01 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler7) [825539cf-8b47-4033-8ed2-9b4297c6d626] All candidate hosts have been filtered, can't balance the cluster 'Cluster01' based on the CPU usage, will try memory based approach
Typically this indicates an overloaded host. Please add some performance indicators: CPU usage, load, sar, iotop, etc.
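For reference, a minimal way to grab the load and CPU figures being asked for here, as a sketch over the standard Linux /proc files (this is only a quick snapshot; the sar and iotop output would still be wanted separately):

```shell
#!/bin/sh
# Quick host-load snapshot from standard /proc files.
# /proc/loadavg fields: 1min 5min 15min running/total last-pid
read load1 load5 load15 _rest < /proc/loadavg
echo "load averages: 1m=$load1 5m=$load5 15m=$load15"

# Aggregate CPU line of /proc/stat: cpu user nice system idle iowait ...
# The idle and iowait tick counters since boot give a rough idea of CPU pressure.
awk '/^cpu / { printf "cpu ticks: idle=%s iowait=%s\n", $5, $6 }' /proc/stat
```

Sampling this a few times around an unresponsive episode would show whether load or iowait actually spikes.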
Created attachment 1279016 [details]
sar logs for the 3 oVirt hosts

Some sar logs.
Hi, thanks for getting back to me. As I mentioned in my post, the hosts don't seem to be loaded at all. oVirt shows about 45% memory usage per host, CPU usually at 1%, and network at 0%. These VMs are just powered on: no special services running, just idling Windows 7 VMs, so there's no reason for the hosts to be overloaded. I've attached sar logs for the day for each host anyway.
Hey, OK, in that case we need some more logs. Can you please provide the full VDSM, libvirt, QEMU and engine logs here?
Created attachment 1279243 [details] vdsm log node01
Created attachment 1279245 [details] vdsm log node02
Created attachment 1279246 [details] vdsm log node03
Created attachment 1279249 [details] ovirt engine logs
I noticed there aren't any libvirt logs in /var/log/vdsm/ on any of the hosts. The last time the host went to "not responding" was 02:52:02 AM, and there have been intermittent events until 10:46:07 according to the engine events. The libvirtd service status shows the following:

libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf
   Active: active (running) since Fri 2017-05-05 15:14:39 BST; 1 weeks 3 days ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 3996 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           └─3996 /usr/sbin/libvirtd --listen

May 16 05:28:10 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (45s, 0s)
May 16 05:28:10 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:28:41 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (76s, 0s)
May 16 05:28:41 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:30:25 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test19; current job is (query, none) owned by (4001 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
May 16 05:30:25 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:32:35 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test16; current job is (query, none) owned by (4002 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
May 16 05:32:35 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:33:05 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test16; current job is (query, none) owned by (4002 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (60s, 0s)
May 16 05:33:05 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
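Side note: the useful number in these journal entries is the age in the trailing "(Ns, 0s)", i.e. how long the GetAllDomainStats call had been holding the state change lock. It can be pulled out in one pass with standard sed; a sketch, run here over two of the lines above:

```shell
# Extract the domain name and the lock-hold age from libvirtd's
# "Cannot start job" messages (sample lines copied from the journal above).
sed -n 's/.*for domain \([^;]*\);.*for (\([0-9]*\)s.*/\1 \2s/p' <<'EOF'
May 16 05:28:10 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (45s, 0s)
May 16 05:28:41 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (76s, 0s)
EOF
# prints:
# Win7_x64test14 45s
# Win7_x64test14 76s
```

Ages growing across repeated messages for the same domain are consistent with the "monitor became unresponsive (command timeout, age=...)" warnings vdsm logs on the same hosts.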
This looks like the same issue as bug 1419856, which will be fixed in 4.1.2. I'm closing this bug as a duplicate of that one. If you still face the same issue after updating to 4.1.2, please reopen and we can keep investigating.

*** This bug has been marked as a duplicate of bug 1419856 ***