Description of problem:
VMs intermittently go into "not responding" status and then come back within around 20 seconds. While a VM is in the unresponsive state, I'm unable to perform any action on it from the ovirt-engine. I'm running about 40 VMs across 3 hosts.

Version-Release number of selected component (if applicable):
ovirt 4.1.1

How reproducible:
It seems to happen intermittently throughout the day, with no real pattern.

Steps to Reproduce:
1. Wait for it to happen on its own, or put one of the hosts into maintenance mode; this triggers VM migration, and sometimes a lot of VMs go into the unresponsive state.
2.
3.

Actual results:

Expected results:
Neither CPU, memory nor network is heavily loaded (or loaded at all) when this happens, so I expect the reporting of VM status to be stable.

Additional info:
The vdsm logs on the hosts show many messages like this:

2017-05-11 09:31:06,112+0100 WARN (jsonrpc/2) [virt.vm] (vmId='c9618969-891b-46b3-b0dc-cf3a29c27bb1') monitor became unresponsive (command timeout, age=64.8899999997) (vm:5013)

engine logs:

2017-05-11 09:30:14,344+01 INFO [org.ovirt.engine.core.bll.aaa.SessionDataContainer] (DefaultQuartzScheduler10) [877a5ba1-3836-4e0f-a1bd-b43c5b40ad5d] Not removing session 'la6mBsKLPeLesZjR9ZxonPw7TIYxqXePB9Tw/q2y7UdB56RXIacI04VeCSZ1tdFjss+COlNaxTtDmG8uymwFQQ==', session has running commands for user 'admin@internal-authz'.
2017-05-11 09:30:26,438+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '9091cea4-3b82-4a02-8b08-cd220abd1321'(Win7_x64test13) moved from 'NotResponding' --> 'Up'
2017-05-11 09:30:26,440+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM 'b761677d-e7d3-4849-88c1-196a39b398f7'(Win7_x64test17) moved from 'NotResponding' --> 'Up'
2017-05-11 09:30:26,441+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '2194ebf1-7ae9-4423-9952-4cb1de36eef0'(Win7_x64test15) moved from 'Up' --> 'NotResponding'
2017-05-11 09:30:26,495+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM Win7_x64test15 is not responding.
2017-05-11 09:30:26,496+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] VM '883e1b7f-41b8-47cf-a381-c1de2c4b6e72'(Win7_x64test11) moved from 'Up' --> 'NotResponding'
2017-05-11 09:30:26,512+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [825539cf-8b47-4033-8ed2-9b4297c6d626] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM Win7_x64test11 is not responding.
2017-05-11 09:30:53,216+01 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.EvenGuestDistributionBalancePolicyUnit] (DefaultQuartzScheduler7) [825539cf-8b47-4033-8ed2-9b4297c6d626] There is no host with less than 9 running guests
2017-05-11 09:30:53,216+01 WARN [org.ovirt.engine.core.bll.scheduling.policyunits.CpuAndMemoryBalancingPolicyUnit] (DefaultQuartzScheduler7) [825539cf-8b47-4033-8ed2-9b4297c6d626] All candidate hosts have been filtered, can't balance the cluster 'Cluster01' based on the CPU usage, will try memory based approach
Typically this indicates an overloaded host. Please add some performance indicators: CPU usage, load, sar, iotop, etc.
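For reference, a minimal way to grab the load and CPU figures being asked for here, as a sketch over the standard Linux /proc files (this is only a quick snapshot; the sar and iotop output would still be wanted separately):

```shell
#!/bin/sh
# Quick host-load snapshot from standard /proc files.
# /proc/loadavg fields: 1min 5min 15min running/total last-pid
read load1 load5 load15 _rest < /proc/loadavg
echo "load averages: 1m=$load1 5m=$load5 15m=$load15"

# Aggregate CPU line of /proc/stat: cpu user nice system idle iowait ...
# The idle and iowait tick counters since boot give a rough idea of CPU pressure.
awk '/^cpu / { printf "cpu ticks: idle=%s iowait=%s\n", $5, $6 }' /proc/stat
```

Sampling this a few times around an unresponsive episode would show whether load or iowait actually spikes.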
Created attachment 1279016 [details]
sar logs for the 3 oVirt hosts

Some sar logs.
Hi, thanks for getting back to me. As I mentioned in my post, the hosts don't seem to be loaded at all. oVirt shows about 45% memory usage per host, CPU usually at 1%, and network at 0%. These VMs are just powered on: no special services running, just idling Windows 7 VMs, so there's no reason for the hosts to be overloaded. I've attached sar logs for the day for each host anyway.
Hey, OK, in that case we need some more logs. Can you please provide the full VDSM, libvirt, QEMU and engine logs here?
Created attachment 1279243 [details] vdsm log node01
Created attachment 1279245 [details] vdsm log node02
Created attachment 1279246 [details] vdsm log node03
Created attachment 1279249 [details] ovirt engine logs
I noticed there aren't any libvirt logs in /var/log/vdsm/ on any of the hosts. The last time the host went to "not responding" was 02:52:02 AM, and there have been intermittent events until 10:46:07 according to the engine events. The libvirtd service status shows the following:

libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf
   Active: active (running) since Fri 2017-05-05 15:14:39 BST; 1 weeks 3 days ago
     Docs: man:libvirtd(8)
           http://libvirt.org
 Main PID: 3996 (libvirtd)
   CGroup: /system.slice/libvirtd.service
           └─3996 /usr/sbin/libvirtd --listen

May 16 05:28:10 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (45s, 0s)
May 16 05:28:10 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:28:41 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (76s, 0s)
May 16 05:28:41 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:30:25 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test19; current job is (query, none) owned by (4001 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
May 16 05:30:25 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:32:35 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test16; current job is (query, none) owned by (4002 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
May 16 05:32:35 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
May 16 05:33:05 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test16; current job is (query, none) owned by (4002 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (60s, 0s)
May 16 05:33:05 node03.loc libvirtd[3996]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
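Side note: the useful number in these journal entries is the age in the trailing "(Ns, 0s)", i.e. how long the GetAllDomainStats call had been holding the state change lock. It can be pulled out in one pass with standard sed; a sketch, run here over two of the lines above:

```shell
# Extract the domain name and the lock-hold age from libvirtd's
# "Cannot start job" messages (sample lines copied from the journal above).
sed -n 's/.*for domain \([^;]*\);.*for (\([0-9]*\)s.*/\1 \2s/p' <<'EOF'
May 16 05:28:10 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (45s, 0s)
May 16 05:28:41 node03.loc libvirtd[3996]: Cannot start job (query, none) for domain Win7_x64test14; current job is (query, none) owned by (46498 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (76s, 0s)
EOF
# prints:
# Win7_x64test14 45s
# Win7_x64test14 76s
```

Ages growing across repeated messages for the same domain are consistent with the "monitor became unresponsive (command timeout, age=...)" warnings vdsm logs on the same hosts.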
This looks like the same issue as bug 1419856, which will be fixed in 4.1.2. I'm closing this bug as a duplicate of that one. If you still face the same issue after updating to 4.1.2, please reopen and we can keep investigating.

*** This bug has been marked as a duplicate of bug 1419856 ***