866484 – vm state keeps changing from up to non-responsive between libvirt virDomainGetBlockInfo queries

Bug 866484 - vm state keeps changing from up to non-responsive between libvirt virDomainGetBlockInfo queries

Summary: vm state keeps changing from up to non-responsive between libvirt virDomainGe...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	ovirt-3.6.0-rc
Target Release:	3.6.0
Assignee:	Francesco Romani
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-10-15 13:31 UTC by Dafna Ron
Modified:	2016-03-09 19:24 UTC (History)
CC List:	9 users (show)
Fixed In Version:	3.6.0-4 alpha3
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-03-09 19:24:00 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
logs (1.01 MB, application/x-gzip) 2012-10-15 13:31 UTC, Dafna Ron	no flags	Details
stats data and extraction script (5.57 KB, application/x-compressed-tar) 2014-01-13 14:34 UTC, Francesco Romani	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:0362	0	normal	SHIPPED_LIVE	vdsm 3.6.0 bug fix and enhancement update	2016-03-09 23:49:32 UTC
oVirt gerrit	23138	0	master	ABANDONED	vm: pessimistic monitor response check	Never

Description Dafna Ron 2012-10-15 13:31:09 UTC

Created attachment 627427 [details]
logs

Description of problem:

I ran kill -STOP on qemu process and the vm running kept changing state in vdsm from UP to UP*
looking at libvirt log the vm seems to change to UP state between virDomainGetBlockInfo until results come back with cannot acquire state change lock each time. 

Version-Release number of selected component (if applicable):

vdsm-4.9.6-37.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:

100%

Steps to Reproduce:
1. run kill -STOP <qemu pid> 
2.
3.
  
Actual results:

vm state keeps changing between queries to UP and once the queries returns it changes to non-responding. 

Expected results:

vm should remain in non-responding state until states return correctly. 

Additional info:logs


2012-10-15 12:40:07.000+0000: 30250: debug : virDomainFree:2313 : dom=0x7f30a4000f80, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915)
2012-10-15 12:40:09.010+0000: 30248: debug : virDomainGetBlockInfo:7768 : dom=0x7f30b0000a10, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915), info=0x7f30c1678b20, flags=0
2012-10-15 12:40:39.000+0000: 30248: error : qemuDomainObjBeginJobInternal:776 : Timed out during operation: cannot acquire state change lock

engine: 

2012-10-15 15:20:21,999 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) [78ebe1b3] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:21:27,974 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-68) [2043e9e6] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:22:06,748 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-70) [444f5223] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:22:32,081 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-95) [2e9121ee] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:23:09,001 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-87) [57adc17b] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:24:08,006 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-41) [244a2be8] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:24:42,704 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-9) [44f62617] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding

Comment 1 RHEL Program Management 2012-12-14 07:04:15 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 3 Itamar Heim 2013-12-01 10:12:59 UTC

still relevant?

Comment 5 Francesco Romani 2014-01-10 12:06:28 UTC

cannot reproduce yet using:

vdsm 4.14.0 (from git master)
libvirt 1.0.5.8
qemu-kvm 1.4.2

using an iscsi storage built with
scsi-target-utils-1.0.38
iscsi-initiator-utils-6.2.0.873-17

vm stays 'Not Responding' all the time and do not bounces between states as reported.

Comment 7 Francesco Romani 2014-01-13 14:34:56 UTC

Created attachment 849406 [details]
stats data and extraction script

Comment 8 Francesco Romani 2014-01-13 14:36:09 UTC

Analysis of the log data suggests that the problem is that the monitor is sometimes mistakely detected as Up.

Example data (key: logdate logtime monitorResponse(value) statsTime):
2012-10-15 14:53:18,267 monitorResponse=(-1) 25.26
2012-10-15 14:53:22,503 monitorResponse=(-1) 29.49
2012-10-15 14:53:24,554 monitorResponse=(0) 0.55
2012-10-15 14:54:03,687 monitorResponse=(-1) 6.68
2012-10-15 14:54:05,731 monitorResponse=(-1) 8.73
2012-10-15 14:54:07,769 monitorResponse=(-1) 10.77
2012-10-15 14:54:09,815 monitorResponse=(-1) 12.81
2012-10-15 14:54:14,021 monitorResponse=(-1) 17.02
2012-10-15 14:54:16,091 monitorResponse=(-1) 19.09
2012-10-15 14:54:18,126 monitorResponse=(-1) 21.12
2012-10-15 14:54:20,164 monitorResponse=(-1) 23.16
2012-10-15 14:54:24,343 monitorResponse=(-1) 27.34
2012-10-15 14:54:26,381 monitorResponse=(-1) 29.38
2012-10-15 14:54:28,575 monitorResponse=(-1) 31.57
2012-10-15 14:54:30,621 monitorResponse=(-1) 1.62
2012-10-15 14:54:35,379 monitorResponse=(-1) 6.38
2012-10-15 14:54:37,418 monitorResponse=(-1) 8.41
2012-10-15 14:54:39,469 monitorResponse=(-1) 10.47
2012-10-15 14:54:41,539 monitorResponse=(-1) 12.53
2012-10-15 14:54:46,007 monitorResponse=(-1) 17.00
2012-10-15 14:54:48,046 monitorResponse=(-1) 19.04
2012-10-15 14:54:50,085 monitorResponse=(-1) 21.08
2012-10-15 14:54:52,125 monitorResponse=(-1) 23.12
2012-10-15 14:54:56,366 monitorResponse=(-1) 27.36
2012-10-15 14:54:58,411 monitorResponse=(-1) 29.41
2012-10-15 14:55:00,584 monitorResponse=(0) 31.58
2012-10-15 14:55:37,436 monitorResponse=(-1) 4.43

where monitorResponse and statsTime are taken verbatim from the output;
monitorResponse: 0 if monitor is detected responsive, -1 otherwise;
statsTime: delta (secs) wrt last sample time


data extracted with

$ grep monitorResponse vdsm.log | getstats.py

where getstats.py is attached; full data attached as well.

A patch to mitigate the effects of this bouncing behaviour is being discussed at http://gerrit.ovirt.org/#/c/23138/

Comment 9 Michal Skrivanek 2014-08-26 08:52:36 UTC

we didn't have a proper discussion about the proper behavior, which issues should trigger non-resonsive state and which we can ignore
moving out of 3.5

Comment 10 Francesco Romani 2015-06-23 14:45:05 UTC

The new sampling code using bulk stats, included in 3.6.0, is expected to solve this issue among other general improvements.

Comment 12 Nisim Simsolo 2015-11-01 16:57:43 UTC

Verified: 
rhevm-3.6.0.1-0.1.el6 (3.6.0-17)
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run VM
2. kill -STOP <qemu pid>
Verify VM state does not change from up to up/down in libvirt, VDSM and engine logs (NotResponding state)
Verify no other ERROR occurs in the logs
Verify VM state remains up in webadmin (though actions cannot be done from console in this state)
3. kill -CONT <qemu pid>
Verify VM continues to run properly.
4. Repeat steps 1-3 few times.

Comment 13 Francesco Romani 2016-01-19 14:56:15 UTC

I don't think this BZ requires doc text. This fix should be transparent to the users.

Comment 15 errata-xmlrpc 2016-03-09 19:24:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

Note You need to log in before you can comment on or make changes to this bug.