Bug 866484 - vm state keeps changing from up to non-responsive between libvirt virDomainGetBlockInfo queries
Summary: vm state keeps changing from up to non-responsive between libvirt virDomainGe...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Francesco Romani
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-10-15 13:31 UTC by Dafna Ron
Modified: 2016-03-09 19:24 UTC (History)
9 users (show)

Fixed In Version: 3.6.0-4 alpha3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-09 19:24:00 UTC
oVirt Team: Virt
Target Upstream Version:


Attachments (Terms of Use)
logs (1.01 MB, application/x-gzip)
2012-10-15 13:31 UTC, Dafna Ron
no flags Details
stats data and extraction script (5.57 KB, application/x-compressed-tar)
2014-01-13 14:34 UTC, Francesco Romani
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 23138 master ABANDONED vm: pessimistic monitor response check Never

Description Dafna Ron 2012-10-15 13:31:09 UTC
Created attachment 627427 [details]
logs

Description of problem:

I ran kill -STOP on qemu process and the vm running kept changing state in vdsm from UP to UP*
looking at libvirt log the vm seems to change to UP state between virDomainGetBlockInfo until results come back with cannot acquire state change lock each time. 

Version-Release number of selected component (if applicable):

vdsm-4.9.6-37.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:

100%

Steps to Reproduce:
1. run kill -STOP <qemu pid> 
2.
3.
  
Actual results:

vm state keeps changing between queries to UP and once the queries returns it changes to non-responding. 

Expected results:

vm should remain in non-responding state until states return correctly. 

Additional info:logs


2012-10-15 12:40:07.000+0000: 30250: debug : virDomainFree:2313 : dom=0x7f30a4000f80, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915)
2012-10-15 12:40:09.010+0000: 30248: debug : virDomainGetBlockInfo:7768 : dom=0x7f30b0000a10, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915), info=0x7f30c1678b20, flags=0
2012-10-15 12:40:39.000+0000: 30248: error : qemuDomainObjBeginJobInternal:776 : Timed out during operation: cannot acquire state change lock

engine: 

2012-10-15 15:20:21,999 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) [78ebe1b3] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:21:27,974 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-68) [2043e9e6] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:22:06,748 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-70) [444f5223] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:22:32,081 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-95) [2e9121ee] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:23:09,001 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-87) [57adc17b] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:24:08,006 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-41) [244a2be8] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:24:42,704 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-9) [44f62617] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding

Comment 1 RHEL Program Management 2012-12-14 07:04:15 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 3 Itamar Heim 2013-12-01 10:12:59 UTC
still relevant?

Comment 5 Francesco Romani 2014-01-10 12:06:28 UTC
cannot reproduce yet using:

vdsm 4.14.0 (from git master)
libvirt 1.0.5.8
qemu-kvm 1.4.2

using an iscsi storage built with
scsi-target-utils-1.0.38
iscsi-initiator-utils-6.2.0.873-17

vm stays 'Not Responding' all the time and do not bounces between states as reported.

Comment 7 Francesco Romani 2014-01-13 14:34:56 UTC
Created attachment 849406 [details]
stats data and extraction script

Comment 8 Francesco Romani 2014-01-13 14:36:09 UTC
Analysis of the log data suggests that the problem is that the monitor is sometimes mistakely detected as Up.

Example data (key: logdate logtime monitorResponse(value) statsTime):
2012-10-15 14:53:18,267 monitorResponse=(-1) 25.26
2012-10-15 14:53:22,503 monitorResponse=(-1) 29.49
2012-10-15 14:53:24,554 monitorResponse=(0) 0.55
2012-10-15 14:54:03,687 monitorResponse=(-1) 6.68
2012-10-15 14:54:05,731 monitorResponse=(-1) 8.73
2012-10-15 14:54:07,769 monitorResponse=(-1) 10.77
2012-10-15 14:54:09,815 monitorResponse=(-1) 12.81
2012-10-15 14:54:14,021 monitorResponse=(-1) 17.02
2012-10-15 14:54:16,091 monitorResponse=(-1) 19.09
2012-10-15 14:54:18,126 monitorResponse=(-1) 21.12
2012-10-15 14:54:20,164 monitorResponse=(-1) 23.16
2012-10-15 14:54:24,343 monitorResponse=(-1) 27.34
2012-10-15 14:54:26,381 monitorResponse=(-1) 29.38
2012-10-15 14:54:28,575 monitorResponse=(-1) 31.57
2012-10-15 14:54:30,621 monitorResponse=(-1) 1.62
2012-10-15 14:54:35,379 monitorResponse=(-1) 6.38
2012-10-15 14:54:37,418 monitorResponse=(-1) 8.41
2012-10-15 14:54:39,469 monitorResponse=(-1) 10.47
2012-10-15 14:54:41,539 monitorResponse=(-1) 12.53
2012-10-15 14:54:46,007 monitorResponse=(-1) 17.00
2012-10-15 14:54:48,046 monitorResponse=(-1) 19.04
2012-10-15 14:54:50,085 monitorResponse=(-1) 21.08
2012-10-15 14:54:52,125 monitorResponse=(-1) 23.12
2012-10-15 14:54:56,366 monitorResponse=(-1) 27.36
2012-10-15 14:54:58,411 monitorResponse=(-1) 29.41
2012-10-15 14:55:00,584 monitorResponse=(0) 31.58
2012-10-15 14:55:37,436 monitorResponse=(-1) 4.43

where monitorResponse and statsTime are taken verbatim from the output;
monitorResponse: 0 if monitor is detected responsive, -1 otherwise;
statsTime: delta (secs) wrt last sample time


data extracted with

$ grep monitorResponse vdsm.log | getstats.py

where getstats.py is attached; full data attached as well.

A patch to mitigate the effects of this bouncing behaviour is being discussed at http://gerrit.ovirt.org/#/c/23138/

Comment 9 Michal Skrivanek 2014-08-26 08:52:36 UTC
we didn't have a proper discussion about the proper behavior, which issues should trigger non-resonsive state and which we can ignore
moving out of 3.5

Comment 10 Francesco Romani 2015-06-23 14:45:05 UTC
The new sampling code using bulk stats, included in 3.6.0, is expected to solve this issue among other general improvements.

Comment 12 Nisim Simsolo 2015-11-01 16:57:43 UTC
Verified: 
rhevm-3.6.0.1-0.1.el6 (3.6.0-17)
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run VM
2. kill -STOP <qemu pid>
Verify VM state does not change from up to up/down in libvirt, VDSM and engine logs (NotResponding state)
Verify no other ERROR occurs in the logs
Verify VM state remains up in webadmin (though actions cannot be done from console in this state)
3. kill -CONT <qemu pid>
Verify VM continues to run properly.
4. Repeat steps 1-3 few times.

Comment 13 Francesco Romani 2016-01-19 14:56:15 UTC
I don't think this BZ requires doc text. This fix should be transparent to the users.

Comment 15 errata-xmlrpc 2016-03-09 19:24:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html


Note You need to log in before you can comment on or make changes to this bug.