Bug 866484 - vm state keeps changing from up to non-responsive between libvirt virDomainGetBlockInfo queries
vm state keeps changing from up to non-responsive between libvirt virDomainGe...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm (Show other bugs)
unspecified
x86_64 Linux
medium Severity high
: ovirt-3.6.0-rc
: 3.6.0
Assigned To: Francesco Romani
Nisim Simsolo
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-10-15 09:31 EDT by Dafna Ron
Modified: 2016-03-09 14:24 EST (History)
9 users (show)

See Also:
Fixed In Version: 3.6.0-4 alpha3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-09 14:24:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (1.01 MB, application/x-gzip)
2012-10-15 09:31 EDT, Dafna Ron
no flags Details
stats data and extraction script (5.57 KB, application/x-compressed-tar)
2014-01-13 09:34 EST, Francesco Romani
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 23138 master ABANDONED vm: pessimistic monitor response check Never

  None (edit)
Description Dafna Ron 2012-10-15 09:31:09 EDT
Created attachment 627427 [details]
logs

Description of problem:

I ran kill -STOP on qemu process and the vm running kept changing state in vdsm from UP to UP*
looking at libvirt log the vm seems to change to UP state between virDomainGetBlockInfo until results come back with cannot acquire state change lock each time. 

Version-Release number of selected component (if applicable):

vdsm-4.9.6-37.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:

100%

Steps to Reproduce:
1. run kill -STOP <qemu pid> 
2.
3.
  
Actual results:

vm state keeps changing between queries to UP and once the queries returns it changes to non-responding. 

Expected results:

vm should remain in non-responding state until states return correctly. 

Additional info:logs


2012-10-15 12:40:07.000+0000: 30250: debug : virDomainFree:2313 : dom=0x7f30a4000f80, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915)
2012-10-15 12:40:09.010+0000: 30248: debug : virDomainGetBlockInfo:7768 : dom=0x7f30b0000a10, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915), info=0x7f30c1678b20, flags=0
2012-10-15 12:40:39.000+0000: 30248: error : qemuDomainObjBeginJobInternal:776 : Timed out during operation: cannot acquire state change lock

engine: 

2012-10-15 15:20:21,999 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) [78ebe1b3] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:21:27,974 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-68) [2043e9e6] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:22:06,748 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-70) [444f5223] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:22:32,081 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-95) [2e9121ee] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:23:09,001 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-87) [57adc17b] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:24:08,006 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-41) [244a2be8] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:24:42,704 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-9) [44f62617] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
Comment 1 RHEL Product and Program Management 2012-12-14 02:04:15 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 3 Itamar Heim 2013-12-01 05:12:59 EST
still relevant?
Comment 5 Francesco Romani 2014-01-10 07:06:28 EST
cannot reproduce yet using:

vdsm 4.14.0 (from git master)
libvirt 1.0.5.8
qemu-kvm 1.4.2

using an iscsi storage built with
scsi-target-utils-1.0.38
iscsi-initiator-utils-6.2.0.873-17

vm stays 'Not Responding' all the time and do not bounces between states as reported.
Comment 7 Francesco Romani 2014-01-13 09:34:56 EST
Created attachment 849406 [details]
stats data and extraction script
Comment 8 Francesco Romani 2014-01-13 09:36:09 EST
Analysis of the log data suggests that the problem is that the monitor is sometimes mistakely detected as Up.

Example data (key: logdate logtime monitorResponse(value) statsTime):
2012-10-15 14:53:18,267 monitorResponse=(-1) 25.26
2012-10-15 14:53:22,503 monitorResponse=(-1) 29.49
2012-10-15 14:53:24,554 monitorResponse=(0) 0.55
2012-10-15 14:54:03,687 monitorResponse=(-1) 6.68
2012-10-15 14:54:05,731 monitorResponse=(-1) 8.73
2012-10-15 14:54:07,769 monitorResponse=(-1) 10.77
2012-10-15 14:54:09,815 monitorResponse=(-1) 12.81
2012-10-15 14:54:14,021 monitorResponse=(-1) 17.02
2012-10-15 14:54:16,091 monitorResponse=(-1) 19.09
2012-10-15 14:54:18,126 monitorResponse=(-1) 21.12
2012-10-15 14:54:20,164 monitorResponse=(-1) 23.16
2012-10-15 14:54:24,343 monitorResponse=(-1) 27.34
2012-10-15 14:54:26,381 monitorResponse=(-1) 29.38
2012-10-15 14:54:28,575 monitorResponse=(-1) 31.57
2012-10-15 14:54:30,621 monitorResponse=(-1) 1.62
2012-10-15 14:54:35,379 monitorResponse=(-1) 6.38
2012-10-15 14:54:37,418 monitorResponse=(-1) 8.41
2012-10-15 14:54:39,469 monitorResponse=(-1) 10.47
2012-10-15 14:54:41,539 monitorResponse=(-1) 12.53
2012-10-15 14:54:46,007 monitorResponse=(-1) 17.00
2012-10-15 14:54:48,046 monitorResponse=(-1) 19.04
2012-10-15 14:54:50,085 monitorResponse=(-1) 21.08
2012-10-15 14:54:52,125 monitorResponse=(-1) 23.12
2012-10-15 14:54:56,366 monitorResponse=(-1) 27.36
2012-10-15 14:54:58,411 monitorResponse=(-1) 29.41
2012-10-15 14:55:00,584 monitorResponse=(0) 31.58
2012-10-15 14:55:37,436 monitorResponse=(-1) 4.43

where monitorResponse and statsTime are taken verbatim from the output;
monitorResponse: 0 if monitor is detected responsive, -1 otherwise;
statsTime: delta (secs) wrt last sample time


data extracted with

$ grep monitorResponse vdsm.log | getstats.py

where getstats.py is attached; full data attached as well.

A patch to mitigate the effects of this bouncing behaviour is being discussed at http://gerrit.ovirt.org/#/c/23138/
Comment 9 Michal Skrivanek 2014-08-26 04:52:36 EDT
we didn't have a proper discussion about the proper behavior, which issues should trigger non-resonsive state and which we can ignore
moving out of 3.5
Comment 10 Francesco Romani 2015-06-23 10:45:05 EDT
The new sampling code using bulk stats, included in 3.6.0, is expected to solve this issue among other general improvements.
Comment 12 Nisim Simsolo 2015-11-01 11:57:43 EST
Verified: 
rhevm-3.6.0.1-0.1.el6 (3.6.0-17)
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run VM
2. kill -STOP <qemu pid>
Verify VM state does not change from up to up/down in libvirt, VDSM and engine logs (NotResponding state)
Verify no other ERROR occurs in the logs
Verify VM state remains up in webadmin (though actions cannot be done from console in this state)
3. kill -CONT <qemu pid>
Verify VM continues to run properly.
4. Repeat steps 1-3 few times.
Comment 13 Francesco Romani 2016-01-19 09:56:15 EST
I don't think this BZ requires doc text. This fix should be transparent to the users.
Comment 15 errata-xmlrpc 2016-03-09 14:24:00 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

Note You need to log in before you can comment on or make changes to this bug.