Created attachment 627427 [details]
logs

Description of problem:
I ran `kill -STOP` on the qemu process, and the running VM kept changing state in vdsm between Up and NotResponding. Looking at the libvirt log, the VM appears to be reported as Up between virDomainGetBlockInfo calls, until each call returns with "cannot acquire state change lock".

Version-Release number of selected component (if applicable):
vdsm-4.9.6-37.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.5.x86_64
qemu-img-rhev-0.12.1.2-2.295.el6_3.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run kill -STOP <qemu pid>

Actual results:
The VM state keeps changing to Up between queries, and once a query returns it changes back to NotResponding.

Expected results:
The VM should remain in the NotResponding state until queries return correctly.

Additional info: logs

2012-10-15 12:40:07.000+0000: 30250: debug : virDomainFree:2313 : dom=0x7f30a4000f80, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915)
2012-10-15 12:40:09.010+0000: 30248: debug : virDomainGetBlockInfo:7768 : dom=0x7f30b0000a10, (VM: name=TESTI-01, uuid=4a89ef70-86c5-4f63-93d6-4432c0176915), info=0x7f30c1678b20, flags=0
2012-10-15 12:40:39.000+0000: 30248: error : qemuDomainObjBeginJobInternal:776 : Timed out during operation: cannot acquire state change lock

engine:
2012-10-15 15:20:21,999 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-96) [78ebe1b3] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:21:27,974 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-68) [2043e9e6] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:22:06,748 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-70) [444f5223] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:22:32,081 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-95) [2e9121ee] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:23:09,001 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-87) [57adc17b] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
2012-10-15 15:24:08,006 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-41) [244a2be8] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from NotResponding --> Up
2012-10-15 15:24:42,704 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-9) [44f62617] VM TESTI-01 4a89ef70-86c5-4f63-93d6-4432c0176915 moved from Up --> NotResponding
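For anyone trying to reproduce the hang outside a full RHEV setup: SIGSTOP moves the target process into the stopped ('T') state, which is why libvirt's monitor calls stop getting answers. A minimal sketch of the mechanism, using a `sleep` process as a stand-in for qemu (the helper name is mine, not part of vdsm; Linux-only, since it reads /proc):

```python
import os
import signal
import subprocess
import time

def proc_state(pid):
    """Read the process state letter (R/S/T/...) from /proc/<pid>/stat."""
    with open("/proc/%d/stat" % pid) as f:
        # Field 3 is the state; field 2 (comm) may contain spaces,
        # so split from the closing parenthesis.
        return f.read().rsplit(")", 1)[1].split()[0]

# Stand-in for the qemu process
child = subprocess.Popen(["sleep", "60"])
time.sleep(0.2)

os.kill(child.pid, signal.SIGSTOP)   # what the reporter did to qemu
time.sleep(0.2)
print(proc_state(child.pid))         # 'T': stopped; the monitor would hang

os.kill(child.pid, signal.SIGCONT)   # resume
time.sleep(0.2)
print(proc_state(child.pid))         # 'S': sleeping again

child.terminate()
child.wait()
```

While the process is in 'T', any qemu monitor command issued by libvirt blocks, which is what eventually produces the "cannot acquire state change lock" timeout above.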
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
still relevant?
Cannot reproduce yet using:
vdsm 4.14.0 (from git master)
libvirt 1.0.5.8
qemu-kvm 1.4.2

using iSCSI storage built with scsi-target-utils-1.0.38 and iscsi-initiator-utils-6.2.0.873-17.

The VM stays 'Not Responding' the whole time and does not bounce between states as reported.
Created attachment 849406 [details] stats data and extraction script
Analysis of the log data suggests that the problem is that the monitor is sometimes mistakenly detected as Up.

Example data (key: logdate logtime monitorResponse(value) statsTime):

2012-10-15 14:53:18,267 monitorResponse=(-1) 25.26
2012-10-15 14:53:22,503 monitorResponse=(-1) 29.49
2012-10-15 14:53:24,554 monitorResponse=(0) 0.55
2012-10-15 14:54:03,687 monitorResponse=(-1) 6.68
2012-10-15 14:54:05,731 monitorResponse=(-1) 8.73
2012-10-15 14:54:07,769 monitorResponse=(-1) 10.77
2012-10-15 14:54:09,815 monitorResponse=(-1) 12.81
2012-10-15 14:54:14,021 monitorResponse=(-1) 17.02
2012-10-15 14:54:16,091 monitorResponse=(-1) 19.09
2012-10-15 14:54:18,126 monitorResponse=(-1) 21.12
2012-10-15 14:54:20,164 monitorResponse=(-1) 23.16
2012-10-15 14:54:24,343 monitorResponse=(-1) 27.34
2012-10-15 14:54:26,381 monitorResponse=(-1) 29.38
2012-10-15 14:54:28,575 monitorResponse=(-1) 31.57
2012-10-15 14:54:30,621 monitorResponse=(-1) 1.62
2012-10-15 14:54:35,379 monitorResponse=(-1) 6.38
2012-10-15 14:54:37,418 monitorResponse=(-1) 8.41
2012-10-15 14:54:39,469 monitorResponse=(-1) 10.47
2012-10-15 14:54:41,539 monitorResponse=(-1) 12.53
2012-10-15 14:54:46,007 monitorResponse=(-1) 17.00
2012-10-15 14:54:48,046 monitorResponse=(-1) 19.04
2012-10-15 14:54:50,085 monitorResponse=(-1) 21.08
2012-10-15 14:54:52,125 monitorResponse=(-1) 23.12
2012-10-15 14:54:56,366 monitorResponse=(-1) 27.36
2012-10-15 14:54:58,411 monitorResponse=(-1) 29.41
2012-10-15 14:55:00,584 monitorResponse=(0) 31.58
2012-10-15 14:55:37,436 monitorResponse=(-1) 4.43

where monitorResponse and statsTime are taken verbatim from the output:
monitorResponse: 0 if the monitor is detected responsive, -1 otherwise
statsTime: delta (secs) wrt the last sample time

Data extracted with

$ grep monitorResponse vdsm.log | getstats.py

where getstats.py is attached; the full data is attached as well.

A patch to mitigate the effects of this bouncing behaviour is being discussed at http://gerrit.ovirt.org/#/c/23138/
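The attached getstats.py is not reproduced here, but a rough equivalent of the extraction can be sketched as follows. This is my own reimplementation, not the attached script; the regex targets the "logdate logtime monitorResponse=(value) statsTime" summary format shown above, and the real vdsm log lines may carry extra fields:

```python
import re
import sys
from datetime import datetime

# Matches summary lines such as:
#   2012-10-15 14:53:24,554 monitorResponse=(0) 0.55
LINE_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2},\d{3})"
    r".*monitorResponse=\((?P<resp>-?\d+)\)\s+(?P<stats>[\d.]+)"
)

def parse(lines):
    """Yield (timestamp, monitorResponse, statsTime) for each matching line."""
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(
            "%s %s" % (m.group("date"), m.group("time")),
            "%Y-%m-%d %H:%M:%S,%f")
        yield ts, int(m.group("resp")), float(m.group("stats"))

if __name__ == "__main__":
    # usage: grep monitorResponse vdsm.log | python getstats_sketch.py
    for ts, resp, stats in parse(sys.stdin):
        flag = "responsive" if resp == 0 else "NOT responding"
        print("%s %-14s statsTime=%.2f" % (ts, flag, stats))
```

In the sample above, the isolated monitorResponse=(0) entries in the middle of long runs of -1 are exactly the spurious "monitor is Up" detections that make the engine bounce the VM back to Up.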
We haven't had a proper discussion about the desired behavior: which issues should trigger the non-responsive state and which we can ignore. Moving out of 3.5.
The new sampling code using bulk stats, included in 3.6.0, is expected to solve this issue among other general improvements.
Verified:
rhevm-3.6.0.1-0.1.el6 (3.6.0-17)
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario:
1. Run a VM.
2. kill -STOP <qemu pid>
   Verify the VM state does not flip between up and down in the libvirt, VDSM and engine logs (it stays NotResponding).
   Verify no other ERROR occurs in the logs.
   Verify the VM state remains up in webadmin (though no actions can be performed from the console in this state).
3. kill -CONT <qemu pid>
   Verify the VM continues to run properly.
4. Repeat steps 1-3 a few times.
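The repeat-a-few-times part of the scenario can be approximated in a small harness. The sketch below cycles SIGSTOP/SIGCONT against a stand-in process the way steps 2-3 do against qemu; the helper names are mine, and a real verification would watch the engine/vdsm logs for state changes instead of reading /proc:

```python
import os
import signal
import subprocess
import time

def proc_state(pid):
    """Process state letter from /proc/<pid>/stat (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return f.read().rsplit(")", 1)[1].split()[0]

def stop_cont_cycle(pid, pause=0.2):
    """One STOP/CONT round; returns the observed (stopped, resumed) states."""
    os.kill(pid, signal.SIGSTOP)
    time.sleep(pause)
    stopped = proc_state(pid)
    os.kill(pid, signal.SIGCONT)
    time.sleep(pause)
    resumed = proc_state(pid)
    return stopped, resumed

if __name__ == "__main__":
    victim = subprocess.Popen(["sleep", "60"])  # stand-in for qemu
    time.sleep(0.2)
    for _ in range(3):  # "repeat steps 1-3 a few times"
        stopped, resumed = stop_cont_cycle(victim.pid)
        assert stopped == "T" and resumed in ("S", "R")
    victim.terminate()
    victim.wait()
```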
I don't think this BZ requires doc text. This fix should be transparent to the users.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html