Bug 1003588 - Paused VM not unpaused when vdsm is starting and storage domain is valid
Summary: Paused VM not unpaused when vdsm is starting and storage domain is valid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 1036358 3.3snap4
 
Reported: 2013-09-02 12:47 UTC by Leonid Natapov
Modified: 2016-02-10 19:25 UTC
CC: 11 users

Fixed In Version: is26
Doc Type: Bug Fix
Doc Text:
The domain monitor thread did not provide notifications when going from an unknown state to a known state: it started with a bogus valid state before ever checking the domain, so if the first domain check found the domain valid, no notification was emitted. As a result, virtual machines that were paused while connecting to the pool were not unpaused as they should have been. Now the monitor thread detects the first state change and emits a notification with the domain state, and domain state-change callbacks are registered before the monitor threads are started. Paused virtual machines are now unpaused when VDSM starts and the domain is valid.
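
To illustrate the fix, here is a minimal sketch of the corrected monitor logic, assuming simplified, illustrative names (DomainMonitorThread, isAccessible and the callback signature are stand-ins, not the actual VDSM API). The key point is that the thread starts in an UNKNOWN state, so its first real check always counts as a state change:

    import threading

    UNKNOWN, VALID, INVALID = "unknown", "valid", "invalid"

    class DomainMonitorThread(threading.Thread):
        def __init__(self, domain, stateChangeCallback, interval=10):
            super(DomainMonitorThread, self).__init__()
            self.daemon = True
            self.domain = domain
            self.callback = stateChangeCallback
            self.interval = interval
            # Start from UNKNOWN rather than a bogus VALID state, so
            # the first check is always seen as a state change.
            self.status = UNKNOWN
            self._stop = threading.Event()

        def run(self):
            while not self._stop.wait(self.interval):
                newStatus = VALID if self.domain.isAccessible() else INVALID
                if newStatus != self.status:
                    self.status = newStatus
                    # Emitted even for the first UNKNOWN -> VALID
                    # transition, so EIO-paused VMs can be resumed
                    # right after VDSM starts.
                    self.callback(self.domain.sdUUID, newStatus == VALID)

        def stop(self):
            self._stop.set()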
Clone Of:
: 1036358
Environment:
Last Closed: 2014-01-21 16:14:37 UTC
oVirt Team: Storage
Target Upstream Version:
scohen: Triaged+


Attachments
vdsm log (3.35 MB, text/plain)
2013-09-02 12:47 UTC, Leonid Natapov


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC
oVirt gerrit 21649 0 'None' MERGED vm: Fix vm unpausing during recovery 2021-01-15 17:21:03 UTC
oVirt gerrit 21782 0 'None' MERGED vm: Fix vm unpausing during recovery 2021-01-15 17:21:03 UTC

Description Leonid Natapov 2013-09-02 12:47:05 UTC
Created attachment 792817 [details]
vdsm log

Description of problem:
A VM is not automatically unpaused after an I/O error in an iSCSI environment.

A VM that goes to the "paused" state because of I/O problems should be automatically unpaused when the I/O problem is solved. This does not happen in an iSCSI environment.



Version-Release number of selected component (if applicable):
is11. vdsm-4.12.0-72.git287bb7e.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with a thin provisioned disk on iSCSI storage.
2. Install an OS on the VM.
3. Connect to the VM and start writing data to its HD (create a big file).
4. Block the connection to the SD from the SPM host while data is being written to the VM's HD.
5. See that the VM goes to the "paused" state.
6. Resume the connection from the SPM host to the storage domain.

Actual results:
VM stays in "paused" mode.

Expected results:
Once the connection to the SD is resumed, the VM turns from "paused" to "active".


Additional info:
vdsm log attached.

Comment 1 Elad 2013-09-24 14:36:34 UTC
This feature works for me. I checked on is16.

Comment 2 Ayal Baron 2013-10-09 12:09:22 UTC
Auto resume depends on domain monitoring (a failed domain coming back up causes VMs to be unpaused). The situation here looks like the domain monitoring for this domain stopped for some reason (so naturally the VM wouldn't be resumed).
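
For context, the auto-resume path amounts to something like this hedged sketch (vmContainer, pauseCode and cont() are simplified, illustrative stand-ins for the real clientIF.contEIOVms code):

    def contEIOVms(vmContainer, sdUUID, domainIsValid):
        # Called by domain monitoring on each state change; only a
        # domain coming back up is interesting here.
        if not domainIsValid:
            return
        for vm in vmContainer.values():
            # Resume only guests that were paused on an I/O error.
            if vm.status == "Paused" and vm.pauseCode == "EIO":
                vm.cont()

If the monitor thread for the domain is not running, or never reports a state change, this callback is simply never invoked, which matches the behavior described above.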

Comment 3 Katarzyna Jachim 2013-10-16 10:05:10 UTC
From what I have noticed:
* automatic unpausing doesn't work if we have an environment with only one host and one storage domain and we block the connection between them - in such a case, after unblocking the connection the DC is eventually up, but the VM stays paused
* if we have a DC with one storage domain and two hosts and the VM is running on the non-SPM host, unpausing works fine
* if we have a DC with one storage domain and two hosts and the VM is running on the SPM host, automatic unpausing doesn't work

Comment 4 Federico Simoncelli 2013-11-15 11:10:56 UTC
Relevant log lines:

MainThread::INFO::2013-09-02 15:06:10,303::vdsm::101::vds::(run) (PID: 1957) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev purple-vds3.qa.lab.tlv.redhat.com (2.6.32-358.11.1.el6.x86_64)

MainThread::INFO::2013-09-02 15:06:13,081::logUtils::44::dispatcher::(wrapper) Run and protect: registerDomainStateChangeCallback(callbackFunc=<bound method clientIF.contEIOVms of <clientIF.clientIF instance at 0x7f0f0d2c8128>>)

Thread-22::INFO::2013-09-02 15:06:34,437::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='5849b030-626e-47cb-ad90-3ce782d831b3', hostID=1, scsiKey='5849b030-626e-47cb-ad90-3ce782d831b3', msdUUID='df72dcde-837d-4a82-8427-a7e77d2e239e', masterVersion=1, options=None)

libvirtEventLoop::INFO::2013-09-02 15:06:34,563::vm::4142::vm.Vm::(_onAbnormalStop) vmId=`de8f50ba-de90-45c4-84c8-66cb5bb2ffd1`::abnormal vm stop device virtio-disk0 error eio

Thread-28::DEBUG::2013-09-02 15:06:42,806::BindingXMLRPC::981::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'guestFQDN': '', 'pid': '29516', 'displayIp': '0', 'displayPort': u'5900', 'session': 'Unknown', 'displaySecurePort': u'5901', 'timeOffset': '0', 'hash': '1654317978148629509', 'balloonInfo': {'balloon_max': '1048576', 'balloon_target': '1048576', 'balloon_cur': '1048576', 'balloon_min': '1048576'}, 'pauseCode': 'EIO', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet0': {'macAddr': '00:1a:4a:0e:00:a7', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet0'}}, 'vmId': 'de8f50ba-de90-45c4-84c8-66cb5bb2ffd1', 'displayType': 'qxl', 'cpuUser': '-0.21', 'disks': {u'vda': {'truesize': '1073741824', 'apparentsize': '1073741824', 'imageID': '29b702bf-4830-4856-b800-fabba21b6361'}, u'hdc': {'truesize': '0', 'apparentsize': '0'}}, 'monitorResponse': '0', 'statsAge': '1.81', 'elapsedTime': '1093', 'vmType': 'kvm', 'cpuSys': '0.77', 'appsList': [], 'guestIPs': ''}]}

Thread-170::DEBUG::2013-09-02 15:10:07,880::domainMonitor::151::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for df72dcde-837d-4a82-8427-a7e77d2e239e

MainThread::INFO::2013-09-02 15:28:52,873::vdsm::101::vds::(run) (PID: 15010) I am the actual vdsm 4.12.0-72.git287bb7e.el6ev purple-vds3.qa.lab.tlv.redhat.com (2.6.32-358.11.1.el6.x86_64)

MainThread::INFO::2013-09-02 15:28:54,055::logUtils::44::dispatcher::(wrapper) Run and protect: registerDomainStateChangeCallback(callbackFunc=<bound method clientIF.contEIOVms of <clientIF.clientIF instance at 0x7f8827ea5128>>)

Thread-86::INFO::2013-09-02 15:30:40,432::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='5849b030-626e-47cb-ad90-3ce782d831b3', hostID=1, scsiKey='5849b030-626e-47cb-ad90-3ce782d831b3', msdUUID='df72dcde-837d-4a82-8427-a7e77d2e239e', masterVersion=1, options=None)

Thread-89::DEBUG::2013-09-02 15:30:48,200::domainMonitor::151::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for df72dcde-837d-4a82-8427-a7e77d2e239e

Thread-86::INFO::2013-09-02 15:30:49,698::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True


It seems to me that when the host is the SPM and the blocked domain is the master, what happens is:

- VM is up
- connection to master domain is blocked
- vdsm is restarted (in order to release the SPM role)
- domain state change callback is registered (but the pool is not connected)
- connectStoragePool keeps failing
- the connection to the master domain is unblocked
- connectStoragePool succeeds but there is no event about the domains being accessible

I think this goes back to the old discussion about having a "domain is up" event also on connection.
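
Concretely, the fix has to guarantee an ordering like the following sketch (assumed, simplified names, reusing the DomainMonitorThread idea from the doc text above): the state-change callback is registered before any monitor thread performs its first check, so the "domain is up" transition observed while connecting the pool still triggers the unpause:

    def startupSequence(clientIF, irs, domains):
        # 1. Register the callback before any monitoring starts.
        irs.registerDomainStateChangeCallback(clientIF.contEIOVms)
        # 2. Only then start the per-domain monitor threads. Each one
        #    begins in UNKNOWN state, so its first successful check
        #    reports a state change and the callback can resume
        #    EIO-paused VMs.
        for domain in domains:
            DomainMonitorThread(domain, clientIF.contEIOVms).start()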

With regard to comment 3, I'd add another test:

- one host only (SPM), two storage domains, block the connection to the non-master domain

In such a case the resume should work with the current code.

I also think that this issue is not limited to iSCSI.

Comment 5 Elad 2013-12-15 14:56:02 UTC
Checked on both NFS and iSCSI. After connectivity to the master domain is resumed, the VM is started from the paused state back to up.

Verified using is26

Comment 6 errata-xmlrpc 2014-01-21 16:14:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

