Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1054121

Summary: VM migration succeeds though storage is blocked on src host.
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: 3.3.0
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-06-24 16:04:26 UTC
Reporter: Ilanit Stein <istein>
Assignee: Francesco Romani <fromani>
QA Contact: Ilanit Stein <istein>
CC: fromani, gklein, istein, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, yeylon
Attachments:
- engine.log
- source host logs (time 2 hours behind rhevm)
- dest host logs (time 2 hours behind rhevm)
- CORRECT source host vdsm.log
- More intensive disk write, engine.log
- More intensive disk write, source host (lilach-vdsb) logs
- More intensive disk write, destination host (lilach-vdsc) logs
- Comment #17 src host vdsm log

Description Ilanit Stein 2014-01-16 09:51:08 UTC
Created attachment 850967 [details]
engine.log

Description of problem:

Block storage on the source host while migrating a RHEL VM.
Events saying "VM MIG is not responding." appear.
After ~15 min, the migration completed, even though the src host was non-operational.

Assuming the migration completed because there were no writes to disk,
I added a disk write in an endless loop in the VM, but still
the migration eventually succeeded, though it took more time (~40 min).
In the VM console, the writes to disk continued to run after the migration completed.

Version-Release number of selected component (if applicable):
rhevm- is31
libvirt - libvirt-0.10.2-29.el6_5.2.x86_64

How reproducible:
Problem was reproduced twice in a row on the same setup.

Steps to Reproduce:
1. Have two hosts (mine were rhev-h hosts)
2. Run a VM with a RHEL OS on one host
3. Migrate the VM
4. Right after, block storage on the source host by:
iptables -A OUTPUT -d <storage server> -j REJECT
At this stage the VM migration seems stuck (also reported as such in the vdsm log)
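For anyone repeating step 4, the block-and-restore pair can be sketched as below. This is only a sketch: the storage address is a placeholder, the rules are printed rather than applied (actually running them requires root on the source host), and the restore rule is an assumption mirroring the append with a delete.

```shell
#!/bin/sh
# Placeholder storage server address (TEST-NET-1; substitute the real one).
STORAGE_IP="192.0.2.10"

# Rule that rejects outbound traffic to the storage server (step 4 above):
BLOCK="iptables -A OUTPUT -d $STORAGE_IP -j REJECT"
# Matching deletion that restores connectivity after the test:
UNBLOCK="iptables -D OUTPUT -d $STORAGE_IP -j REJECT"

# Printed rather than executed, since applying them needs root:
echo "$BLOCK"
echo "$UNBLOCK"
```

Deleting with -D and the exact same match as the -A append removes precisely the rule that was added, so the host's other firewall rules are untouched.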

Expected results:
1. Migration should not have completed while storage is blocked on the source host.
2. Migration should fail - there is a separate bug for that, bug 1053561,
which is also related to bug 1045833.

Additional info:
In the logs, migration of VM "MIG" from source lilach-vdsb to destination lilach-vdsa:
1. Without disk writes, started @ 2014-Jan-15, 15:55
2. With disk writes, started @ 2014-Jan-15, 17:17

Shell script used to write to disk:
================
#!/bin/sh

while true; do
    sleep 1
    # For the test, just writing the time
    echo "$(date)" >> "infinite.txt"
done
================
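As later comments in this bug point out, writes this small can sit in the guest page cache without ever reaching storage. A variant that forces each write out would look like the sketch below; the explicit sync after each write and the bounded loop are illustration-only assumptions, not what was originally run.

```shell
#!/bin/sh
# Variant of the loop above that flushes each write to disk with sync,
# so a blocked storage path is felt immediately instead of being
# absorbed by the page cache. Bounded to 3 iterations for illustration;
# the original script loops forever.
OUT="infinite.txt"
rm -f "$OUT"

for i in 1 2 3; do
    echo "$(date)" >> "$OUT"
    sync    # push dirty pages out rather than letting the cache hold them
done
```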

Comment 1 Ilanit Stein 2014-01-16 09:52:20 UTC
Created attachment 850968 [details]
source host logs (time 2 hours behind rhevm)

Comment 2 Ilanit Stein 2014-01-16 09:52:56 UTC
Created attachment 850970 [details]
dest host logs (time 2 hours behind rhevm)

Comment 3 Michal Skrivanek 2014-01-17 11:55:36 UTC
If you really did block the storage connection, the VM should have been paused due to EIO. If it was not, it most likely means you didn't really write to disk (hence the VM happily migrates), perhaps because of a cache; writing one line once a second can easily fit there.
Can you retry with heavier disk I/O?

Anyway, first please attach the correct/relevant vdsm logs from the source.

Comment 4 Ilanit Stein 2014-01-19 21:07:57 UTC
Created attachment 852484 [details]
CORRECT source host vdsm.log

Comment 5 Francesco Romani 2014-01-20 15:10:46 UTC
If, as it seems, this is a libvirt/qemu issue, the best thing we can do in VDSM is to properly report to engine that the migration has failed.
From the logs, it looks like VDSM mistakenly reports that the migration succeeded even after an abort request was issued.

Comment 6 Ilanit Stein 2014-01-23 12:22:12 UTC
I've retested the flow in the description, on iscsi storage, on another storage server,
with the same disk-write script as mentioned in the description, but without the delay (so writes to disk occur more often).

I got the same result: Migration succeeded.

Comment 7 Ilanit Stein 2014-01-23 12:36:21 UTC
Created attachment 854362 [details]
More intensive disk write, engine.log

Comment 8 Ilanit Stein 2014-01-23 12:38:13 UTC
Created attachment 854364 [details]
More intensive disk write, source host (lilach-vdsb) logs.

Host time is 2 hours behind the engine.

Comment 9 Ilanit Stein 2014-01-23 12:40:37 UTC
Created attachment 854366 [details]
More intensive disk write, destination host (lilach-vdsc) logs.

Host time is 2 hours behind the engine.

Comment 10 Ilanit Stein 2014-01-23 12:43:58 UTC
In continuation of comment #6:
VM name: MIG
Migration start time:   2014-Jan-23, 14:08	


The VM had only these 2 events:
1. 2014-Jan-23, 14:10
VM MIG is down. Exit message: Lost connection with qemu process.
2. 2014-Jan-23, 14:08
Migration started (VM: MIG, Source: lilach-vdsb.tlv.redhat.com, Destination: lilach-vdsc.tlv.redhat.com, User: admin@internal).

At 14:42, the VM is seen in the UI in the UP state, with an uptime of 40 min.

Comment 11 Francesco Romani 2014-01-27 13:41:01 UTC
The writes are indeed more frequent, but they can still be caught by the buffer/FS cache.

A more stressful I/O test would be something along these lines:

================
#!/bin/sh

while true; do
	dd if=/dev/zero of=/tmp/iotest.img bs=1k oflag=direct count=10240
done
================

Of course, /tmp/iotest.img can be replaced with any path.
This will continuously create a 10 MiB zero-filled file, bypassing the buffer cache
thanks to oflag=direct.

The manpage of the dd command can be out of date; the `direct` flag is documented here:
http://www.gnu.org/software/coreutils/manual/html_node/dd-invocation.html

Comment 12 Michal Skrivanek 2014-03-25 10:33:03 UTC
Ilanit, did you have a chance to try the suggestion from comment #11 to reproduce this?

Comment 13 Francesco Romani 2014-07-17 13:37:24 UTC
The core issue here seems to be that the libvirt support for ABORT_ON_ERROR in migrations is not working properly. Still investigating possible flaws/workarounds in VDSM.

Comment 14 Michal Skrivanek 2014-08-22 13:01:29 UTC
This bug won't fit into the 3.5 release and is being deferred to a later release. If you deeply care about this bug and think it deserves to be re-evaluated, please let me know.

Comment 15 Ilanit Stein 2014-08-27 08:02:11 UTC
(In reply to Michal Skrivanek from comment #12)
> Ilanit, did you have a chance to try the suggestion from comment #11 to
> reproduce this?

All info to reproduce is provided in the bug.

Comment 16 Michal Skrivanek 2014-08-27 08:14:18 UTC
Ilanit, since the migration actually succeeded without any issues, it means there was no blocking I/O access.
We only provided suggestions on how to make I/O operations happen, but it all depends on your local setup, cache sizes, the actual workload on the VM, etc. Unless you make it get stuck in I/O, it's not a good enough setup to test the RHEV behavior.

Comment 17 Ilanit Stein 2014-09-08 10:40:07 UTC
Running the shell script from comment #11 and repeating the flow in the description:
Migration failed (vdsm.log): "Timed out during operation: cannot acquire state change lock". The VM turned into the paused state.
After restoring the storage connection on the host and restarting vdsmd,
the VM came up and was running on the src host.

Full error in vdsm.log:
Thread-25::INFO::2014-09-08 13:20:30,004::migration::466::vm.Vm::(monitor_migration) vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Migration Progress: 745 seconds elapsed, 99% of data processed
Thread-23::ERROR::2014-09-08 13:20:30,121::migration::259::vm.Vm::(run) vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 245, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 324, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 633, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

Is this behaviour as expected?

Comment 18 Ilanit Stein 2014-09-08 10:43:20 UTC
Created attachment 935303 [details]
Comment #17 src host vdsm log

Comment 19 Francesco Romani 2014-09-10 20:38:23 UTC
(In reply to Ilanit Stein from comment #17)
> Running shell script from comment #11, and repeating flow in description.
> Migration failed (vdsm.log): "Timed out during operation: cannot acquire
> state change lock". VM turned into pause state.
> After restoring storage connection in host, and vdsmd restart,
> the VM turned up, and was running on src host.
> 
> Full error in vdsm.log:
> Thread-25::INFO::2014-09-08
> 13:20:30,004::migration::466::vm.Vm::(monitor_migration)
> vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Migration Progress: 745 seconds
> elapsed, 99% of data processed
> Thread-23::ERROR::2014-09-08 13:20:30,121::migration::259::vm.Vm::(run)
> vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Failed to migrate
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/migration.py", line 245, in run
>     self._startUnderlyingMigration(time.time())
>   File "/usr/share/vdsm/virt/migration.py", line 324, in
> _startUnderlyingMigration
>     None, maxBandwidth)
>   File "/usr/share/vdsm/virt/vm.py", line 633, in f
>     raise toe
> TimeoutError: Timed out during operation: cannot acquire state change lock
> 
> Is this behaviour as expected?

This means libvirt can't access QEMU, most likely because the QEMU process is busy (or stuck) doing something else, most likely I/O.
This may well happen if storage is blocked on the src host while QEMU is doing I/O, and that in turn is exactly what we were looking for.

I'll be more precise once I read the logs.

Comment 20 Francesco Romani 2015-06-24 16:04:26 UTC
Many months later, with libvirt (tested u/s) 1.2.13 on CentOS 7.1, qemu 2.1.2, a much more stressful test, and some luck, it worked.

Steps to trigger the issue (please note that timing IS critical):

1. start VM
2. run I/O-intensive tasks in said VM. I used:
2.a: write-intensive load
script as per https://bugzilla.redhat.com/show_bug.cgi?id=1054121#c11
2.b: read-intensive load
#!/bin/sh

while true; do
    cat /dev/vda | gzip -c > /dev/null
done
3. begin migration
4. immediately after the migration started _and the source VM status changed_,
block storage using iptables on the source

After some time, we'll see in the source logs:

MiscFileReadException: Internal file read failure: ('/dev/c35580cb-4f22-4272-b236-8d8155ac3111/metadata',)
Thread-27::INFO::2015-06-24 17:46:39,806::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain c35580cb-4f22-4272-b236-8d8155ac3111 became INVALID
Thread-850::DEBUG::2015-06-24 17:46:39,807::misc::777::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Emitting event
Thread-850::DEBUG::2015-06-24 17:46:39,808::misc::787::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`
Thread-850::DEBUG::2015-06-24 17:46:39,808::misc::797::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Event emitted
libvirtEventLoop::INFO::2015-06-24 17:46:39,810::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,811::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::INFO::2015-06-24 17:46:39,811::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,812::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:39,841::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 0 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:39,841::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
libvirtEventLoop::INFO::2015-06-24 17:46:39,844::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,844::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:39,845::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onSuspend
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
periodic.Executor-worker-4::WARNING::2015-06-24 17:46:39,882::sampling::473::sampling.StatsCache::(put) dropped stale old sample
periodic.Executor-worker-4::DEBUG::2015-06-24 17:46:39,882::executor::176::Executor::(_run) Worker was discarded
Thread-813::DEBUG::2015-06-24 17:46:40,120::migration::550::vm.Vm::(stop) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::stopping migration monitor thread
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,120::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 0 opaque None
Thread-813::DEBUG::2015-06-24 17:46:40,121::migration::446::vm.Vm::(stop) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::stopping migration downtime thread
libvirtEventLoop::INFO::2015-06-24 17:46:40,121::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
Thread-813::ERROR::2015-06-24 17:46:40,122::migration::202::vm.Vm::(_recover) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::operation failed: migration job: unexpectedly failed
Thread-813::DEBUG::2015-06-24 17:46:40,123::stompreactor::372::jsonrpc.AsyncoreClient::(send) Sending response
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,125::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 1 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:40,125::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
libvirtEventLoop::INFO::2015-06-24 17:46:40,125::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:40,126::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,126::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:40,127::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onSuspend
libvirtEventLoop::INFO::2015-06-24 17:46:40,144::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:40,145::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
Thread-28::DEBUG::2015-06-24 17:46:40,291::fileSD::262::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/192.168.1.20:_srv_virtstore_nfs_generic_iso/b0390d9e-2f1d-43c0-a56c-a20af032e934/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-28::DEBUG::2015-06-24 17:46:40,307::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n919 bytes (919 B) copied, 0.000684442 s, 1.3 MB/s\n'; <rc> = 0
Thread-813::ERROR::2015-06-24 17:46:43,136::migration::300::vm.Vm::(run) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 287, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 353, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 392, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/vm.py", line 206, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1766, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed


This confirms the flag worked at the libvirt level.

For 3.6.0 we require CentOS >= 7.1, so no code changes are needed. Hence, closing the bug.