Bug 1054121
| Summary: | VM migration succeeds though storage is blocked on src host. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Ilanit Stein <istein> |
| Component: | ovirt-engine | Assignee: | Francesco Romani <fromani> |
| Status: | CLOSED NOTABUG | QA Contact: | Ilanit Stein <istein> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.3.0 | CC: | fromani, gklein, istein, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, yeylon |
| Target Milestone: | --- | ||
| Target Release: | 3.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | virt | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-06-24 16:04:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Ilanit Stein
2014-01-16 09:51:08 UTC
Created attachment 850968 [details]
source host logs (time 2 hours behind rhevm)
Created attachment 850970 [details]
dest host logs (time 2 hours behind rhevm)
If you really did block the storage connection, the VM should have been paused due to EIO. If it was not, it most likely means you didn't really write to disk (hence the VM happily migrates), perhaps because of a cache; writing one line once a second can easily fit there. Can you retry with heavier disk I/O? Anyway, first please attach the correct/relevant vdsm logs from the source.

Created attachment 852484 [details]
CORRECT source host vdsm.log
If, as it seems, this is a libvirt/qemu issue, the best thing we can do in VDSM is to properly report to engine that the migration has failed. From the logs, it looks like VDSM mistakenly reports that the migration succeeded even after an abort request was issued.

I've retested the flow in the description, with iSCSI storage, on another storage server, with the same disk-writing script as mentioned in the description, but without the delay (so writes to disk occur more often). I got the same result: migration succeeded.

Created attachment 854362 [details]
More intensive disk write, engine.log
Created attachment 854364 [details]
More intensive disk write, source host (lilach-vdsb) logs.
Host time is 2 hours behind engine.
Created attachment 854366 [details]
More intensive disk write, destination host (lilach-vdsc) logs.
Host time is 2 hours behind engine.
Following up on comment #6:

VM name: MIG
Migration start time: 2014-Jan-23, 14:08

The VM had only these 2 events:
1. 2014-Jan-23, 14:10 VM MIG is down. Exit message: Lost connection with qemu process.
2. 2014-Jan-23, 14:08 Migration started (VM: MIG, Source: lilach-vdsb.tlv.redhat.com, Destination: lilach-vdsc.tlv.redhat.com, User: admin@internal).

At 14:42, the VM is seen in the UI in UP state, with an uptime of 40 minutes.

The writes are indeed more frequent, but they can still be caught by the buffer/FS cache. A more stressful I/O test would be something along these lines:

================
#!/bin/sh
while true; do
    dd if=/dev/zero of=/tmp/iotest.img bs=1k oflag=direct count=10240
done
================

Of course, /tmp/iotest.img can be replaced with any path. This will continuously create a 10 MiB zero-filled file, bypassing the buffer cache thanks to oflag=direct. The manpage of the dd command can be out of date; the `direct` flag is documented here: http://www.gnu.org/software/coreutils/manual/html_node/dd-invocation.html

Ilanit, did you have a chance to try the suggestion from comment #11 to reproduce this?

The core issue here seems to be that libvirt's support for ABORT_ON_ERROR in migrations is not working properly. Still investigating possible flaws/workarounds in VDSM.

This bug won't fit into the 3.5 release and is being deferred to a later release. If you deeply care about this bug and it deserves to be re-evaluated, please let me know.

(In reply to Michal Skrivanek from comment #12)
> Ilanit, did you have a chance to try the suggestion from comment #11 to
> reproduce this?

All info needed to reproduce is provided in the bug.

Ilanit, since the migration actually succeeded without any issues, it means there was no blocking I/O access. We only provided suggestions for how to make I/O happen, but it all depends on your local setup, cache sizes, actual workload on the VM, etc. Unless you make it stuck in I/O, it's not a good enough setup to test the RHEV behavior.

Running the shell script from comment #11, and repeating the flow in the description: migration failed (vdsm.log): "Timed out during operation: cannot acquire state change lock". The VM turned into paused state. After restoring the storage connection on the host, and a vdsmd restart, the VM came up and was running on the src host.

Full error in vdsm.log:

Thread-25::INFO::2014-09-08 13:20:30,004::migration::466::vm.Vm::(monitor_migration) vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Migration Progress: 745 seconds elapsed, 99% of data processed
Thread-23::ERROR::2014-09-08 13:20:30,121::migration::259::vm.Vm::(run) vmId=`974b439c-09d9-4229-a989-692c3f7ce8c6`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 245, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 324, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 633, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

Is this behaviour as expected?

Created attachment 935303 [details]
Comment #17 src host vdsm log

(In reply to Ilanit Stein from comment #17)
> Running shell script from comment #11, and repeating flow in description.
> Migration failed (vdsm.log): "Timed out during operation: cannot acquire
> state change lock". VM turned into pause state.
> After restoring storage connection in host, and vdsmd restart,
> the VM turned up, and was running on src host.
>
> Full error in vdsm.log:
> [traceback ending in TimeoutError: Timed out during operation: cannot
> acquire state change lock]
>
> Is this behaviour as expected?

This means libvirt can't access QEMU, most likely because the QEMU process is busy (or stuck) doing something else, most likely I/O. This may well happen if storage is blocked on the src host while QEMU is doing I/O, which in turn is exactly what we were looking for. I'll be more precise once I read the logs.

Many months later, with libvirt (tested u/s) 1.2.13 on CentOS 7.1, qemu 2.1.2, a much more stressful test, and some luck, it worked. Steps to trigger the issue (please note that timing IS critical):

1. Start the VM.
2. Run I/O-intensive tasks in said VM. I used:
   2.a: the write-intensive load script as per https://bugzilla.redhat.com/show_bug.cgi?id=1054121#c11
   2.b: a read-intensive load:

================
#!/bin/sh
while true; do
    cat /dev/vda | gzip -c > /dev/null
done
================

3. Begin migration.
4. Immediately after the migration has started _and the source VM status has changed_, block storage using iptables on the source.

After some time, we'll see in the source logs:

MiscFileReadException: Internal file read failure: ('/dev/c35580cb-4f22-4272-b236-8d8155ac3111/metadata',)
Thread-27::INFO::2015-06-24 17:46:39,806::monitor::273::Storage.Monitor::(_notifyStatusChanges) Domain c35580cb-4f22-4272-b236-8d8155ac3111 became INVALID
Thread-850::DEBUG::2015-06-24 17:46:39,807::misc::777::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Emitting event
Thread-850::DEBUG::2015-06-24 17:46:39,808::misc::787::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Calling registered method `contEIOVms`
Thread-850::DEBUG::2015-06-24 17:46:39,808::misc::797::Storage.Event.Storage.DomainMonitor.onDomainStateChange::(_emit) Event emitted
libvirtEventLoop::INFO::2015-06-24 17:46:39,810::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,811::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::INFO::2015-06-24 17:46:39,811::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,812::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:39,841::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 0 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:39,841::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
libvirtEventLoop::INFO::2015-06-24 17:46:39,844::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,844::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:39,845::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onSuspend
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:39,846::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
periodic.Executor-worker-4::WARNING::2015-06-24 17:46:39,882::sampling::473::sampling.StatsCache::(put) dropped stale old sample
periodic.Executor-worker-4::DEBUG::2015-06-24 17:46:39,882::executor::176::Executor::(_run) Worker was discarded
Thread-813::DEBUG::2015-06-24 17:46:40,120::migration::550::vm.Vm::(stop) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::stopping migration monitor thread
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,120::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 0 opaque None
Thread-813::DEBUG::2015-06-24 17:46:40,121::migration::446::vm.Vm::(stop) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::stopping migration downtime thread
libvirtEventLoop::INFO::2015-06-24 17:46:40,121::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
Thread-813::ERROR::2015-06-24 17:46:40,122::migration::202::vm.Vm::(_recover) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::operation failed: migration job: unexpectedly failed
Thread-813::DEBUG::2015-06-24 17:46:40,123::stompreactor::372::jsonrpc.AsyncoreClient::(send) Sending response
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,125::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Resumed detail 1 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:40,125::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU running: onResume
libvirtEventLoop::INFO::2015-06-24 17:46:40,125::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:40,126::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2015-06-24 17:46:40,126::vm::4336::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-06-24 17:46:40,127::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onSuspend
libvirtEventLoop::INFO::2015-06-24 17:46:40,144::vm::3539::vm.Vm::(_onIOError) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::abnormal vm stop device virtio-disk0 error
libvirtEventLoop::INFO::2015-06-24 17:46:40,145::vm::4819::vm.Vm::(_logGuestCpuStatus) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::CPU stopped: onIOError
Thread-28::DEBUG::2015-06-24 17:46:40,291::fileSD::262::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/192.168.1.20:_srv_virtstore_nfs_generic_iso/b0390d9e-2f1d-43c0-a56c-a20af032e934/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-28::DEBUG::2015-06-24 17:46:40,307::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n919 bytes (919 B) copied, 0.000684442 s, 1.3 MB/s\n'; <rc> = 0
Thread-813::ERROR::2015-06-24 17:46:43,136::migration::300::vm.Vm::(run) vmId=`ace61b23-f455-4fc7-8d59-d9c469657b2a`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 287, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 353, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 392, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/vm.py", line 206, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1766, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: operation failed: migration job: unexpectedly failed

This confirms the flag worked at the libvirt level. For 3.6.0 we require CentOS >= 7.1, so no code changes are needed. Hence, closing the bug.