Bug 1277255 - Migration fails to migrate.
Summary: Migration fails to migrate.
Keywords:
Status: CLOSED DUPLICATE of bug 1247237
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.1
Target Release: 3.6.1
Assignee: Dan Kenigsberg
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On: migration_improvements
Blocks:
 
Reported: 2015-11-02 19:59 UTC by Bill Sanford
Modified: 2016-02-10 19:23 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-05 19:18:17 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments
VDSM log for Windows XP. (9.88 MB, text/plain)
2015-11-03 10:44 UTC, Andrei Stepanov
test1: qemu log for hyper02 (48.72 KB, text/plain)
2015-11-03 11:41 UTC, Andrei Stepanov
test1: vdsm log for hyper02 (348.34 KB, text/plain)
2015-11-03 11:42 UTC, Andrei Stepanov
test1: libvirt log for hyper02 (827 bytes, text/plain)
2015-11-03 11:42 UTC, Andrei Stepanov
test1: qemu log for hyper01 (45.72 KB, text/plain)
2015-11-03 11:43 UTC, Andrei Stepanov
test1: vdsm log for hyper01 (775.89 KB, text/plain)
2015-11-03 11:44 UTC, Andrei Stepanov
test1: libvirt log for hyper01 (2.09 KB, text/plain)
2015-11-03 11:44 UTC, Andrei Stepanov
1/2 hosts of migration failure (458.54 KB, application/x-gzip)
2015-11-03 18:48 UTC, Bill Sanford
2/2 hosts of migration failure (458.54 KB, application/x-gzip)
2015-11-03 18:48 UTC, Bill Sanford
manual_migration_engine_log (167.62 KB, application/zip)
2015-11-04 11:47 UTC, Israel Pinto
manual_migration_host_1_log (1.27 MB, application/zip)
2015-11-04 11:48 UTC, Israel Pinto
manual_migration_host_2_log (1.30 MB, application/zip)
2015-11-04 11:49 UTC, Israel Pinto
memory_hotplug_engine_log (185.93 KB, application/zip)
2015-11-04 11:51 UTC, Israel Pinto
memory_hotplug_host_1_log (2.03 MB, application/zip)
2015-11-04 11:52 UTC, Israel Pinto
memory_hotplug_host_2_log (1.68 MB, application/zip)
2015-11-04 11:52 UTC, Israel Pinto
with_spice_video_engine_log (160.02 KB, application/zip)
2015-11-04 11:53 UTC, Israel Pinto
with_spice_video_host_1_log (611.79 KB, application/zip)
2015-11-04 11:54 UTC, Israel Pinto
with_spice_video_host_2_log (1.30 MB, application/zip)
2015-11-04 11:54 UTC, Israel Pinto

Description Bill Sanford 2015-11-02 19:59:22 UTC
Description of problem:
Migration fails to migrate. The RHEV-M install was working and ready to migrate.

Version-Release number of selected component (if applicable):

RHEL-6.7-20150811.1 & RHEV-M 3.6.0-17 - Engine
RHEL-7.2-20151027.3 - Client, Host
Windows 7 x86_64 SP1 - Guest

virt-viewer-2.0-7.el6.x86_64
Virt-viewer Method: Firefox-*.vv file

How reproducible:
80%

Steps to Reproduce:
1.
2.
3.

Actual results:
Migration fails

Expected results:
Migration works

Additional info:

I ran tail -f on vdsm.log; the output is at:

http://pastebin.test.redhat.com/324332

The only failure I saw was:

Thread-102474::ERROR::2015-11-02 11:43:14,739::migration::310::virt.vm::(run) vmId=`96e3c835-7fbb-4d36-8150-4c399aaff1a3`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: operation aborted: migration job: canceled by client
mailbox.SPMMonitor::DEBUG::2015-11-02 11:43:15,458::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/00000001-0001-0001-0001-000000000085/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
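
A minimal sketch for picking such failures out of a vdsm.log, in case it helps anyone comparing logs. The regular expression is inferred only from the single ERROR line quoted above, so it is an assumption about the log format rather than a documented one; the function name find_migration_failures is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the one ERROR line quoted in this report; other vdsm
# versions may format their log lines differently.
MIGRATION_FAILURE = re.compile(
    r"^(?P<thread>[^:]+)::ERROR::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"migration::\d+::virt\.vm::\(run\) "
    r"vmId=`(?P<vm_id>[0-9a-f-]+)`::Failed to migrate"
)


def find_migration_failures(lines):
    """Yield (timestamp, thread, vm_id) for each 'Failed to migrate' line."""
    for line in lines:
        match = MIGRATION_FAILURE.match(line)
        if match:
            when = datetime.strptime(
                match.group("timestamp"), "%Y-%m-%d %H:%M:%S,%f"
            )
            yield when, match.group("thread"), match.group("vm_id")


sample = [
    "Thread-102474::ERROR::2015-11-02 11:43:14,739::migration::310::virt.vm::"
    "(run) vmId=`96e3c835-7fbb-4d36-8150-4c399aaff1a3`::Failed to migrate",
    "some unrelated log line",
]
for when, thread, vm_id in find_migration_failures(sample):
    print(when.isoformat(), thread, vm_id)
```

Feeding it an open file object instead of the sample list works the same way, since it only iterates over lines.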

Comment 2 Bill Sanford 2015-11-02 20:08:11 UTC
During the migration, I was playing a video from YouTube, using either HTML5 or the Flash player, with the same results.

Comment 4 Tomas Jamrisko 2015-11-03 09:33:00 UTC
Bill, can you take a look at this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1251541 and maybe consider closing this as a duplicate of it? 

Besides, vdsm sets a rather restrictive bandwidth limit for migration, so could you make sure that the [vars] section in /etc/vdsm/vdsm.conf on -both- hosts looks like this:

[vars]
ssl = true
migration_max_bandwidth = 0

and try again? Migration works perfectly fine in our setup.

Both bugs would then be duplicates of https://bugzilla.redhat.com/show_bug.cgi?id=1247237
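
A small sketch for checking that setting on a host, assuming vdsm.conf can be read as a plain INI file with Python's configparser. The helper name migration_bandwidth_setting is hypothetical, and the sample text is just the snippet from this comment.

```python
import configparser


def migration_bandwidth_setting(conf_text):
    """Return migration_max_bandwidth from vdsm.conf text, or None if unset."""
    # interpolation=None so that a literal '%' elsewhere in the file is harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # fallback also covers the case where the [vars] section is missing.
    return parser.get("vars", "migration_max_bandwidth", fallback=None)


sample = """\
[vars]
ssl = true
migration_max_bandwidth = 0
"""
print(migration_bandwidth_setting(sample))
```

On a real host the text would come from /etc/vdsm/vdsm.conf, and the check would need to be repeated on both hosts.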

Comment 5 Andrei Stepanov 2015-11-03 10:42:28 UTC
I can confirm that I frequently see a very similar backtrace.

It is not always reproducible; it happens in about 1 of 10 attempts.

In my case the VM got killed.

Thread-172314::ERROR::2015-11-03 10:32:09,424::migration::310::virt.vm::(run) vmId=`5fc3ede4-7bc7-4bca-a657-4ccb537e2e03`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '5fc3ede4-7bc7-4bca-a657-4ccb537e2e03' (WIN-XP)

Comment 6 Andrei Stepanov 2015-11-03 10:44:16 UTC
Created attachment 1088912 [details]
VDSM log for Windows XP.

Comment 8 Andrei Stepanov 2015-11-03 11:41:14 UTC
Created attachment 1088931 [details]
test1: qemu log for hyper02

Comment 9 Andrei Stepanov 2015-11-03 11:42:06 UTC
Created attachment 1088932 [details]
test1: vdsm log for hyper02

Comment 10 Andrei Stepanov 2015-11-03 11:42:39 UTC
Created attachment 1088933 [details]
test1: libvirt log for hyper02

Comment 11 Andrei Stepanov 2015-11-03 11:43:15 UTC
Created attachment 1088934 [details]
test1: qemu log for hyper01

Comment 12 Andrei Stepanov 2015-11-03 11:44:08 UTC
Created attachment 1088935 [details]
test1: vdsm log for hyper01

Comment 13 Andrei Stepanov 2015-11-03 11:44:44 UTC
Created attachment 1088936 [details]
test1: libvirt log for hyper01

Comment 14 Yaniv Kaul 2015-11-03 11:51:25 UTC
Looking at the libvirt logs, could it be related to bug 1261430?
(deducing from the log line: "Unsupported migration cookie feature memory-hotplug" )

Comment 15 Andrei Stepanov 2015-11-03 12:28:58 UTC
I suspect that my case is related to

https://bugzilla.redhat.com/show_bug.cgi?id=1277471

To bring the VM down, it is enough to migrate it a few times with a smartcard attached.

Comment 16 Bill Sanford 2015-11-03 18:48:08 UTC
Created attachment 1089167 [details]
1/2 hosts of migration failure

Comment 17 Bill Sanford 2015-11-03 18:48:57 UTC
Created attachment 1089168 [details]
2/2 hosts of migration failure

Comment 18 Israel Pinto 2015-11-04 11:47:05 UTC
I managed to reproduce this bug; see the summary of what I tested below.
I think the failures relate to the memory size of the VM, since we run with a network card speed of 1000 Mbps.
Maybe if we test with a 10G card we will not see this problem.

1. VM: Fedora 22 with 10GB RAM
Case (run each case 5 times)
Manual migration (automatic host) - 2 failures of 5 runs
Migration with Spice opened with video running - 3 failures of 5 runs

2. VM: Fedora 22 with 7GB RAM
Case (run each case 3 times)
Manual migration (automatic host) - Pass
Manual migration (choose host) - Pass
Maintenance - Pass
Maintenance host SPM - Pass
Migration after memory hot plug - 1 failure of 3 runs
Note: this probably happens when we enlarge the memory
Migration with Spice opened with running video - 2 failures of 3 runs

Set up (nested hosts):
Engine:  3.6.0.2-.0.1
Hosts:
OS Version: RHEL - 7.2 - 9.el7
Kernel Version: 3.10.0 - 327.el7.x86_64
KVM Version:2.3.0 - 31.el7
LIBVIRT Version: libvirt-1.2.17-13.el7
VDSM Version: vdsm-4.17.10-5.el7ev
SPICE Version: 0.12.4 - 15.el7

attaching logs
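
For a rough sense of scale behind the 1000 vs. 10G guess above, here is a back-of-the-envelope sketch of how long one full pass over guest RAM takes at line rate. It assumes the "GB" figures mean GiB, that "1000" means 1000 Mbit/s, and that the link is used perfectly with no protocol overhead and no pages being dirtied meanwhile; none of that is measured from the attached logs.

```python
def single_pass_seconds(ram_gib, link_mbit_per_s):
    """Ideal time to copy all guest RAM once over the given link."""
    ram_bits = ram_gib * 1024**3 * 8
    return ram_bits / (link_mbit_per_s * 1_000_000)


for ram_gib in (10, 7):
    for link in (1000, 10000):
        seconds = single_pass_seconds(ram_gib, link)
        print(f"{ram_gib} GiB over {link} Mbit/s: {seconds:.1f} s")
```

Under these assumptions a 10 GiB guest needs about 86 s for a single pass on the 1000 Mbit/s link and about a tenth of that on a 10G link, so a busy guest has far more time to re-dirty pages on the slower link.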

Comment 19 Israel Pinto 2015-11-04 11:47:55 UTC
Created attachment 1089592 [details]
manual_migration_engine_log

Comment 20 Israel Pinto 2015-11-04 11:48:44 UTC
Created attachment 1089593 [details]
manual_migration_host_1_log

Comment 21 Israel Pinto 2015-11-04 11:49:51 UTC
Created attachment 1089594 [details]
manual_migration_host_2_log

Comment 22 Israel Pinto 2015-11-04 11:51:20 UTC
Created attachment 1089595 [details]
memory_hotplug_engine_log

Comment 23 Israel Pinto 2015-11-04 11:52:09 UTC
Created attachment 1089596 [details]
memory_hotplug_host_1_log

Comment 24 Israel Pinto 2015-11-04 11:52:48 UTC
Created attachment 1089597 [details]
memory_hotplug_host_2_log

Comment 25 Israel Pinto 2015-11-04 11:53:30 UTC
Created attachment 1089598 [details]
with_spice_video_engine_log

Comment 26 Israel Pinto 2015-11-04 11:54:12 UTC
Created attachment 1089599 [details]
with_spice_video_host_1_log

Comment 27 Israel Pinto 2015-11-04 11:54:45 UTC
Created attachment 1089600 [details]
with_spice_video_host_2_log

Comment 28 Bill Sanford 2015-11-04 13:30:12 UTC
A few seconds after the migration reaches 99% (as shown in the RHEV-M view of the VM), the sound becomes very choppy and cuts out about once a second, while the video itself seems fine. Once the video I was streaming finishes, the migration may complete, since the guest is then under very little load. Until the migration either fails or succeeds, the sound degradation is very pronounced and quite unacceptable.

Comment 29 Tomas Jelinek 2015-11-05 10:18:08 UTC
(In reply to Israel Pinto from comment #18)
> I managed to reproduce this bug; see the summary of what I tested below.
> I think the failures relate to the memory size of the VM, since we run with
> a network card speed of 1000 Mbps.
> Maybe if we test with a 10G card we will not see this problem.
> 
> 1. VM: Fedora 22 with 10GB RAM
> Case (run each case 5 times)
> Manual migration (automatic host) - 2 failures of 5 runs
> Migration with Spice opened with video running - 3 failures of 5 runs
> 
> 2. VM: Fedora 22 with 7GB RAM
> Case (run each case 3 times)
> Manual migration (automatic host) - Pass
> Manual migration (choose host) - Pass
> Maintenance - Pass
> Maintenance host SPM - Pass
> Migration after memory hot plug - 1 failure of 3 runs
> Note: this probably happens when we enlarge the memory
> Migration with Spice opened with running video - 2 failures of 3 runs
> 
> Set up (nested hosts):
> Engine:  3.6.0.2-.0.1
> Hosts:
> OS Version: RHEL - 7.2 - 9.el7
> Kernel Version: 3.10.0 - 327.el7.x86_64
> KVM Version:2.3.0 - 31.el7
> LIBVIRT Version: libvirt-1.2.17-13.el7
> VDSM Version: vdsm-4.17.10-5.el7ev
> SPICE Version: 0.12.4 - 15.el7
> 
> attaching logs

All failed migrations seem to be caused by a general convergence problem (i.e. the guest was able to dirty memory faster than it could be migrated). Then the migration_progress_timeout was reached and the migration was cancelled.

There are some ways to improve convergence by tweaking vdsm.conf:
- increase migration_progress_timeout - vdsm will keep trying to migrate for longer before giving up
- increase migration_max_bandwidth - the maximum bandwidth used by migration, which is 32 MiBps by default
- increase migration_downtime - how long the VM may be paused at the end of the migration. With a high value there is a better chance the VM will get migrated, at the cost of the VM being paused for some time

Please note there is a BZ targeted to oVirt 4.0 which should make serious changes in migrations enhancing the convergence: https://bugzilla.redhat.com/show_bug.cgi?id=1252426
and a feature page: www.ovirt.org/Features/Migration_Enhancements
Some patches proposed: https://gerrit.ovirt.org/#/q/topic:migration-enhancements
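
To illustrate the convergence problem described above, here is a toy model of iterative pre-copy. It is only a sketch and not what vdsm, libvirt or qemu actually implement: the function simulate_precopy, the max_rounds cut-off (a crude stand-in for giving up after a timeout) and the dirty rate of 40 MiB/s are all made up for illustration. Only the 32 bandwidth figure comes from this comment.

```python
def simulate_precopy(ram_mib, bandwidth_mib_s, dirty_mib_s, max_downtime_s,
                     max_rounds=30):
    """Return (converged, rounds, elapsed_seconds) for a toy pre-copy model."""
    remaining = float(ram_mib)
    elapsed = 0.0
    for round_no in range(1, max_rounds + 1):
        copy_time = remaining / bandwidth_mib_s
        # If what is left fits into the allowed pause, stop the guest and finish.
        if copy_time <= max_downtime_s:
            return True, round_no, elapsed + copy_time
        elapsed += copy_time
        # Pages dirtied while copying must be sent again in the next round;
        # the guest cannot dirty more memory than it has.
        remaining = min(float(ram_mib), dirty_mib_s * copy_time)
    return False, max_rounds, elapsed


# Guest dirties memory faster than the link can carry it: never converges.
print(simulate_precopy(10240, 32, 40, 0.5))
# More bandwidth than dirty rate: the remainder shrinks every round.
print(simulate_precopy(10240, 100, 40, 0.5))
# Same slow link, but a very generous allowed pause: finishes immediately.
print(simulate_precopy(10240, 32, 40, 400))
```

In this model the remainder shrinks by the ratio of dirty rate to bandwidth each round, so raising the bandwidth or the allowed downtime helps, while a longer timeout alone does not help a guest that dirties memory faster than the link can carry it.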

Comment 30 Michal Skrivanek 2015-11-05 15:46:39 UTC
(In reply to Bill Sanford from comment #28)
> Just a few seconds after the migration hits 99% (As seen in RHEV-M view of
> the VM), the sound gets really choppy and just about every second it cuts
> out, but the video portion seems fine. After the video I was streaming is
> done, then it might migrate after the bulk of the streaming is done and has
> very little load for the migration to deal with. Until the migration fails
> or succeeds, the degradation of the sound is very pronounced and quite
> unacceptable.

Bill, can you please check whether the setting from comment #4 helps? If so, your case is indeed bug 1247237, as noted by Tomas.

Comment 31 Michal Skrivanek 2015-11-05 15:50:00 UTC
The third issue mixed up in this bug is Andrei's... :)
Andrei, can you please check the relevance of the smart card? I.e. do you see crashed VMs without it?
Please make sure you filter out convergence issues as noted by tjamrisk and tjelinek in earlier comments.

Based on that we can see whether your issues are general convergence, smartcard migration, or something new worth tracking here in this bug.
Thanks!

Comment 32 Andrei Stepanov 2015-11-05 16:11:30 UTC
For me, migration fails (kvm gets killed) only for a client with an attached smartcard.

I have opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1277471
It has backtraces for kvm.

The VM can crash even without migration.
Migration just makes it crash sooner.

Comment 33 Michal Skrivanek 2015-11-05 16:13:14 UTC
restoring needinfo from comment #30

Comment 34 Bill Sanford 2015-11-05 19:18:17 UTC
Michal, I verified that the change from comment 4 cleared up the migration issue.

*** This bug has been marked as a duplicate of bug 1247237 ***

