Description of problem:
VM migration fails. The RHEV-M install was working and ready to migrate.

Version-Release number of selected component (if applicable):
RHEL-6.7-20150811.1 & RHEV-M 3.6.0-17 - Engine
RHEL-7.2-20151027.3 - Client, Host
Windows 7 x86_64 SP1 - Guest
virt-viewer-2.0-7.el6.x86_64 - Virt-viewer
Method: Firefox - *.vv file

How reproducible:
80%

Steps to Reproduce:
1.
2.
3.

Actual results:
Migration fails

Expected results:
Migration works

Additional info:
I did a tail -f on vdsm.log; the output is: http://pastebin.test.redhat.com/324332

The only failure I saw was:

Thread-102474::ERROR::2015-11-02 11:43:14,739::migration::310::virt.vm::(run) vmId=`96e3c835-7fbb-4d36-8150-4c399aaff1a3`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: operation aborted: migration job: canceled by client
mailbox.SPMMonitor::DEBUG::2015-11-02 11:43:15,458::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/00000001-0001-0001-0001-000000000085/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
During the migration, I was playing a video from YouTube, using either the HTML5 or the Flash player, with the same results.
Bill, can you take a look at this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1251541 and maybe consider closing this one as a duplicate of it?

Besides, vdsm sets a rather restrictive bandwidth limit for migration, so could you make sure that the [vars] section in /etc/vdsm/vdsm.conf on -both- hosts looks like this:

[vars]
ssl = true
migration_max_bandwidth = 0

and try again? Migration works perfectly fine in our setup.

And both bugs would be duplicates of this one: https://bugzilla.redhat.com/show_bug.cgi?id=1247237
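One more note, in case you change the file: as far as I know vdsm only reads vdsm.conf when it starts, so after editing the [vars] section you also need to restart the vdsmd service on both hosts (systemctl restart vdsmd on RHEL 7) for the new bandwidth setting to take effect.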
I can confirm that I frequently see a very similar backtrace. It is not always reproducible; it happens in about 1 run out of 10. In my case the VM got killed.

Thread-172314::ERROR::2015-11-03 10:32:09,424::migration::310::virt.vm::(run) vmId=`5fc3ede4-7bc7-4bca-a657-4ccb537e2e03`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 364, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 403, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '5fc3ede4-7bc7-4bca-a657-4ccb537e2e03' (WIN-XP)
Created attachment 1088912 [details] VDSM log for Windows XP.
Created attachment 1088931 [details] test1: qemu log for hyper02
Created attachment 1088932 [details] test1: vdsm log for hyper02
Created attachment 1088933 [details] test1: libvirt log for hyper02
Created attachment 1088934 [details] test1: qemu log for hyper01
Created attachment 1088935 [details] test1: vdsm log for hyper01
Created attachment 1088936 [details] test1: libvirt log for hyper01
Looking at the libvirt logs, could it be related to bug 1261430? (Deducing from the log line: "Unsupported migration cookie feature memory-hotplug".)
I suspect that my case is related to https://bugzilla.redhat.com/show_bug.cgi?id=1277471
All that is needed to bring the VM down is to migrate it a few times with a smartcard attached.
Created attachment 1089167 [details] 1/2 hosts of migration failure
Created attachment 1089168 [details] 2/2 hosts of migration failure
I managed to reproduce this bug; see below for a summary of what I tested.
I think the failures relate to the memory size of the VM, since we run with a network card speed of 1000 Mbps. Maybe if we test with a 10G card we will not see this problem (a rough calculation is at the end of this comment).

1. VM: Fedora 22 with 10GB RAM (each case run 5 times)
   Manual migration (automatic host) - 2 failures out of 5 runs
   Migration with SPICE open and a video running - 3 failures out of 5 runs

2. VM: Fedora 22 with 7GB RAM (each case run 3 times)
   Manual migration (automatic host) - Pass
   Manual migration (chosen host) - Pass
   Maintenance - Pass
   Maintenance of SPM host - Pass
   Migration after memory hot plug - 1 failure out of 3 runs
   Note: it probably happens when we enlarge the memory
   Migration with SPICE open and a video running - 2 failures out of 3 runs

Set up (nested hosts):
Engine: 3.6.0.2-.0.1
Hosts:
  OS Version: RHEL - 7.2 - 9.el7
  Kernel Version: 3.10.0 - 327.el7.x86_64
  KVM Version: 2.3.0 - 31.el7
  LIBVIRT Version: libvirt-1.2.17-13.el7
  VDSM Version: vdsm-4.17.10-5.el7ev
  SPICE Version: 0.12.4 - 15.el7

Attaching logs.
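As a rough back-of-the-envelope illustration of the point above (assuming an otherwise idle 1 Gbps link, which is my assumption, not something measured here): 1 Gbps is roughly 125 MB/s, so just copying 10 GB of guest RAM takes at least about 80 seconds, before counting any pages the guest dirties again while the copy is running. A guest that keeps writing memory (for example, playing a video) can dirty pages faster than they are transferred, so the migration never converges and is eventually cancelled, and vdsm's own migration bandwidth limit (see comment 4) makes the effective rate even lower.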
Created attachment 1089592 [details] manual_migration_engine_log
Created attachment 1089593 [details] manual_migration_host_1_log
Created attachment 1089594 [details] manual_migration_host_2_log
Created attachment 1089595 [details] memory_hotplug_engine_log
Created attachment 1089596 [details] memory_hotplug_host_1_log
Created attachment 1089597 [details] memory_hotplug_host_2_log
Created attachment 1089598 [details] with_spice_video_engine_log
Created attachment 1089599 [details] with_spice_video_host_1_log
Created attachment 1089600 [details] with_spice_video_host_2_log
Just a few seconds after the migration hits 99% (as seen in the RHEV-M view of the VM), the sound gets really choppy and cuts out about every second, but the video portion seems fine. Once the video I was streaming finishes, the migration may then complete, since there is very little load left for it to deal with. Until the migration fails or succeeds, the degradation of the sound is very pronounced and quite unacceptable.
(In reply to Israel Pinto from comment #18)
> I managed to reproduce this bug; see below for a summary of what I tested.
> I think the failures relate to the memory size of the VM, since we run with
> a network card speed of 1000 Mbps. Maybe if we test with a 10G card we will
> not see this problem.
> 
> 1. VM: Fedora 22 with 10GB RAM (each case run 5 times)
>    Manual migration (automatic host) - 2 failures out of 5 runs
>    Migration with SPICE open and a video running - 3 failures out of 5 runs
> 
> 2. VM: Fedora 22 with 7GB RAM (each case run 3 times)
>    Manual migration (automatic host) - Pass
>    Manual migration (chosen host) - Pass
>    Maintenance - Pass
>    Maintenance of SPM host - Pass
>    Migration after memory hot plug - 1 failure out of 3 runs
>    Note: it probably happens when we enlarge the memory
>    Migration with SPICE open and a video running - 2 failures out of 3 runs
> 
> Set up (nested hosts):
> Engine: 3.6.0.2-.0.1
> Hosts:
>   OS Version: RHEL - 7.2 - 9.el7
>   Kernel Version: 3.10.0 - 327.el7.x86_64
>   KVM Version: 2.3.0 - 31.el7
>   LIBVIRT Version: libvirt-1.2.17-13.el7
>   VDSM Version: vdsm-4.17.10-5.el7ev
>   SPICE Version: 0.12.4 - 15.el7
> 
> Attaching logs.

All the failed migrations seem to be caused by a general convergence problem (e.g. the guest was able to dirty memory faster than it could be migrated). The migration_progress_timeout was then reached and the migration was cancelled.

There are a few ways to improve convergence by tweaking vdsm.conf:
- enlarge migration_progress_timeout - vdsm will keep trying to migrate for a longer time before giving up
- enlarge migration_max_bandwidth - the maximum bandwidth used by migration, which is by default 32 mbps
- enlarge migration_downtime - how long the VM may be paused at the end of the migration. If you set a high number, there is a bigger chance the VM will get migrated, at the cost of having the VM paused for some time.

Please note there is a BZ targeted at oVirt 4.0 which should make serious changes to migration and improve convergence:
https://bugzilla.redhat.com/show_bug.cgi?id=1252426
and a feature page: www.ovirt.org/Features/Migration_Enhancements
Some patches have already been proposed: https://gerrit.ovirt.org/#/q/topic:migration-enhancements
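For reference, a minimal sketch of what the [vars] section in /etc/vdsm/vdsm.conf on both hosts could look like with these knobs raised. The option names are the ones listed above; the numeric values are only illustrative examples, not tested recommendations:

[vars]
ssl = true
# value suggested in comment 4 to lift the restrictive vdsm bandwidth limit
migration_max_bandwidth = 0
# illustrative value: give the migration more time before vdsm gives up
migration_progress_timeout = 300
# illustrative value: allow a longer pause at the end of the migration
migration_downtime = 1000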
(In reply to Bill Sanford from comment #28)
> Just a few seconds after the migration hits 99% (as seen in the RHEV-M view
> of the VM), the sound gets really choppy and cuts out about every second,
> but the video portion seems fine. Once the video I was streaming finishes,
> the migration may then complete, since there is very little load left for
> it to deal with. Until the migration fails or succeeds, the degradation of
> the sound is very pronounced and quite unacceptable.

Bill, can you please check comment #4? If that resolves the issue, then your case is indeed bug 1247237, as noted by Tomas.
Now the third issue mixed up in this bug is from Andrei... :)
So Andrei, can you please check the relevance of the smartcard? I.e., do you see crashed VMs without it? Please make sure you filter out convergence issues, as noted by tjamrisk and tjelinek in earlier comments.
Based on that we can see whether your issue is general convergence, smartcard migration, or something new worth tracking here in this bug. Thanks!
For me, migration fails (kvm gets killed) only for a client with an attached smartcard. I have opened a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1277471
It has backtraces for kvm. The VM can crash even without migration; migration just helps to crash the VM more quickly.
restoring needinfo from comment #30
Michal, I verified that the change from comment 4 did clear up the migration issue.

*** This bug has been marked as a duplicate of bug 1247237 ***