Description of problem:

Attempting to live-migrate two xen guests at the same time results in one of the migrated guests ending up in a paused state. The simplest scenario assumes we have guest "g1" on host "h1" and guest "g2" on host "h2". Attempting to migrate g1 -> h2 and g2 -> h1 will result in either g1 or g2 being paused.

Version-Release number of selected component (if applicable):
xen-3.0.3-94.el5-x86_64

How reproducible:
Every time.

Steps to Reproduce:

Start with a four-node cluster consisting of RHEL 5.4 (x86_64) nodes: host01, host02, host03, host04. The cluster contains three virtual machines: virt01, virt02, and virt03. These VMs can be migrated to any node in the cluster as long as only one is migrated at a time.

Before attempting a simultaneous live migration of two VMs, the cluster was in this state: virt01 was running on host02, virt02 was running on host04, and virt03 was running on host01.

Attempt to live migrate both virt01 and virt02 to node host03. On host04, run the following command:

  clusvcadm -M vm:virt01 -m host03i

and simultaneously on host03, run:

  clusvcadm -M vm:virt02 -m host01i

Actual results:

The command on host03 returned first with success. I had a session established with virt02, and this session continued unaffected. Clustat correctly reported that the VM was now running on host03.

A few seconds later the command on host04 failed and I lost my session with virt01. VM virt01 has now entered the 'funny' state. Clustat reports that the VM is now running on host03, and `virsh list` on host03 shows that the VM has "no state":

  [root@host03 xen]# virsh list
   Id Name                 State
  ---------------------------------
    0 Domain-0             running
    7 virt01               no state
    8 virt02               idle

I cannot ping or establish a console session with virt01 on host03:

  [root@host03 xen]# virsh console virt01
  No console available for domain

On host02 (where virt01 was originally running), I have a left-over "migrating-virt01" domain:

  [root@host02 xen]# virsh list
   Id Name                 State
  ---------------------------------
    0 Domain-0             running
    3 migrating-virt01     idle

Expected results:
Both guests should migrate with no problems.

Additional info:

The reproduction steps and results above are from one of our customers. I am attempting to reproduce this without Cluster Suite. This looks very similar to Bug 519401 and Bug 512300 (except that this is not a local migration).
This patch was recently posted to upstream xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html

and was committed to xen-unstable.hg as c/s 19990. It might be useful as a starting point here.

Chris Lalancette
"fix cross migrate problem" sent on Sep 3 by James Song <jsong> (not in archives as they end on August) may also be relevant. Although sock.close() shouldn't really be removed and I would suspect sock.shutdown(2) would do the same as sock.shutdown() but I haven't checked it. Actually c/s 20157 http://xenbits.xensource.com/xen-unstable.hg?rev/b4b79f3e3118 looks like the correct version of that patch although with a different comment which makes it a little bit less relevant to this BZ.
The customer has reported that, as described in http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html, the migration completes once the running domain is shut down. They believe the problem they are experiencing is the same as the one described in that message. Based on that information, I've attempted to backport the patch referenced in Comment #1 to xen-3.0.3-94.el5 and built test packages, which have been provided to the customer. If the tests are successful, I will post the patch here (it will probably need some additional work before it can be added to RHEL).
There appear to be two migration cases that can produce this failure in xen-3.0.3-94.el5:

1) Simultaneous migration of two guests to the same host:

   Host:   host001      host002      host003
   ------------------------------------------
   Guest:  guest01 -----------------> guest01
                        guest02 ----> guest02

2) Simultaneous swapping of guests between two hosts:

   Host:   host001      host002
   -----------------------------
   Guest:  guest01 ---> guest01
           guest02 <--- guest02

Test packages that I built with an attempted backport of the patch referenced in Comment #1 seem to have resolved case #1, but case #2 still results in the failure described above.
Created attachment 364007 [details]
Possible patch

Attempted backport of the patch referenced in Comment #1.
Ah, I was finally able to reproduce this issue. It looks like a problem between xen and libvirt. Using xm migrate, I can cross-migrate without any problems. Using virsh, however, is completely different, and I get a paused guest...
It also works even with virsh for PV guests without PVFB, that is, for guests that don't need qemu-dm. However, xm migrate works for me in all cases (PV, PV+PVFB, FV).
Bryan, could you check that after swapping guests between two hosts (case 2 in comment #5) you can unpause the paused guest by shutting down the other guest? In other words, after the swap the situation could look like this:

  host001             host002
  guest02 (paused)    guest01 (running)

Then, after shutting down guest01, the second guest (guest02) should be unpaused.
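That is, something like the following (a sketch; host and guest names as in comment #5):

  [root@host002]# xm shutdown guest01
  [root@host001]# xm list guest02   # the 'p' (paused) state flag should clear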
Ah, interesting, I was able to reproduce this with xm migrate, too. So it was just very bad luck that day I was trying this.
*** Bug 517236 has been marked as a duplicate of this bug. ***
Created attachment 366121 [details]
Set FD_CLOEXEC flag on all file descriptors

An attempt to fix this differently.
Created attachment 366239 [details]
Set FD_CLOEXEC flag on all file descriptors

Missing import added.
Created attachment 366241 [details]
Set FD_CLOEXEC flag on all file descriptors

And a really fixed version now...
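For anyone following along: the idea behind these attachments is to mark xend's file descriptors close-on-exec so that children it forks (such as qemu-dm) don't inherit and hold open the migration sockets. A minimal sketch of the technique (not the attached patch itself):

  import fcntl

  def set_cloexec(fd):
      """Mark an existing fd close-on-exec, preserving its other flags."""
      flags = fcntl.fcntl(fd, fcntl.F_GETFD)
      fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

  # With FD_CLOEXEC set, the descriptor is closed automatically across
  # fork()+exec*(), so a freshly spawned qemu-dm can no longer keep a
  # half-finished migration's socket open.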
Event posted on 2009-10-27 13:10 CET by jturro

Customer tested the packages from comment #15 in the "swapping" case (scenario 2 from comment #5) and they report:

  The new Xen packages seem to have resolved the 'swapping' VM issue.

pep

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by jturro
issue 315699
Great, thanks for the testing.
Created attachment 367311 [details]
Set FD_CLOEXEC flag when opening file descriptors
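The difference from the earlier attachments is setting the flag at the point where each descriptor is opened, rather than sweeping over already-open descriptors afterwards, which narrows the window in which a fork() can inherit an unflagged fd. Again a sketch of the approach (hypothetical helper, not the patch itself):

  import fcntl
  import os

  def open_cloexec(path, flags, mode=0o644):
      """Open a file and immediately mark it close-on-exec."""
      fd = os.open(path, flags, mode)
      fcntl.fcntl(fd, fcntl.F_SETFD,
                  fcntl.fcntl(fd, fcntl.F_GETFD) | fcntl.FD_CLOEXEC)
      return fd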
*** Bug 519401 has been marked as a duplicate of this bug. ***
You don't really need to bother with cluster stuff to reproduce the bug. You can do so with just two simple xm migrate commands (see below), which is what the cluster ends up doing anyway.
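For example, for the swapping case (host and guest names as in comment #5), run these at the same time:

  [root@host001]# xm migrate --live guest01 host002
  [root@host002]# xm migrate --live guest02 host001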
Verified on xen-3.0.3-103.el5 + kernel-xen-2.6.18-164.el5.

I tested it in the two scenarios mentioned in comment 5, and after migration both domains are running.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).