Bug 522850 - Unable to live-migrate two domUs at the same time
Summary: Unable to live-migrate two domUs at the same time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 517236 519401
Depends On:
Blocks: 466197 499522
 
Reported: 2009-09-11 18:55 UTC by Bryan Mason
Modified: 2018-10-27 14:59 UTC
CC List: 9 users

Fixed In Version: xen-3.0.3-100.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 08:58:14 UTC
Target Upstream Version:
Embargoed:


Attachments
Possible patch (5.10 KB, patch) - 2009-10-07 17:12 UTC, Bryan Mason
Set FD_CLOEXEC flag on all file descriptors (6.89 KB, patch) - 2009-10-26 16:23 UTC, Jiri Denemark
Set FD_CLOEXEC flag on all file descriptors (7.07 KB, patch) - 2009-10-27 11:29 UTC, Jiri Denemark
Set FD_CLOEXEC flag on all file descriptors (7.06 KB, patch) - 2009-10-27 11:31 UTC, Jiri Denemark
Set FD_CLOEXEC flag when opening file descriptors (9.32 KB, patch) - 2009-11-03 15:18 UTC, Jiri Denemark


Links
Red Hat Product Errata RHBA-2010:0294 (SHIPPED_LIVE): xen bug fix and enhancement update - last updated 2010-03-29 14:20:32 UTC

Description Bryan Mason 2009-09-11 18:55:51 UTC
Description of problem:

    Attempting to live-migrate two xen guests at the same time results
    in one of the migrated guests in a paused state.

    The simplest scenario assumes we have guest "g1" on host "h1" and
    guest "g2" on host "h2."  Attempting to migrate g1 -> h2 and g2 ->
    h1 will result in either g1 or g2 being paused.

Version-Release number of selected component (if applicable):

    xen-3.0.3-94.el5-x86_64

How reproducible:

   Every time.

Steps to Reproduce:

    Start with a four node cluster consisting of RHEL 5.4 (x86_64)
    nodes: host01, host02, host03, host04

    The cluster contains three virtual machines: virt01, virt02, and
    virt03.  These VMs can be migrated to any node in the cluster as
    long as only one is migrated at a time.

    Before attempting a simultaneous live migration of two VMs, the
    cluster was in this state:

    virt01 was running on host02.
    virt02 was running on host04.
    virt03 was running on host01.

    Attempt to live migrate both virt01 and virt02 to node host03. On
    host04, run the following command:

        clusvcadm -M vm:virt01 -m host03i

    AND simultaneously on host03, run:

        clusvcadm -M vm:virt02 -m host01i
  
Actual results:

    The command on host03 returned first with a success. I had a
    session established with virt02, and this session continued
    unaffected. Clustat correctly reported that the VM was now running
    on host03. A few seconds later the command on host04 failed and I
    lost my session with virt01.

    VM virt01 has now entered the 'funny' state. Clustat reports that
    the VM is now running on host03, and `virsh list` on host03 shows
    that the VM has "no state":

        [root@host03 xen]# virsh list
        Id Name                 State
        ---------------------------------
         0 Domain-0             running
         7 virt01               no state
         8 virt02               idle

    I cannot ping or establish a console session with virt01 on host03:

    [root@host03 xen]# virsh console virt01
    No console available for domain

    On host02 (where virt01 was originally running), I have a
    left-over "migrating-virt01" domain:

        [root@host02 xen]# virsh list
        Id Name                 State
        ---------------------------------
         0 Domain-0             running
         3 migrating-virt01     idle

Expected results:

    Both guests should migrate with no problems.

Additional info:

    The reproduction steps and results above are from one of our
    customers.  I am attempting to reproduce this without Cluster
    Suite.

    This looks very similar to Bug 519401 and Bug 512300 (except that
    this is not a local migration).

Comment 1 Chris Lalancette 2009-09-14 07:02:54 UTC
This patch recently was posted to upstream xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html

And was committed to xen-unstable.hg as c/s 19990.  It might be useful as a starting point here.

Chris Lalancette

Comment 2 Jiri Denemark 2009-09-14 07:32:12 UTC
"fix cross migrate problem" sent on Sep 3 by James Song <jsong> (not in archives as they end on August) may also be relevant. Although sock.close() shouldn't really be removed and I would suspect sock.shutdown(2) would do the same as sock.shutdown() but I haven't checked it.

Actually, c/s 20157
http://xenbits.xensource.com/xen-unstable.hg?rev/b4b79f3e3118 looks like the correct version of that patch, although with a different commit comment, which makes it a little less relevant to this BZ.

Comment 3 Bryan Mason 2009-09-16 17:26:19 UTC
The customer has reported that, as described in http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html, the migration completes once the running domain is shut down.  They believe that the problem they are experiencing is the same as the one described in that message.

Based on that information, I've attempted to backport the patch referenced in Comment #1 to xen-3.0.3-94.el5 and built test packages which have been provided to the customer.  If the tests are successful, I will post the patch here (it will probably need some additional work before adding to RHEL).

Comment 5 Bryan Mason 2009-10-07 17:11:32 UTC
There appear to be two migration cases that can produce failure in xen-3.0.3-94.el5:

1) Simultaneous migration of two guests to the same host

   Host:    host001     host002     host003
   ----------------------------------------
   Guest:   guest01 --------------> guest01
                        guest02 --> guest02

2) Simultaneous swapping of guests between two hosts:

   Host:    host001     host002
   ----------------------------
   Guest:   guest01 --> guest01
            guest02 <-- guest02

Test packages that I built with an attempted backport of the patch referenced in Comment #1 seemed to have resolved case #1, but case #2 still results in failure as described above.

Comment 6 Bryan Mason 2009-10-07 17:12:38 UTC
Created attachment 364007 [details]
Possible patch

Attempted backport of the patch referenced in Comment #1.

Comment 7 Jiri Denemark 2009-10-22 12:10:30 UTC
Ah, I was finally able to reproduce this issue. It looks like a problem between xen and libvirt. Using xm migrate, I can cross-migrate without any problems. Using virsh, however, is completely different, and I end up with a paused guest...

Comment 8 Jiri Denemark 2009-10-22 14:15:36 UTC
Also, it works even with virsh for PV guests without PVFB, that is, for guests that don't need qemu-dm. However, xm migrate works for me in all cases (PV, PV+PVFB, FV).

Comment 9 Jiri Denemark 2009-10-23 09:59:05 UTC
Bryan, could you check whether, after swapping guests between two hosts (case 2 in comment #5), you can unpause the paused guest by shutting down the other guest? In other words, after swapping, the situation could look like the following:

   host001            host002
guest02 (paused)  guest01 (running)

then, after shutting down guest01, the second guest (guest02) should be unpaused.
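
(Concretely, assuming the plain xm toolstack and the placeholder guest/host names from the diagram in comment #5, the check would look roughly like this:)

    [root@host002]# xm shutdown guest01
    [root@host001]# xm list guest02

The state flag shown by xm list for guest02 should change from "p" (paused) to "b" or "r" once guest01 is gone.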

Comment 10 Jiri Denemark 2009-10-23 12:06:10 UTC
Ah, interesting, I was able to reproduce this with xm migrate, too. So it was just bad luck that it happened to work the day I first tried it.

Comment 11 Jiri Denemark 2009-10-23 13:42:39 UTC
*** Bug 517236 has been marked as a duplicate of this bug. ***

Comment 14 Jiri Denemark 2009-10-26 16:23:40 UTC
Created attachment 366121 [details]
Set FD_CLOEXEC flag on all file descriptors

An attempt to fix this differently
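
(For illustration only, this is not the attached patch: the idea behind the FD_CLOEXEC approach, sketched here in Python since that is what xend is written in, is that descriptors xend leaves open, such as its relocation listener socket, should not be inherited by the helpers it exec()s, e.g. qemu-dm. The port number and helper name below are placeholders.)

    import fcntl
    import socket

    def set_cloexec(fd):
        # Mark the descriptor close-on-exec so that children spawned via
        # fork()+exec() (such as qemu-dm) do not inherit it and keep it,
        # and its TCP port, alive after xend closes its own copy.
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

    # Example: a migration/relocation listener socket.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("0.0.0.0", 8002))
    listener.listen(5)
    set_cloexec(listener.fileno())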

Comment 16 Jiri Denemark 2009-10-27 11:29:52 UTC
Created attachment 366239 [details]
Set FD_CLOEXEC flag on all file descriptors

Missing import added

Comment 17 Jiri Denemark 2009-10-27 11:31:32 UTC
Created attachment 366241 [details]
Set FD_CLOEXEC flag on all file descriptors

And really fixed version now...

Comment 18 Issue Tracker 2009-10-27 12:12:07 UTC
Event posted on 2009-10-27 13:10 CET by jturro

Customer tested the packages from comment #15 in the "swapping" case
(scenario 2 from comment #5) and they report:
 
The new Xen packages seem to have resolved the 'swapping' VM issue.

pep


Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by jturro 
 issue 315699

Comment 19 Jiri Denemark 2009-10-27 13:17:45 UTC
Great, thanks for the testing.

Comment 21 Jiri Denemark 2009-11-03 15:18:52 UTC
Created attachment 367311 [details]
Set FD_CLOEXEC flag when opening file descriptors

Comment 27 Jiri Denemark 2009-12-16 15:14:32 UTC
*** Bug 519401 has been marked as a duplicate of this bug. ***

Comment 32 Jiri Denemark 2010-01-08 08:14:16 UTC
You don't really need to bother with cluster stuff to reproduce the bug. You can do so with just two simple xm migrate commands, which is what cluster ends up doing anyway.
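
For example, the swapping case from comment #5 reduces to running something like the following at roughly the same time, one command on each source host (guest and host names are the placeholders from that comment's diagram):

    [root@host001]# xm migrate --live guest01 host002
    [root@host002]# xm migrate --live guest02 host001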

Comment 33 Rita Wu 2010-01-12 08:51:56 UTC
Verified on xen-3.0.3-103.el5 + kernel-xen-2.6.18-164.el5

I tested it in the two scenarios mentioned in comment 5, and after migration both domains are running.

Comment 35 errata-xmlrpc 2010-03-30 08:58:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html

Comment 38 Paolo Bonzini 2010-04-08 15:46:28 UTC
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).

