Description of problem:

Attempting to live-migrate two xen guests at the same time results in one of the migrated guests ending up in a paused state. The simplest scenario assumes we have guest "g1" on host "h1" and guest "g2" on host "h2". Attempting to migrate g1 -> h2 and g2 -> h1 will result in either g1 or g2 being paused.

Version-Release number of selected component (if applicable):
xen-3.0.3-94.el5-x86_64

How reproducible:
Every time.

Steps to Reproduce:

Start with a four-node cluster consisting of RHEL 5.4 (x86_64) nodes: host01, host02, host03, host04. The cluster contains three virtual machines: virt01, virt02, and virt03. These VMs can be migrated to any node in the cluster as long as only one is migrated at a time.

Before attempting a simultaneous live migration of two VMs, the cluster was in this state: virt01 was running on host02, virt02 was running on host04, and virt03 was running on host01.

Attempt to live migrate both virt01 and virt02 to node host03. On host04, run the following command:

  clusvcadm -M vm:virt01 -m host03i

and simultaneously on host03, run:

  clusvcadm -M vm:virt02 -m host01i

Actual results:

The command on host03 returned first with success. I had a session established with virt02, and this session continued unaffected. Clustat correctly reported that the VM was now running on host03.

A few seconds later the command on host04 failed and I lost my session with virt01. VM virt01 has now entered the 'funny' state. Clustat reports that the VM is now running on host03, and `virsh list` on host03 shows that the VM has "no state":

  [root@host03 xen]# virsh list
   Id Name                 State
  ---------------------------------
    0 Domain-0             running
    7 virt01               no state
    8 virt02               idle

I cannot ping or establish a console session with virt01 on host03:

  [root@host03 xen]# virsh console virt01
  No console available for domain

On host02 (where virt01 was originally running), I have a left-over "migrating-virt01" domain:

  [root@host02 xen]# virsh list
   Id Name                 State
  ---------------------------------
    0 Domain-0             running
    3 migrating-virt01     idle

Expected results:
Both guests should migrate with no problems.

Additional info:

The reproduction steps and results above are from one of our customers. I am attempting to reproduce this without Cluster Suite. This looks very similar to Bug 519401 and Bug 512300 (except that this is not a local migration).
This patch was recently posted to upstream xen-devel:

http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html

and was committed to xen-unstable.hg as c/s 19990. It might be useful as a starting point here.

Chris Lalancette
"fix cross migrate problem" sent on Sep 3 by James Song <jsong> (not in archives as they end on August) may also be relevant. Although sock.close() shouldn't really be removed and I would suspect sock.shutdown(2) would do the same as sock.shutdown() but I haven't checked it. Actually c/s 20157 http://xenbits.xensource.com/xen-unstable.hg?rev/b4b79f3e3118 looks like the correct version of that patch although with a different comment which makes it a little bit less relevant to this BZ.
The customer has reported that, as described in http://lists.xensource.com/archives/html/xen-devel/2009-07/msg01094.html, the migration completes once the running domain is shut down. They believe the problem they are experiencing is the same as the one described in that message. Based on that information, I've attempted to backport the patch referenced in Comment #1 to xen-3.0.3-94.el5 and built test packages, which have been provided to the customer. If the tests are successful, I will post the patch here (it will probably need some additional work before it can be added to RHEL).
There appear to be two migration cases that can produce this failure in xen-3.0.3-94.el5:

1) Simultaneous migration of two guests to the same host:

   Host:   host001      host002      host003
   ------------------------------------------
   Guest:  guest01 -----------------> guest01
                        guest02 ----> guest02

2) Simultaneous swapping of guests between two hosts:

   Host:   host001      host002
   -----------------------------
   Guest:  guest01 ---> guest01
           guest02 <--- guest02

Test packages that I built with an attempted backport of the patch referenced in Comment #1 seem to have resolved case #1, but case #2 still results in the failure described above.
Created attachment 364007 [details]
Possible patch

Attempted backport of the patch referenced in Comment #1.
Ah, I was finally able to reproduce this issue. It looks like a problem between xen and libvirt. Using xm migrate, I can cross-migrate without any problems. Using virsh, however, is completely different, and I get a paused guest...
It also works even with virsh for PV guests without PVFB, that is, for guests that don't need qemu-dm. However, xm migrate works for me in all cases (PV, PV+PVFB, FV).
Bryan, could you check that after swapping guests between two hosts (case 2 in comment #5) you can unpause the paused guest by shutting down the other guest? In other words, after the swap the situation could look like this:

  host001             host002
  guest02 (paused)    guest01 (running)

Then, after shutting down guest01, the second guest (guest02) should be unpaused.
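That is, something like the following (a sketch; host and guest names as in comment #5):

  [root@host002]# xm shutdown guest01
  [root@host001]# xm list guest02   # the 'p' (paused) state flag should clear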
Ah, interesting, I was able to reproduce this with xm migrate, too. So it was just very bad luck that day I was trying this.
*** Bug 517236 has been marked as a duplicate of this bug. ***
Created attachment 366121 [details]
Set FD_CLOEXEC flag on all file descriptors

An attempt to fix this differently.
Created attachment 366239 [details]
Set FD_CLOEXEC flag on all file descriptors

Missing import added.
Created attachment 366241 [details]
Set FD_CLOEXEC flag on all file descriptors

And a really fixed version now...
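For anyone following along: the idea behind these attachments is to mark xend's file descriptors close-on-exec so that children it forks (such as qemu-dm) don't inherit and hold open the migration sockets. A minimal sketch of the technique (not the attached patch itself):

  import fcntl

  def set_cloexec(fd):
      """Mark an existing fd close-on-exec, preserving its other flags."""
      flags = fcntl.fcntl(fd, fcntl.F_GETFD)
      fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

  # With FD_CLOEXEC set, the descriptor is closed automatically across
  # fork()+exec*(), so a freshly spawned qemu-dm can no longer keep a
  # half-finished migration's socket open.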
Event posted on 2009-10-27 13:10 CET by jturro

Customer tested the packages from comment #15 in the "swapping" case (scenario 2 from comment #5) and they report:

  The new Xen packages seem to have resolved the 'swapping' VM issue.

pep

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by jturro
issue 315699
Great, thanks for the testing.
Created attachment 367311 [details]
Set FD_CLOEXEC flag when opening file descriptors
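The difference from the earlier attachments is setting the flag at the point where each descriptor is opened, rather than sweeping over already-open descriptors afterwards, which narrows the window in which a fork() can inherit an unflagged fd. Again a sketch of the approach (hypothetical helper, not the patch itself):

  import fcntl
  import os

  def open_cloexec(path, flags, mode=0o644):
      """Open a file and immediately mark it close-on-exec."""
      fd = os.open(path, flags, mode)
      fcntl.fcntl(fd, fcntl.F_SETFD,
                  fcntl.fcntl(fd, fcntl.F_GETFD) | fcntl.FD_CLOEXEC)
      return fd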
*** Bug 519401 has been marked as a duplicate of this bug. ***
You don't really need to bother with cluster stuff to reproduce the bug. You can do so with just two simple xm migrate commands (see below), which is what the cluster ends up doing anyway.
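For example, for the swapping case (host and guest names as in comment #5), run these at the same time:

  [root@host001]# xm migrate --live guest01 host002
  [root@host002]# xm migrate --live guest02 host001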
Verified on xen-3.0.3-103.el5 + kernel-xen-2.6.18-164.el5.

I tested it in the two scenarios mentioned in comment 5, and after migration both domains are running.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0294.html
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).