Bug 479247 - live migration in cluster suite does not work!
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2009-01-08 07:16 EST by Christian Nilsson
Modified: 2009-04-16 18:18 EDT

Doc Type: Bug Fix
Last Closed: 2009-01-14 11:30:06 EST


Attachments: None
Description Christian Nilsson 2009-01-08 07:16:53 EST
Description of problem:
Live migration of a Xen guest does not work!

Version-Release number of selected component (if applicable):
rgmanager-2.0.43-1.el5.x86_64

How reproducible:
Always

Steps to Reproduce:
1. clusvcadm -M vm:<guest> -m <to_host>
2. Check whether the guest has migrated.
  
Actual results:
xen5-1> clusvcadm -M vm:s0157 -m xen5-2-hb.sss.se.scania.com
Trying to migrate vm:s0157 to xen5-2-hb.sss.se.scania.com...Success

xen5-1> xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     2000     8 r-----   7748.5
s0157                                      1      255     2 -b----    675.9

xen5-2> xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     2000     8 r-----   6095.5

xen5-1> clustat

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:s0157                       xen5-2-hb.sss.se.scania.com    migrating

Expected results:
The virtual guest should have migrated.

Additional info:
Comment 1 Lon Hohberger 2009-01-08 11:53:28 EST
I have seen this before.

Migration works like this:

Source node:
- set state to 'migrating'
- call xm migrate vm-name target-node

Target node:
- I am the target owner of this VM
- Check periodically for the migration to complete

So, in your case xen5-2-hb is waiting for the migration to complete, but 'xm migrate' said that it had already completed on the source node.

The effect is that the guest gets stuck in "migrating".  This is because the 'xm migrate' command that rgmanager uses to migrate the virtual machine returns a successful exit code after only a partial migration has occurred or, in some cases, when no migration attempt has been made at all.  I have seen it occur with the VM left paused on the target node while still running on the source node.

Within rgmanager we need to intelligently detect (and resolve) this situation, but there's nothing we can do to fix a false success from 'xm' as far as I know.  Fixing the false success would require work on the Xen side.
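For illustration only, a target-side completeness check could look roughly like this (a sketch; the function name is made up and this is not the actual vm.sh code):

    # Hypothetical check, run on the target node: succeed only if the
    # domain exists locally and is not paused.  'xm list <name>' prints a
    # header line followed by the domain line; the state flags are in
    # column 5 and contain 'p' while the domain is paused.
    migration_really_complete() {
        local state
        state=$(xm list "$OCF_RESKEY_name" 2>/dev/null | awk 'NR==2 {print $5}')
        [ -n "$state" ] || return 1      # domain not present here at all
        case "$state" in
        *p*) return 1 ;;                 # half-migrated: still paused
        *)   return 0 ;;
        esac
    }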

So, if we changed migration to work like this:

Source node:
- set state to 'migrating'
- call xm migrate
- Notify the remote node to do a check
  - if check fails
    - mark vm failed with *source* node as the previous owner
    - issue 'xm destroy' on target node, just in case
  - if check succeeds, flip state

This makes migration more synchronous and eliminates getting stuck in the 'migrating' state.  Furthermore, we can log the failure (migration didn't actually complete).
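In shell terms, the source-node half of that flow would be something like the following sketch (helper names such as set_service_state, notify_target_check, mark_service_failed and run_on_node are placeholders, not real rgmanager internals):

    # Sketch of the proposed source-node logic (placeholder helpers only):
    set_service_state "vm:$OCF_RESKEY_name" migrating
    xm migrate --live "$OCF_RESKEY_name" "$target"

    # Don't trust xm's exit code; ask the target node to verify, e.g. with
    # a check like the one sketched above.
    if notify_target_check "vm:$OCF_RESKEY_name" "$target"; then
        set_service_state "vm:$OCF_RESKEY_name" started "$target"
    else
        # Migration didn't really complete: log it, keep the *source* node
        # as the previous owner, and clean up any half-created domain on
        # the target, just in case.
        log_failure "migration of $OCF_RESKEY_name to $target did not complete"
        mark_service_failed "vm:$OCF_RESKEY_name" "$(hostname)"
        run_on_node "$target" xm destroy "$OCF_RESKEY_name"
    fi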

We need to know:
 - Kernel version
 - Xen version
Comment 2 Matthew LeSieur 2009-01-08 15:46:55 EST
I have also run into this problem today.  After some debugging, I was able to determine that the live migration works after removing lines 488 to 492 from /usr/share/cluster/vm.sh.

        # Patch from Marcelo Azevedo to migrate over private
        # LANs instead of public LANs
        if [ -n $OCF_RESKEY_migration_mapping ]; then
               target=${OCF_RESKEY_migration_mapping#*$target:} target=${target%%,*}
        fi

I have tried both leaving migration_mapping unset and setting it to a blank value, and both times I get the same result.  When the script gets to the

err=$(xm migrate $migrate_opt $OCF_RESKEY_name $target 2>&1 | head -1)

line (494) in vm.sh, $target is no longer set and xm throws this error:

Error: Invalid number of arguments.
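For what it's worth, the same expansions can be reproduced in a plain bash shell (the values below are made up, purely to illustrate why $target ends up empty):

    # With no mapping configured the variable is unset/empty:
    OCF_RESKEY_migration_mapping=""
    target="xen5-2-hb.sss.se.scania.com"

    # Unquoted, this becomes '[ -n ]', which is always true, so the
    # mapping branch runs even though there is no mapping:
    if [ -n $OCF_RESKEY_migration_mapping ]; then
        target=${OCF_RESKEY_migration_mapping#*$target:}   # -> ""
        target=${target%%,*}                               # -> ""
    fi

    # $target is now empty, so the later call collapses to
    # 'xm migrate <opts> <name>' with no destination, hence
    # "Error: Invalid number of arguments."
    echo "target='$target'"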

I am running Red Hat Enterprise Linux 5.3.
Comment 3 Christian Nilsson 2009-01-09 02:24:44 EST
This is what you requested:
kernel-xen-2.6.18-120.el5.x86_64
xen-3.0.3-64.el5_2.3.x86_64
Comment 4 Christian Nilsson 2009-01-09 02:40:51 EST
I am also running Red Hat Enterprise Linux 5.3 and haven't seen this problem in 5.2.
Comment 5 Lon Hohberger 2009-01-09 13:40:19 EST
(In reply to comment #2)
> I have also run into this problem today.  After some debugging, I was able to
> determine that the live migration works after removing lines 488 to 492 from
> /usr/share/cluster/vm.sh.

>         if [ -n $OCF_RESKEY_migration_mapping ]; then

This line is wrong and is fixed in 5.3.

> I am running Red Hat Enterprise Linux 5.3.

5.3 Beta
Comment 7 Lon Hohberger 2009-01-09 13:46:56 EST
(In reply to comment #4)
> I am also running Red Hat Enterprise Linux 5.3 and have'nt seen this problem in
> 5.2.

Maybe the problem I identified in comment #2 is no longer the case on 5.3...  That would be pretty awesome.
Comment 8 Lon Hohberger 2009-01-09 13:53:03 EST
Oops, comment #1.

You guys can test the vm.sh script I referenced above if you want, but basically it fixes bad logic:

  - No migration_mapping -> no variable $OCF_RESKEY_migration_mapping

  - No shell variable translates that line to:

      if [ -n  ]; then

    With only one argument the test is not a syntax error, but it always evaluates to true ('-n' is treated as a non-empty string to test), so the mapping branch runs even when no mapping is configured and $target gets clobbered.
 
  - The fix quotes the variable, so it expands to:

      if [ -n "" ]; then

    ... which is false when no mapping is configured, as it should be.
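For reference, the quoted form of the test, plus a quick demonstration of the difference (illustration only, not a verbatim copy of the updated vm.sh):

    # The fixed conditional presumably quotes the variable:
    if [ -n "$OCF_RESKEY_migration_mapping" ]; then
        target=${OCF_RESKEY_migration_mapping#*$target:}
        target=${target%%,*}
    fi

    # Behaviour of the test with an empty/unset variable:
    unset var
    [ -n $var ]   && echo "unquoted: true"    # prints "unquoted: true"
    [ -n "$var" ] && echo "quoted: true"      # prints nothing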
Comment 9 Matthew LeSieur 2009-01-09 15:15:14 EST
The vm.sh script referenced in comment #6 fixed the problem.  I am able to migrate a domain using clusvcadm.  Thanks for the quick turn-around time on this.
Comment 10 Christian Nilsson 2009-01-12 02:54:18 EST
I have tested your change on the RHEL 5.3 beta and it seems to work now.
Thanks for the quick turn-around time on this issue.

//Christian
