Bug 315131 - clustat reports wrong status when live migrate failed
Summary: clustat reports wrong status when live migrate failed
Keywords:
Status: CLOSED DUPLICATE of bug 499835
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.0
Hardware: i386
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 466197
 
Reported: 2007-10-02 08:57 UTC by Fai Wong
Modified: 2016-04-26 16:12 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-11-23 15:41:04 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster configuration (1.14 KB, text/plain)
2007-10-02 08:57 UTC, Fai Wong

Description Fai Wong 2007-10-02 08:57:11 UTC
Description of problem:
In a cluster of Xen host domains, if xend is configured NOT to accept live
migration but a service migration is issued for the vm service, the migration
fails as expected, yet the vm service is then marked "failed" and further
actions on it are ignored, even though the guest is still running on the
original host.
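
For context, whether xend accepts incoming live migrations is governed by the
relocation settings in /etc/xen/xend-config.sxp on the target host. The values
below are illustrative defaults, not taken from the reporter's setup:

[root@host2 ~]# grep xend-relocation /etc/xen/xend-config.sxp
(xend-relocation-server no)
(xend-relocation-port 8002)
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$')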

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Set up a simple 2-node cluster (assume hosts host1 and host2)
2. Add a paravirt guest (node1) on one of the hosts and export
/var/lib/xen/images over NFS
3. On the other host, mount the export (rw) and also copy over the guest config file /etc/xen/node1
4. Use Cluster Suite to protect this guest OS as the vm:node1 service
5. Start Cluster Suite and rgmanager on both hosts
6. The guest OS will be running on host1
7. Do a service migration (not a relocation): clusvcadm -M vm:[node1] -m [host2]
(see the command sketch below)
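
A rough sketch of steps 2-7 as commands, assuming hosts host1/host2 and guest
node1; the export options and paths are illustrative, not taken from the
attached configuration:

[root@host1 ~]# echo '/var/lib/xen/images host2(rw,no_root_squash,sync)' >> /etc/exports
[root@host1 ~]# service nfs restart
[root@host2 ~]# mount host1:/var/lib/xen/images /var/lib/xen/images
[root@host2 ~]# scp host1:/etc/xen/node1 /etc/xen/node1
[root@host1 ~]# service cman start && service rgmanager start   # likewise on host2
[root@host1 ~]# clustat                                          # vm:node1 started on host1
[root@host1 ~]# clusvcadm -M vm:node1 -m host2                   # attempt the live migration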

Actual results:
1. As live migration is disabled, the migration fails, and node1 will still
be running on host1
2. But clustat will show the service as "failed" and will ignore further actions

Expected results:
1. The migration failed, but the guest OS is still up and running, so clustat
should show the service as started, not failed


Sample output:
[root@station2 ~]# clustat

Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  station2.example.com                  1 Online, rgmanager
  station3.example.com                  2 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  vm:node1             (station3.example.com)         failed          

[root@station2 ~]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     7732     8 r-----     91.8
node1                                      1      191     1 -b----     22.4
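
(For reference: once rgmanager marks a service as failed, it normally has to be
disabled before it can be enabled again; roughly:)

[root@station2 ~]# clusvcadm -d vm:node1     # clear the failed state
[root@station2 ~]# clusvcadm -e vm:node1     # enable the service again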

Comment 1 Fai Wong 2007-10-02 08:57:11 UTC
Created attachment 213211 [details]
cluster configuration

Comment 2 Kiersten (Kerri) Anderson 2007-11-19 19:57:30 UTC
All cluster version 5 defects should be reported under the Red Hat Enterprise
Linux 5 product name, not Cluster Suite.

Comment 4 RHEL Program Management 2008-03-11 19:40:04 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 13 Jiri Denemark 2009-10-16 12:23:46 UTC
I'm not entirely sure what the exact problem is here. The original bug report seems to be about a situation where a migration attempt is not accepted by the target host. However, xm migrate reports errors in such cases, depending on why the migration couldn't be started:

[root@virval ~]# xm migrate --live rhel5-64 mig
Error: can't connect: Connection refused

or

[root@virval ~]# xm migrate --live rhel5-64 mig
Error: (104, 'Connection reset by peer')

Another option is that a guest is migrated to a target machine but then it fails to start there. In that case, xm migrate returns success because the source xend just connects to the target, transfers the guest's memory image and closes the connection. Xend doesn't support any kind of error reporting back to the original host once the image is transferred.

Or is there another problem I don't see?
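
(In that second case, the only way for a caller to notice the failure is to
check the destination afterwards; a rough sketch, reusing the host and guest
names from the examples above:)

[root@virval ~]# xm migrate --live rhel5-64 mig
[root@virval ~]# ssh mig 'xm list rhel5-64' >/dev/null 2>&1 \
    && echo "rhel5-64 is running on mig" \
    || echo "rhel5-64 did not come up on mig"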

Comment 14 Perry Myers 2009-10-19 13:11:19 UTC
(In reply to comment #13)
> Another option is that a guest is migrated to a target machine but then it
> fails to start there. In that case, xm migrate returns success because the
> source xend just connects to the target, transfers the guest's memory image
> and closes the connection. Xend doesn't support any kind of error reporting
> back to the original host once the image is transferred.
> 
> Or is there another problem I don't see?  

Lon can correct me if I'm wrong, but this is the most likely culprit.  If the machine you are migrating to does not have enough RAM or some other system resource, then you'll have a situation where the vm is migrated from one host to another but cannot be started.  In this case the 'migration' was successful, but the 'start' was not.

I would contend that a 'live migration' actually has two parts: first migrating the memory, second starting the vm.  If starting the vm is not required to complete the migration, then it seems to me that it's no longer a 'live' migration but a dead one.  The migration should be considered a success only if the vm could be started on the destination machine.  Otherwise the whole operation should fail, and the vm should continue running on the original host.  Anything else leaves you with _no_ vm running on either host.

Comment 15 Jiri Denemark 2009-10-19 13:50:53 UTC
Hmm, but according to Lon, the cluster software already has some code for detecting and restarting guests when they don't actually start after being migrated.

Anyway, if this is the case, we have a better bug report for it: bug 513431. Unfortunately, fixing it would require significantly changing the way the source and target xend communicate with each other during migration.

Comment 16 Perry Myers 2009-10-19 15:52:14 UTC
Yeah, Lon corrected me after I made that comment, so you can ignore comment #14 :)

Comment 17 Lon Hohberger 2009-10-19 19:31:03 UTC
So, there's no way to "hook" rgmanager into migration after 'virsh|xm migrate' completes but before the start, as far as I know; at least, not without coupling rgmanager to libvirtd.

The original bug deals with the fact that if the migration aborts early (e.g. no connection), the service goes to the 'failed' state instead of staying 'started'.

Actually, I thought most of this had already been fixed in later releases.

Comment 18 Lon Hohberger 2009-10-19 19:32:24 UTC
In 5.4, rgmanager shouldn't mark the migration as failed when obvious errors occur (e.g. failure to connect to the remote hypervisor).
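
(Illustration only, not the actual rgmanager code: the point is that after an
aborted migration the guest can be confirmed to still be running locally, so
there is no reason to flag the service as failed.)

[root@host1 ~]# xm migrate --live node1 host2
Error: can't connect: Connection refused
[root@host1 ~]# xm list node1 >/dev/null 2>&1 && echo "node1 still running locally; service stays 'started'"
node1 still running locally; service stays 'started'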

Comment 19 Jiri Denemark 2009-10-20 17:07:05 UTC
So I guess we can close this as fixed in the current release, right?

Comment 22 Lon Hohberger 2009-11-23 15:41:04 UTC

*** This bug has been marked as a duplicate of bug 499835 ***

Comment 23 Lon Hohberger 2009-11-23 15:42:06 UTC
Resolving as a duplicate of the later bugzilla, since the problem space is more narrowly defined in that bugzilla.

Comment 24 Paolo Bonzini 2010-04-08 15:48:32 UTC
This bug was closed during 5.5 development and is being removed from the internal tracking bugs (which are now for 5.6).

