Red Hat Bugzilla – Bug 315131
clustat reports wrong status when live migrate failed
Last modified: 2016-04-26 12:12:00 EDT
Description of problem:
For a host domain cluster, if xend is configured to NOT accept live migration
but we issue a migration command for a vm service, the migration will fail (as
expected), but the status of that vm service is then marked as "failed" and
further actions on it are ignored.
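For reference, refusing incoming live migration (as the description assumes) is typically configured in /etc/xen/xend-config.sxp on the target host; this is only a fragment sketch, and the surrounding options vary by version:

```
# /etc/xen/xend-config.sxp (fragment) -- refuse incoming relocations
(xend-relocation-server no)
```

After editing the file, xend must be restarted for the setting to take effect.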
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up a simple 2-node cluster (assume host1 and host2)
2. Add a paravirt guest (node1) on one of the machines and share out
/var/lib/xen/images via NFS
3. On the other machine, mount it (rw) and also copy the guest config file /etc/xen/node1
4. Use Cluster Suite to protect this guest OS as the vm:node1 service
5. Start Cluster Suite and rgmanager on both hosts
6. The guest OS will be running on host1
7. Do a service migration (not relocation): clusvcadm -M vm:[node1] -m [host2]
Actual results:
1. As live migration is disabled, the migration fails, and node1 is still
running on host1
2. But clustat shows the service as "failed" and ignores further actions

Expected results:
1. The migration failed, but the guest OS is still UP and RUNNING, so clustat
should show it as started, not failed
[root@station2 ~]# clustat
Member Status: Quorate

 Member Name                     ID   Status
 ------ ----                     ---- ------
 station2.example.com            1    Online, rgmanager
 station3.example.com            2    Online, Local, rgmanager

 Service Name     Owner (Last)               State
 ------- ----     ----- ------               -----
 vm:node1         (station3.example.com)     failed

[root@station2 ~]# xm list
Name                     ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0                  0      7732      8  r-----     91.8
node1                     1       191      1  -b----     22.4
Created attachment 213211
All cluster version 5 defects should be reported under the Red Hat Enterprise
Linux 5 product name - not Cluster Suite.
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time. This request will be
reviewed for a future Red Hat Enterprise Linux release.
I'm not entirely sure what the exact problem is here. The original bug report seems to describe a situation where a migration attempt is not accepted by the target host. However, xm migrate does report errors in such cases, depending on why the migration couldn't be started:
[root@virval ~]# xm migrate --live rhel5-64 mig
Error: can't connect: Connection refused
[root@virval ~]# xm migrate --live rhel5-64 mig
Error: (104, 'Connection reset by peer')
Another option is that a guest is migrated to a target machine but then fails to start there. In that case, xm migrate returns success because the source xend just connects to the target, transfers the guest's memory image, and closes the connection. Xend doesn't support any kind of error reporting back to the original host once the image is transferred.
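The fire-and-forget protocol described above can be illustrated with a toy model (this is simulated logic, not real xend code; all function names are made up): the source only ever learns whether the transfer completed, never whether the guest actually started on the target.

```python
# Toy model of xend's migration protocol as described above: the source
# treats a completed image transfer as success, and a start failure on
# the target is invisible to it.

def target_receive(image, can_start):
    """Target side: accept the memory image; starting may fail afterwards."""
    received = image is not None
    started = received and can_start  # a failure here never reaches the source
    return received, started

def source_migrate(image, can_start):
    """Source side: 'success' means only that the transfer completed."""
    received, _started = target_receive(image, can_start)
    return received  # this is all xm migrate's exit status can reflect

# The target fails to start the guest (e.g. not enough RAM)...
ok = source_migrate(image={"mem": "191MiB"}, can_start=False)
# ...yet the source-side call still reports the migration as successful.
```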
Or is there another problem I don't see?
(In reply to comment #13)
> Another option is that a guest is migrated to a target machine but then it
> fails to start there. In that case, xm migrate returns success because source
> xend just connects to the target, transfers guest's memory image and closes the
> connection. Xend doesn't support any kind of error reporting back to the
> original host once the image is transferred.
> Or is there another problem I don't see?
Lon can correct me if I'm wrong, but this is the most likely culprit. If the machine you are migrating to does not have enough RAM or some other system resource, then you'll have a situation where the vm is migrated from one host to another but cannot be started. In this case the 'migration' was successful, but the 'start' was not.
I would contend that a 'live migration' is actually two parts: first migrating the memory, second starting the vm. If starting the vm isn't required to complete the migration, then it seems to me it's no longer a 'live' migration but a dead one. The migration should be considered a success iff the vm could be started on the destination machine. Otherwise the whole operation should fail, and the vm should continue running on the original host. Anything else leaves you with _no_ vm running on either host.
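The two-part contention above can be sketched as a toy decision function (hypothetical illustration only, not rgmanager or xend code): the operation counts as a success iff both the transfer and the start succeed, and any failure rolls back to the guest running on the source, so a guest is always running somewhere.

```python
def live_migrate(transfer_ok, start_ok):
    """Toy model of an atomic live migration.

    Returns (host_running_the_guest, migration_succeeded). Only a
    completed transfer followed by a successful start counts as success.
    """
    if transfer_ok and start_ok:
        return ("target", True)   # guest migrated and running on the destination
    # Either the transfer aborted or the guest failed to start on the
    # target: roll back, so the guest keeps running on the original host
    # rather than running nowhere.
    return ("source", False)
```

For example, a transfer that completes but whose start fails would leave the guest on the source with the migration reported as failed.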
Hmm, but according to Lon, cluster already has some code for detecting and restarting guests when they don't actually start after being migrated.
Anyway, if this is the case, we have a better bug report for it: bug 513431. Unfortunately, fixing it would require significantly changing the way the source and target xend communicate with each other during migration.
Yeah, Lon corrected me after I made that comment, so you can ignore comment #14 :)
So, there's no way to "hook" rgmanager into migration after 'virsh|xm migrate' completes but before the start, as far as I know; at least, not without coupling rgmanager to libvirtd.
The original bug deals with the fact that if migration aborts early (e.g. no connection, etc.), that the service goes to the 'failed' state instead of staying 'started'.
I thought most of this was actually already fixed later.
In 5.4, rgmanager shouldn't mark the migration as failed if obvious errors (e.g. failed to connect to remote hypervisor) occur.
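A sketch of that 5.4-era behaviour (the function and the error list are illustrative, not rgmanager's actual code; the error strings are the ones quoted in comment #13): an obvious early failure means the guest never left the source host, so the service should stay "started" rather than flip to "failed".

```python
# Early, connection-level errors from xm migrate mean the guest is still
# running untouched on the source host.
EARLY_ERRORS = ("Connection refused", "Connection reset by peer")

def state_after_failed_migration(stderr):
    """Map an xm migrate error message to the state clustat should show."""
    if any(err in stderr for err in EARLY_ERRORS):
        return "started"   # guest never left the source host
    return "failed"        # unexplained failure: flag for recovery
```

With this classification, the "Error: can't connect: Connection refused" case from comment #13 would leave the service "started".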
So I guess we can close this as current version, right?
*** This bug has been marked as a duplicate of bug 499835 ***
Resolving as duplicate of later bugzilla since the problem space is more narrowly defined in that bugzilla.
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).