Description of problem:
For a host domain cluster, if xend is configured to NOT accept live migration but we issue a migration command for the vm service, the migration will fail, and the status of that vm service will be marked as "failed"; further actions will be ignored.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Set up a simple 2-node cluster (assume host1 and host2)
2. Add a paravirt guest (node1) on one of the machines and share out /var/lib/xen/images over NFS
3. On the other machine, mount it (rw) and also copy the guest config file /etc/xen/node1
4. Use Cluster Suite to protect this guest OS as the vm:node1 service
5. Start Cluster Suite and rgmanager on both hosts
6. The guest OS will be running on host1
7. Do a service migration (not a relocation): clusvcadm -M vm:[node1] -m [host2]

Actual results:
1. As live migration is disabled, the migration fails, and node1 is still running on host1
2. But clustat shows the service as "failed" and further actions are ignored

Expected results:
1. The migration failed, but the guest OS is still UP and RUNNING, so clustat should show it as started, not failed

Sample output:

[root@station2 ~]# clustat
Member Status: Quorate

 Member Name                   ID   Status
 ------ ----                   ---- ------
 station2.example.com             1 Online, rgmanager
 station3.example.com             2 Online, Local, rgmanager

 Service Name        Owner (Last)                  State
 ------- ----        ----- ------                  -----
 vm:node1            (station3.example.com)        failed

[root@station2 ~]# xm list
Name                        ID Mem(MiB) VCPUs State   Time(s)
Domain-0                     0     7732     8 r-----     91.8
node1                        1      191     1 -b----     22.4
Created attachment 213211 [details] cluster configuration
All Cluster version 5 defects should be reported under the Red Hat Enterprise Linux 5 product name, not Cluster Suite.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
I'm not entirely sure what the exact problem is here. The original bug report seems to be about a situation where a migration attempt is not accepted by the target host. However, xm migrate reports errors in such cases, depending on why the migration couldn't be started:

[root@virval ~]# xm migrate --live rhel5-64 mig
Error: can't connect: Connection refused

or

[root@virval ~]# xm migrate --live rhel5-64 mig
Error: (104, 'Connection reset by peer')

Another option is that a guest is migrated to a target machine but then fails to start there. In that case, xm migrate returns success, because the source xend just connects to the target, transfers the guest's memory image, and closes the connection. Xend doesn't support any kind of error reporting back to the original host once the image is transferred.

Or is there another problem I don't see?
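For illustration, the two outcomes above differ in what the caller can observe. A minimal sketch (hypothetical helper name, not actual xend or rgmanager code): only the early-abort case is visible via the exit status; a guest that fails to start on the target still yields success.

```shell
# Hypothetical helper: interpret the exit status of 'xm migrate --live'.
explain_migrate_rc() {
    rc="$1"
    if [ "$rc" -ne 0 ]; then
        # e.g. "Connection refused" or "Connection reset by peer":
        # the migration never started; the guest is untouched locally.
        echo "aborted-before-transfer"
    else
        # The memory image was transferred and the connection closed;
        # xend reports nothing about whether the guest actually started.
        echo "transferred-start-unknown"
    fi
}
```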
(In reply to comment #13)
> Another option is that a guest is migrated to a target machine but then it
> fails to start there. In that case, xm migrate returns success because source
> xend just connects to the target, transfers guest's memory image and closes the
> connection. Xend doesn't support any kind of error reporting back to the
> original host once the image is transferred.
>
> Or is there another problem I don't see?

Lon can correct me if I'm wrong, but this is the most likely culprit. If the machine you are migrating to does not have enough RAM or some other system resource, then you'll have a situation where the vm is migrated from one host to another but cannot be started. In this case the 'migration' was successful, but the 'start' was not.

I would contend that a 'live migration' actually has two parts: first, migrating the memory; second, starting the vm. If the vm is not required to start for the migration to complete, then it seems to me that it's no longer a 'live' migration but a dead one. The migration should be considered a success only if the vm could be started on the destination machine. Otherwise the whole operation should fail, and the vm should continue running on the original host. Anything else leaves you with _no_ vm running on either host.
Hmm, but according to Lon, cluster already has some code for detecting and restarting guests when they don't actually start after being migrated. Anyway, if this is the case, we have a better bug report for it: bug 513431. Unfortunately, fixing it would require significantly changing the way the source and target xend communicate with each other during migration.
Yeah, Lon corrected me after I made that comment, so you can ignore comment #14 :)
So, as far as I know, there's no way to "hook" rgmanager into the migration after 'virsh|xm migrate' completes but before the start, at least not without coupling rgmanager to libvirtd. The original bug deals with the fact that if migration aborts early (e.g. no connection, etc.), the service goes to the 'failed' state instead of staying 'started'. I thought most of this was actually fixed later.
In 5.4, rgmanager shouldn't mark the migration as failed if obvious errors (e.g. failed to connect to remote hypervisor) occur.
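A rough sketch of that 5.4 behaviour (hypothetical function name; the real logic lives in rgmanager's vm resource agent and differs in detail): map the migrate exit code, plus a local liveness check such as 'xm list <guest>', to the service state clustat should report, keeping the service 'started' when the migrate command fails but the guest never left the source host.

```shell
# Hypothetical sketch: decide the service state after a migration attempt.
service_state_after_migrate() {
    migrate_rc="$1"      # exit status of 'xm migrate' / 'virsh migrate'
    still_running="$2"   # "yes" if the guest is still listed locally

    if [ "$migrate_rc" -eq 0 ]; then
        echo started     # migration succeeded; the new owner runs the guest
    elif [ "$still_running" = "yes" ]; then
        echo started     # obvious early failure; the guest never left us
    else
        echo failed      # the guest is gone and the migration failed
    fi
}
```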
So I guess we can close this as current version, right?
*** This bug has been marked as a duplicate of bug 499835 ***
Resolving as duplicate of later bugzilla since the problem space is more narrowly defined in that bugzilla.
This bug was closed during 5.5 development and it's being removed from the internal tracking bugs (which are now for 5.6).