From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1

Description of problem:
Rgmanager will return success even if the node that it is failing over to, or starting on, does not have access to the virtual service image. clustat will show the virtual service as started.

I did a simple recreation on our cluster2 in the lab. This is a 3-node cluster; the xen image is on a GFS volume. I turned off the clvmd and gfs services on Node3.

This worked and was successful, since Node2 has access to the image:

$ clusvcadm -M vm:rhel5_apache -n cluster2-2.hb

This failed because Node3 did not have access to the GFS volume. However, clustat says everything worked:

$ clusvcadm -M vm:rhel5_apache -n cluster2-3.hb

$ clustat
Cluster Status for cluster2 @ Thu Jul 24 12:46:06 2008
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cluster2-1.hb                         1 Online, Local, rgmanager
 cluster2-2.hb                         2 Online, rgmanager
 cluster2-3.hb                         3 Online, rgmanager

 Service Name                       Owner (Last)                       State
 ------- ----                       ----- ------                       -----
 service:nfs                        (none)                             disabled
 vm:rhel5_apache                    cluster2-3.hb                      started

---

[root@cluster2-3 ~]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1256     4 r-----    161.7
rhel5_apache

--------------------------------------------------------------------------------
Node 3 had errors:
--------------------------------------------------------------------------------
Jul 24 12:32:30 cluster2-3 clurgmgrd: [7876]: <err> stop: Could not match /dev/cluster/lv1 with a real device
Jul 24 12:32:30 cluster2-3 clurgmgrd[7876]: <notice> stop on clusterfs "mygfs1" returned 2 (invalid argument(s))
Jul 24 12:32:42 cluster2-3 clurgmgrd[7876]: <notice> Recovering failed service vm:rhel5_apache
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <notice> start on vm "rhel5_apache" returned 1 (generic error)
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <warning> #68: Failed to start vm:rhel5_apache; return value: 1
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <notice> Stopping service vm:rhel5_apache
Jul 24 12:32:49 cluster2-3 clurgmgrd[7876]: <notice> Service vm:rhel5_apache is recovering
Jul 24 12:35:13 cluster2-3 ccsd[6631]: Update of cluster.conf complete (version 9 -> 10).
Jul 24 12:35:24 cluster2-3 clurgmgrd[7876]: <notice> Reconfiguring
Jul 24 12:40:09 cluster2-3 kernel: tap tap-1-51712: 2 getting info
Jul 24 12:40:10 cluster2-3 kernel: device vif1.0 entered promiscuous mode
Jul 24 12:40:10 cluster2-3 kernel: ADDRCONF(NETDEV_UP): vif1.0: link is not ready
Jul 24 12:40:10 cluster2-3 logger: /etc/xen/scripts/blktap: /gfs/images/rhel5_apache.img does not exist
Jul 24 12:40:10 cluster2-3 clurgmgrd[7876]: <notice> vm:rhel5_apache is now running locally
Jul 24 12:40:34 cluster2-3 kernel: blktap: ring-ref 8, event-channel 7, protocol 1 (x86_32-abi)
Jul 24 12:40:34 cluster2-3 kernel: blktap: ring-ref 8, event-channel 7, protocol 1 (x86_32-abi)
Jul 24 12:40:34 cluster2-3 kernel: ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: port 3(vif1.0) entering learning state
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: topology change detected, propagating
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: port 3(vif1.0) entering forwarding state
--------------------------------------------------------------------------------

It appears that rgmanager is not getting the correct return code, or is just not checking it.
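For reference, the condition is easy to confirm by hand on the target node; the checks below are illustrative only (image path and node name taken from the logs above), but they show what rgmanager never verifies:

# On cluster2-3, with gfs and clvmd stopped, the image is not reachable:
[root@cluster2-3 ~]# mount | grep gfs
[root@cluster2-3 ~]# test -f /gfs/images/rhel5_apache.img; echo $?
1

# Yet the migration request still reports success:
$ clusvcadm -M vm:rhel5_apache -n cluster2-3.hb; echo $?
0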
Version-Release number of selected component (if applicable):
rgmanager-2.0.38-2.el5_2.1

How reproducible:
Always

Steps to Reproduce:
1. Set up a xen cluster and create a xen image on a GFS volume.
2. clusvcadm -e vm:rhel5_apache
3. On the node you are migrating the virtual service to: service gfs stop && service clvmd stop
4. clusvcadm -M vm:rhel5_apache -n cluster2-3.hb

Actual Results:
$ clustat
clustat will show the service as started, when in fact it did not start because the node does not have access to the image.

Expected Results:
Rgmanager should be able to handle this error itself instead of relying solely on failover domain rules to prevent failover of a virtual service to a node that does not have access to the image.

Additional info:
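Until rgmanager performs this check itself, one crude local workaround is to wrap clusvcadm in a pre-flight test that refuses to migrate if the target node cannot see the disk image. This is only a sketch of a workaround, not a proposed fix; the script name and arguments are made up, while the service, node, and image path are the ones from this report:

#!/bin/bash
# migrate-vm.sh <service> <target-node> <image-path>
# Refuse to request a migration if the target node cannot see the disk image.
SERVICE=$1
TARGET=$2
IMAGE=$3

if ! ssh "$TARGET" test -f "$IMAGE"; then
    echo "refusing to migrate: $IMAGE not visible on $TARGET" >&2
    exit 1
fi

exec clusvcadm -M "$SERVICE" -n "$TARGET"

Example:
$ ./migrate-vm.sh vm:rhel5_apache cluster2-3.hb /gfs/images/rhel5_apache.img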
So:

* xend migration "worked" even though the image did not exist,
* the migration completed successfully (as far as rgmanager could tell), but soon after, rgmanager tried to recover the service and couldn't because the image did not exist,
* the subsequent "restart" succeeded even though the image didn't exist.
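The same gap shows up when the steps are run by hand outside rgmanager; a minimal sketch, assuming the domain is live on cluster2-2 and the image is not visible on cluster2-3:

# xend accepts the live migration even though the destination cannot open the image:
[root@cluster2-2 ~]# xm migrate --live rhel5_apache cluster2-3.hb; echo $?
0

# On the destination, the domain is listed but never gets memory or a run state
# (matching the xm list output in the original report):
[root@cluster2-3 ~]# xm list rhel5_apache
Name                                      ID Mem(MiB) VCPUs State   Time(s)
rhel5_apache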
*** This bug has been marked as a duplicate of bug 303111 ***