From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1

Description of problem:
Rgmanager will return success even if the node that it is failing over to, or starting on, does not have access to the virtual service image. clustat will show the virtual service as started.

I did a simple recreation on our cluster2 in the lab. This is a 3-node cluster; the xen image is on a GFS volume. I turned off the clvmd and gfs services on Node3.

This worked and was successful, since Node2 has access to the image:

$ clusvcadm -M vm:rhel5_apache -n cluster2-2.hb

This failed because Node3 did not have access to the GFS volume. However, clustat says everything worked:

$ clusvcadm -M vm:rhel5_apache -n cluster2-3.hb

$ clustat
Cluster Status for cluster2 @ Thu Jul 24 12:46:06 2008
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 cluster2-1.hb                         1 Online, Local, rgmanager
 cluster2-2.hb                         2 Online, rgmanager
 cluster2-3.hb                         3 Online, rgmanager

 Service Name                       Owner (Last)                       State
 ------- ----                       ----- ------                       -----
 service:nfs                        (none)                             disabled
 vm:rhel5_apache                    cluster2-3.hb                      started

---

[root@cluster2-3 ~]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1256     4 r-----    161.7
rhel5_apache

--------------------------------------------------------------------------------
Node 3 had errors:
--------------------------------------------------------------------------------
Jul 24 12:32:30 cluster2-3 clurgmgrd: [7876]: <err> stop: Could not match /dev/cluster/lv1 with a real device
Jul 24 12:32:30 cluster2-3 clurgmgrd[7876]: <notice> stop on clusterfs "mygfs1" returned 2 (invalid argument(s))
Jul 24 12:32:42 cluster2-3 clurgmgrd[7876]: <notice> Recovering failed service vm:rhel5_apache
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <notice> start on vm "rhel5_apache" returned 1 (generic error)
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <warning> #68: Failed to start vm:rhel5_apache; return value: 1
Jul 24 12:32:43 cluster2-3 clurgmgrd[7876]: <notice> Stopping service vm:rhel5_apache
Jul 24 12:32:49 cluster2-3 clurgmgrd[7876]: <notice> Service vm:rhel5_apache is recovering
Jul 24 12:35:13 cluster2-3 ccsd[6631]: Update of cluster.conf complete (version 9 -> 10).
Jul 24 12:35:24 cluster2-3 clurgmgrd[7876]: <notice> Reconfiguring
Jul 24 12:40:09 cluster2-3 kernel: tap tap-1-51712: 2 getting info
Jul 24 12:40:10 cluster2-3 kernel: device vif1.0 entered promiscuous mode
Jul 24 12:40:10 cluster2-3 kernel: ADDRCONF(NETDEV_UP): vif1.0: link is not ready
Jul 24 12:40:10 cluster2-3 logger: /etc/xen/scripts/blktap: /gfs/images/rhel5_apache.img does not exist
Jul 24 12:40:10 cluster2-3 clurgmgrd[7876]: <notice> vm:rhel5_apache is now running locally
Jul 24 12:40:34 cluster2-3 kernel: blktap: ring-ref 8, event-channel 7, protocol 1 (x86_32-abi)
Jul 24 12:40:34 cluster2-3 kernel: blktap: ring-ref 8, event-channel 7, protocol 1 (x86_32-abi)
Jul 24 12:40:34 cluster2-3 kernel: ADDRCONF(NETDEV_CHANGE): vif1.0: link becomes ready
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: port 3(vif1.0) entering learning state
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: topology change detected, propagating
Jul 24 12:40:34 cluster2-3 kernel: xenbr0: port 3(vif1.0) entering forwarding state
--------------------------------------------------------------------------------

It appears that rgmanager is not getting the correct return code, or is just not checking it.
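For reference, the condition is easy to confirm by hand on the target node; the checks below are illustrative only (image path and node name taken from the logs above), but they show what rgmanager never verifies:

# On cluster2-3, with gfs and clvmd stopped, the image is not reachable:
[root@cluster2-3 ~]# mount | grep gfs
[root@cluster2-3 ~]# test -f /gfs/images/rhel5_apache.img; echo $?
1

# Yet the migration request still reports success:
$ clusvcadm -M vm:rhel5_apache -n cluster2-3.hb; echo $?
0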
Version-Release number of selected component (if applicable):
rgmanager-2.0.38-2.el5_2.1

How reproducible:
Always

Steps to Reproduce:
1. Set up a xen cluster and create a xen image on a GFS volume.
2. clusvcadm -e vm:rhel5_apache
3. On the node you are migrating the virtual service to: service gfs stop && service clvmd stop
4. clusvcadm -M vm:rhel5_apache -n cluster2-3.hb

Actual Results:
$ clustat
clustat will show the service as started, when in fact it did not start because the node does not have access to the image.

Expected Results:
Rgmanager should be able to handle this error itself instead of relying solely on failover domain rules to prevent failover of a virtual service to a node that does not have access to the image.

Additional info:
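Until rgmanager performs this check itself, one crude local workaround is to wrap clusvcadm in a pre-flight test that refuses to migrate if the target node cannot see the disk image. This is only a sketch of a workaround, not a proposed fix; the script name and arguments are made up, while the service, node, and image path are the ones from this report:

#!/bin/bash
# migrate-vm.sh <service> <target-node> <image-path>
# Refuse to request a migration if the target node cannot see the disk image.
SERVICE=$1
TARGET=$2
IMAGE=$3

if ! ssh "$TARGET" test -f "$IMAGE"; then
    echo "refusing to migrate: $IMAGE not visible on $TARGET" >&2
    exit 1
fi

exec clusvcadm -M "$SERVICE" -n "$TARGET"

Example:
$ ./migrate-vm.sh vm:rhel5_apache cluster2-3.hb /gfs/images/rhel5_apache.img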
So:

* xend migration "worked" even though the image did not exist,
* the migration completed successfully (as far as rgmanager could tell), but soon after, rgmanager tried to recover the service and couldn't because the image did not exist,
* the subsequent "restart" succeeded even though the image didn't exist.
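The same gap shows up when the steps are run by hand outside rgmanager; a minimal sketch, assuming the domain is live on cluster2-2 and the image is not visible on cluster2-3:

# xend accepts the live migration even though the destination cannot open the image:
[root@cluster2-2 ~]# xm migrate --live rhel5_apache cluster2-3.hb; echo $?
0

# On the destination, the domain is listed but never gets memory or a run state
# (matching the xm list output in the original report):
[root@cluster2-3 ~]# xm list rhel5_apache
Name                                      ID Mem(MiB) VCPUs State   Time(s)
rhel5_apache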
*** This bug has been marked as a duplicate of bug 303111 ***