Bug 691487

Summary: VM resource stops working
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.5
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Reporter: Madison Kelly <mkelly>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, edamato
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2011-03-28 17:13:07 UTC
Attachments:
/etc/init.d/cluster.conf in use for this bug

Description Madison Kelly 2011-03-28 16:52:55 UTC
Created attachment 488198 [details]
/etc/init.d/cluster.conf in use for this bug

Description of problem:

I created a <vm ...> resource for a running Xen VM and all was fine. I tested killing the VM, and it was properly recovered on the other node. Then I tried killing the host node, but the recovery failed. This was partly due to a misconfigured single-node service that tried to start on the surviving node (a set of scripts that were already running there via a matching service).

At this point the cluster was hung. I couldn't stop rgmanager at all and eventually had to kill -9 the running services and reboot both cluster nodes. After that I got the cluster re-assembled, but the VM service no longer worked. clustat showed the VM service as having no owning node despite it being configured in a failover domain. The VM service was also set not to start with rgmanager, but I noticed in syslog that rgmanager looked for the VM's configuration anyway (which is on a GFS2 partition that isn't yet available at that stage of the cluster start).

[root@xenmaster013 ~]# clustat
Cluster Status for xencluster03 @ Mon Mar 28 11:51:19 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 xenmaster013.iplink.net                     1 Online, Local, rgmanager
 xenmaster014.iplink.net                     2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:x13_storage            xenmaster013.iplink.net        started       
 service:x14_storage            xenmaster014.iplink.net        started       
 vm:vm0002_c5_lz7_1             (none)                         disabled      
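
For reference, the <vm> resource and failover domain described above would look roughly like the sketch below in cluster.conf. This is only an illustration: the failover domain name and the ordered/restricted flags are made up, while the node names, VM name, and the /xen_shared/xen search path come from this report; the actual configuration is in the attached cluster.conf.

<rm>
  <failoverdomains>
    <!-- "xm013_domain" is a hypothetical name used for this sketch -->
    <failoverdomain name="xm013_domain" ordered="1" restricted="0">
      <failoverdomainnode name="xenmaster013.iplink.net" priority="1"/>
      <failoverdomainnode name="xenmaster014.iplink.net" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <!-- autostart="0": do not start the VM when rgmanager comes up;
       path: directory rgmanager searches for the domU config file -->
  <vm name="vm0002_c5_lz7_1" domain="xm013_domain" path="/xen_shared/xen"
      autostart="0"/>
</rm>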

I deleted and recreated the VM service, both with the cluster running and with it stopped, and even tried delete -> reboot -> add, but it still refused to start. In all cases the error was:

[root@xenmaster013 ~]# clusvcadm -e vm:vm0002_c5_lz7_1 -m xenmaster013.iplink.net
Member xenmaster013.iplink.net trying to enable vm:vm0002_c5_lz7_1...Invalid operation for resource

In syslog:

Mar 28 12:06:57 xenmaster013 clurgmgrd[5860]: <notice> Resource Group Manager Starting 
Mar 28 12:06:58 xenmaster013 clurgmgrd: [5860]: <warning> Could not find vm0002_c5_lz7_1 or vm0002_c5_lz7_1.xml in search path /xen_shared/xen 
...
Mar 28 12:10:19 xenmaster013 clurgmgrd[5860]: <notice> Reconfiguring 
Mar 28 12:10:43 xenmaster013 ccsd[5796]: Update of cluster.conf complete (version 40 -> 41). 
Mar 28 12:10:49 xenmaster013 clurgmgrd[5860]: <notice> Reconfiguring 
Mar 28 12:10:50 xenmaster013 clurgmgrd[5860]: <notice> Initializing vm:vm0002_c5_lz7_1 
Mar 28 12:10:50 xenmaster013 clurgmgrd[5860]: <notice> vm:vm0002_c5_lz7_1 was added to the config, but I am not initializing it. 
Mar 28 12:11:11 xenmaster013 luci[5334]: Unable to retrieve batch 1201951477 status from xenmaster013.iplink.net:11111: module scheduled for execution
Mar 28 12:11:16 xenmaster013 luci[5334]: Unable to retrieve batch 1201951477 status from xenmaster013.iplink.net:11111: clusvcadm start failed to start vm0002_c5_lz7_1: 

I was finally able to get rgmanager to see the VM by changing the VM's 'path=""' attribute to point at a new directory. At that point clustat showed the preferred node as the owner of the resource, but the VM still wouldn't start.
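
The change in question is just the path attribute on the <vm> stanza, roughly like this (the new directory name below is invented; the report does not say which directory was used):

<vm name="vm0002_c5_lz7_1" domain="xm013_domain" autostart="0"
    path="/xen_shared/xen_vm"/>  <!-- hypothetical new search directory -->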

[root@xenmaster013 ~]# clustat 
Cluster Status for xencluster03 @ Mon Mar 28 12:49:55 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 xenmaster013.iplink.net                     1 Online, Local, rgmanager
 xenmaster014.iplink.net                     2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:x13_storage            xenmaster013.iplink.net        started       
 service:x14_storage            xenmaster014.iplink.net        started       
 vm:vm0002_c5_lz7_1             (xenmaster013.iplink.net)      stopped       

[root@xenmaster013 ~]# clusvcadm -e vm:vm0002_c5_lz7_1 -m xenmaster013.iplink.net
Member xenmaster013.iplink.net trying to enable vm:vm0002_c5_lz7_1...Failure

In syslog:

Mar 28 12:50:20 xenmaster013 clurgmgrd[5845]: <warning> #70: Failed to relocate vm:vm0002_c5_lz7_1; restarting locally 
Mar 28 12:50:20 xenmaster013 clurgmgrd[5845]: <notice> Starting stopped service vm:vm0002_c5_lz7_1 
Mar 28 12:50:21 xenmaster013 kernel: device vif5.0 entered promiscuous mode
Mar 28 12:50:21 xenmaster013 kernel: ADDRCONF(NETDEV_UP): vif5.0: link is not ready
Mar 28 12:50:21 xenmaster013 clurgmgrd[5845]: <notice> Service vm:vm0002_c5_lz7_1 started 
Mar 28 12:50:21 xenmaster013 clurgmgrd[5845]: <notice> Stopping service vm:vm0002_c5_lz7_1 
Mar 28 12:50:23 xenmaster013 kernel: xenbr0: port 3(vif5.0) entering disabled state
Mar 28 12:50:23 xenmaster013 kernel: device vif5.0 left promiscuous mode
Mar 28 12:50:23 xenmaster013 kernel: xenbr0: port 3(vif5.0) entering disabled state
Mar 28 12:50:27 xenmaster013 clurgmgrd[5845]: <notice> Service vm:vm0002_c5_lz7_1 is stopped 

The cluster.conf is attached.

Version-Release number of selected component (if applicable):

cman-2.0.115-34.el5_5.4
rgmanager-2.0.52-6.el5.centos.8

How reproducible:

For me, 100% reproducible after the failed failover.

Steps to Reproduce:
1. See the description above.
  
Actual results:

VM fails to restart

Expected results:

VM to start

Additional info:

Comment 1 Lon Hohberger 2011-03-28 17:13:07 UTC
The two services have exclusive="1" defined; this prevents the VM from starting.
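
For context, exclusive="1" on an rgmanager service means the service must be the only one running on its node, so with x13_storage and x14_storage holding the two nodes, the vm: resource has nowhere it is allowed to start. A simplified sketch of the kind of change that clears this, assuming the attached cluster.conf (child resources omitted; after editing, bump config_version and propagate the file, e.g. with ccs_tool update /etc/cluster/cluster.conf):

<!-- before: the storage service claims its node exclusively -->
<service name="x13_storage" autostart="1" exclusive="1">
    <!-- script resources for the storage service -->
</service>

<!-- after: exclusive="0" (or dropping the attribute) lets
     vm:vm0002_c5_lz7_1 start alongside the storage service -->
<service name="x13_storage" autostart="1" exclusive="0">
    <!-- script resources for the storage service -->
</service>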