| Field | Value |
|---|---|
| Summary | VM resource stops working |
| Product | Red Hat Enterprise Linux 5 |
| Component | rgmanager |
| Version | 5.5 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | unspecified |
| Target Milestone | rc |
| Reporter | Madison Kelly <mkelly> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, edamato |
| Doc Type | Bug Fix |
| Last Closed | 2011-03-28 17:13:07 UTC |
| Attachments | cluster.conf in use for this bug (attachment 488198) |
Closing note (NOTABUG): the two services have `exclusive="1"` defined; this prevents the VM from starting.
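For context, here is the shape of that conflict in cluster.conf terms. This is a minimal sketch, not the attached configuration: the failover domain names and the `restricted` attribute are illustrative, the service children are omitted, and only the `exclusive="1"` flags and the resource names come from the report.

```xml
<rm>
  <failoverdomains>
    <!-- Hypothetical one-node domains pinning each storage service to its host -->
    <failoverdomain name="x13_only" restricted="1">
      <failoverdomainnode name="xenmaster013.iplink.net"/>
    </failoverdomain>
    <failoverdomain name="x14_only" restricted="1">
      <failoverdomainnode name="xenmaster014.iplink.net"/>
    </failoverdomain>
  </failoverdomains>

  <!-- exclusive="1" tells rgmanager that no other service may run on the node
       running this service. With an exclusive storage service started on each
       of the two nodes, there is no node left where the VM may be placed. -->
  <service name="x13_storage" domain="x13_only" exclusive="1">
    <!-- storage scripts omitted -->
  </service>
  <service name="x14_storage" domain="x14_only" exclusive="1">
    <!-- storage scripts omitted -->
  </service>

  <vm name="vm0002_c5_lz7_1" path="/xen_shared/xen" autostart="0"/>
</rm>
```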
Created attachment 488198 [details]
/etc/cluster/cluster.conf in use for this bug

Description of problem:

I created a `<vm ...>` resource for a running Xen VM and all was fine. I tested killing the VM, and it was properly recovered on the other node. Then I tried killing the host node, but the recovery failed. This was in part due to a misconfigured single-node service that tried to start on the surviving node (a set of scripts that were already running via a matching service). At this point the cluster was hung; I couldn't stop rgmanager at all and eventually had to `kill -9` the running services and reboot both cluster nodes.

After this I got the cluster reassembled, but the VM service no longer worked. clustat showed the VM service as having no owning node despite being configured in a failover domain. The VM service was also set to not start with rgmanager, but I noticed in syslog that rgmanager looked for the VM's configuration anyway (it lives on a GFS2 partition that isn't yet available at that stage of the cluster start).

```
[root@xenmaster013 ~]# clustat
Cluster Status for xencluster03 @ Mon Mar 28 11:51:19 2011
Member Status: Quorate

 Member Name                    ID   Status
 ------ ----                    ---- ------
 xenmaster013.iplink.net           1 Online, Local, rgmanager
 xenmaster014.iplink.net           2 Online, rgmanager

 Service Name                   Owner (Last)                State
 ------- ----                   ----- ------                -----
 service:x13_storage            xenmaster013.iplink.net     started
 service:x14_storage            xenmaster014.iplink.net     started
 vm:vm0002_c5_lz7_1             (none)                      disabled
```

I deleted and recreated the VM service, both with the cluster running and with it stopped, and even tried delete -> reboot -> add; it still refused to start. In all cases the error was:

```
[root@xenmaster013 ~]# clusvcadm -e vm:vm0002_c5_lz7_1 -m xenmaster013.iplink.net
Member xenmaster013.iplink.net trying to enable vm:vm0002_c5_lz7_1...Invalid operation for resource
```

In syslog:

```
Mar 28 12:06:57 xenmaster013 clurgmgrd[5860]: <notice> Resource Group Manager Starting
Mar 28 12:06:58 xenmaster013 clurgmgrd: [5860]: <warning> Could not find vm0002_c5_lz7_1 or vm0002_c5_lz7_1.xml in search path /xen_shared/xen
...
Mar 28 12:10:19 xenmaster013 clurgmgrd[5860]: <notice> Reconfiguring
Mar 28 12:10:43 xenmaster013 ccsd[5796]: Update of cluster.conf complete (version 40 -> 41).
Mar 28 12:10:49 xenmaster013 clurgmgrd[5860]: <notice> Reconfiguring
Mar 28 12:10:50 xenmaster013 clurgmgrd[5860]: <notice> Initializing vm:vm0002_c5_lz7_1
Mar 28 12:10:50 xenmaster013 clurgmgrd[5860]: <notice> vm:vm0002_c5_lz7_1 was added to the config, but I am not initializing it.
Mar 28 12:11:11 xenmaster013 luci[5334]: Unable to retrieve batch 1201951477 status from xenmaster013.iplink.net:11111: module scheduled for execution
Mar 28 12:11:16 xenmaster013 luci[5334]: Unable to retrieve batch 1201951477 status from xenmaster013.iplink.net:11111: clusvcadm start failed to start vm0002_c5_lz7_1:
```

I finally got rgmanager to see the VM when I changed the VM's `path=""` to a new directory. At that point clustat showed the owner of the resource as the preferred node, but it still wouldn't start.
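The "Could not find ... in search path" warning reflects how rgmanager's vm resource locates the domU definition: `path` names one or more directories that are searched for a file called `<name>` or `<name>.xml`. Below is a sketch of the resource line in question; only `name`, `path`, and `autostart` come from the report, and the remaining attributes are illustrative.

```xml
<!-- rgmanager searches each directory in path for vm0002_c5_lz7_1 or
     vm0002_c5_lz7_1.xml. If the GFS2 mount backing /xen_shared/xen is not
     yet available when rgmanager starts, the lookup fails even with
     autostart="0", matching the warning seen in syslog. -->
<vm name="vm0002_c5_lz7_1" path="/xen_shared/xen" autostart="0"
    recovery="restart"/>
```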
```
[root@xenmaster013 ~]# clustat
Cluster Status for xencluster03 @ Mon Mar 28 12:49:55 2011
Member Status: Quorate

 Member Name                    ID   Status
 ------ ----                    ---- ------
 xenmaster013.iplink.net           1 Online, Local, rgmanager
 xenmaster014.iplink.net           2 Online, rgmanager

 Service Name                   Owner (Last)                State
 ------- ----                   ----- ------                -----
 service:x13_storage            xenmaster013.iplink.net     started
 service:x14_storage            xenmaster014.iplink.net     started
 vm:vm0002_c5_lz7_1             (xenmaster013.iplink.net)   stopped

[root@xenmaster013 ~]# clusvcadm -e vm:vm0002_c5_lz7_1 -m xenmaster013.iplink.net
Member xenmaster013.iplink.net trying to enable vm:vm0002_c5_lz7_1...Failure
```

In syslog:

```
Mar 28 12:50:20 xenmaster013 clurgmgrd[5845]: <warning> #70: Failed to relocate vm:vm0002_c5_lz7_1; restarting locally
Mar 28 12:50:20 xenmaster013 clurgmgrd[5845]: <notice> Starting stopped service vm:vm0002_c5_lz7_1
Mar 28 12:50:21 xenmaster013 kernel: device vif5.0 entered promiscuous mode
Mar 28 12:50:21 xenmaster013 kernel: ADDRCONF(NETDEV_UP): vif5.0: link is not ready
Mar 28 12:50:21 xenmaster013 clurgmgrd[5845]: <notice> Service vm:vm0002_c5_lz7_1 started
Mar 28 12:50:21 xenmaster013 clurgmgrd[5845]: <notice> Stopping service vm:vm0002_c5_lz7_1
Mar 28 12:50:23 xenmaster013 kernel: xenbr0: port 3(vif5.0) entering disabled state
Mar 28 12:50:23 xenmaster013 kernel: device vif5.0 left promiscuous mode
Mar 28 12:50:23 xenmaster013 kernel: xenbr0: port 3(vif5.0) entering disabled state
Mar 28 12:50:27 xenmaster013 clurgmgrd[5845]: <notice> Service vm:vm0002_c5_lz7_1 is stopped
```

The cluster.conf is attached.

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5_5.4
rgmanager-2.0.52-6.el5.centos.8

How reproducible:
For me, 100% after the failed recovery.

Steps to Reproduce:
1. Please see the description above.

Actual results:
The VM fails to restart.

Expected results:
The VM starts.
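Given the closing note, the resolution is a configuration change rather than a code fix. A hedged sketch of what that change could look like follows (child resources omitted; whether to relax exclusivity, as shown here, or instead stop running one exclusive service per node depends on why `exclusive="1"` was wanted in the first place):

```xml
<!-- With exclusivity dropped (exclusive="0", the default), rgmanager is
     again allowed to co-locate vm:vm0002_c5_lz7_1 with a storage service
     on the node chosen by its failover domain. -->
<service name="x13_storage" exclusive="0">
  <!-- storage scripts unchanged -->
</service>
<service name="x14_storage" exclusive="0">
  <!-- storage scripts unchanged -->
</service>
```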