Description of problem: As reported by Qualcomm:

a. Updates to cluster.conf cause existing virtual services to reboot for no apparent reason. This seems to happen sometimes when I either add a service to or remove a service from clustering and then update the cluster.conf file. I've seen this happen on manual edits where I run ccs_tool update afterward, but also on updates from luci.

b. If a VM is in a bad state and can't start up, there appears to be no limit to the number of times rgmanager will try to start it. The result is that a VM may be restarted hundreds of times and always listed as "recovering" in its state. Depending on the failure policy (disable, restart or relocate), this can have different behavior (see the config sketch after this list). "Restart" seems to be the worst because a single Xen machine tries over and over again to start a VM and gets into a "spinning" state; I've seen this crash xend before. Disable is safe, but it hampers normal reboot operations on the VM and forces people to do stuff like that from the hypervisor itself (via clusvcadm or a luci interface) -- both are really unacceptable to me.

c. When rgmanager VMs crash, it sometimes leaves qemu-dm and Xvnc processes hanging around on other machines in the cluster. This is bad because it seems to cause VM instability when the VM starts somewhere else. The processes live on, but since you can't see them in virsh list or xm list, they are tough to spot. The VMs usually end up with corrupted /var or / partitions when this happens, which (as you'll see from #4) is a pain.

d. Migrations do not take into account resources on the machines before executing. This is a problem we've seen when we try to live-migrate a VM to another host that doesn't have enough memory for it. The result is that the guest gets stopped (downtime) and then HA restarts it somewhere else (hopefully). rgmanager seems to lack any type of load or free-memory information -- or at least it doesn't use it at all.
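For the failure policy mentioned in item (b), here is a minimal cluster.conf sketch of how a VM service's recovery policy is commonly expressed. The recovery attribute name and its handling by this particular rgmanager version are my assumption; the VM name 'guest1' and path are illustrative only:

<rm>
  <!-- Assumption: recovery selects what rgmanager does when the VM fails a
       status check. "restart" retries on the same node (the "spinning" case
       described above), "relocate" moves it to another node, and "disable"
       leaves it stopped until an administrator re-enables it (clusvcadm -e). -->
  <vm name="guest1" path="/etc/xen" recovery="relocate"/>
</rm>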
Here are some logs that illustrate item "a" from the bug report. In this case I have a 2-node cluster where I added 1 new VM yesterday (named 'mfpkg3-7d4c'). When I added this, it caused 2 other VMs (named 'buchanon' and 'mfpkg3-7d4b') to restart.

Machine #1 logs from rgmanager:

Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <notice> Reconfiguring
Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <info> Loading Service Data
Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <err> Error storing vm: Duplicate
Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <info> Stopping changed resources.
Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:oracle-lt1
Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:buchanon
Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is recovering
Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is recovering
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is stopped
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is stopped
Jan 14 17:40:54 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:mfpkg3-7d4b
Jan 14 17:41:00 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is recovering
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is now running on member 2
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Restarting changed resources.
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Starting changed resources.
Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> Initializing vm:mfpkg3-7d4c
Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:oracle-lt2 is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4b is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:oracle-lt1
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4c is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:buchanon
Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon started
Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 started
Jan 14 17:44:47 xen-test01 clurgmgrd[20356]: <notice> Recovering failed service vm:mfpkg3-7d4c
Jan 14 17:44:50 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4c started

Machine #2 logs from rgmanager:

Jan 13 00:00:01 test-sun-4600 newsyslog[25442]: logfile turned over
Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <notice> Reconfiguring
Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <info> Loading Service Data
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Stopping changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Restarting changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Starting changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> Initializing vm:mfpkg3-7d4c
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:lsf-rhel5 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:qcaetest32-rh4 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-3 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-4 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:mfpkg3-7d4a is running on 1
Jan 14 17:39:53 test-sun-4600 clurgmgrd[736]: <notice> Starting stopped service vm:mfpkg3-7d4c
Jan 14 17:39:57 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c started
Jan 14 17:40:46 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
Jan 14 17:40:47 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
Jan 14 17:41:00 test-sun-4600 clurgmgrd[736]: <notice> Recovering failed service vm:mfpkg3-7d4b
Jan 14 17:41:02 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4b started
Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> status on vm "mfpkg3-7d4c" returned 1 (generic error)
Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> Stopping service vm:mfpkg3-7d4c
Jan 14 17:44:46 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is recovering
Jan 14 17:44:50 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is now running on member 1
Jan 15 00:00:01 test-sun-4600 newsyslog[19165]: logfile turned over
On item (b): apparently there are "max_restarts" and "threshold" (not sure of the exact attribute names) config items for VM virtual services. However, using luci-0.12.0-7.el5, I do not see those options in the interface, and I do not know where to look for their detailed documentation. As such, I can't implement that yet, and I'm not sure in which version of rgmanager those options will be made available.
For item (b):

max_restarts="X" - Keep track of this many restarts.
restart_expire_time="Y" - A restart is "forgotten" after this many seconds.

<vm name="vm1" max_restarts="X" restart_expire_time="Y" ... />
<service name="service1" max_restarts="X" restart_expire_time="Y" ... />

Tolerate X restarts in Y seconds. After this tolerance is exceeded (i.e. on the (X+1)th failure in Y seconds), relocate the failed service or virtual machine to another node in the cluster.
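To make that concrete, here is a minimal sketch of how these attributes could sit in the rm section of cluster.conf. The names vm1/service1 come from the example above; the values 3 and 300 are purely illustrative, not recommendations:

<rm>
  <!-- Tolerate 3 restarts within a 300-second window; on the 4th failure in
       that window, rgmanager relocates the VM to another cluster node. -->
  <vm name="vm1" max_restarts="3" restart_expire_time="300"/>

  <!-- The same two attributes apply to ordinary services. -->
  <service name="service1" max_restarts="3" restart_expire_time="300"/>
</rm>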
Just a short summary of the current plan after yesterday's call:

- Upgrade the test cluster to 5.3 to verify whether the cluster.conf update issue (restart of VM services) has indeed been resolved (item a).
- Once the fix is verified, either develop a plan for a rolling upgrade of the 8-node clusters or evaluate a possible backport of the RHEL 5.3 fixes to RHEL 5.2 CS.
- Verify the (non)existence of the max_restarts parameter in Conga/Luci for RHEL 5.2 and RHEL 5.3 (see update #2; it looks like it's not propagated into the Luci UI).
- Look into short-term options for a resource pre-check before VM service migration (enough available memory on the target host to accommodate the VM).
- Not related to this BZ, but usability/operation issues to be researched:
  - Better integration of guest provisioning with clustering (virt-install) - add a newly created VM automatically to the cluster framework.
  - Improve cloning capabilities (virt-clone) - change/modify guest configuration during/after cloning (automated).

- Jan
Update: I took a RHEL 5.2 cluster and upgraded just the openais, cman and rgmanager RPMs, and now CMAN will not start on the upgraded machine. A "cman_tool -d join" generates an openais error message:

aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed.

A full "yum update" to RHEL 5.3 from RHEL 5.2 hangs indefinitely while upgrading the "tog-pegasus" package. I had to manually remove tog-pegasus (and cluster-cim), and then the update goes through.
I've confirmed that the differing openais packages cause this problem. I did a full update on a single system in this cluster (including the kernel and openais) and encountered this. If I downgrade openais back to openais-0.80.3-15.el5 x86_64 (it was 0.80.3-22.el5 x86_64), then all is well. I believe there is chatter on the RHEL 5 lists about this problem as well, so I know I'm not the only one encountering this issue.
Quick update. Initial testing on rgmanager in RHEL 5.3 seems good for "problem a". Testing on max_restarts and restart_expire_time also seems good. I'll need a couple more days to verify everything on rgmanager in RHEL 5.3, but kudos so far is due.
Update: The "max_restarts" and "restart_expire_time" functionally work as expected. However, applying these 2 attributes to an existing VM causes rgmanager to restart the VM in both RHEL 5.3 and the RHEL 5.2 versions. Not really ideal since we have to take a downtime for every existing VM that needs this fix applied.
Ok, thanks for the update.
Just wanted to confirm that after 3 weeks of testing, I haven't yet seen the VM restart issue occur when updating /etc/cluster/cluster.conf. Thanks for all the updates.
So there are some other relevant bugzillas you should be aware of:

Another VM restart issue (not related to the one fixed in 5.3):
https://bugzilla.redhat.com/show_bug.cgi?id=490449
https://bugzilla.redhat.com/show_bug.cgi?id=491654
This is part of item (a), and has been verified to be fixed.

Migration issue:
https://bugzilla.redhat.com/show_bug.cgi?id=303111
Porting to virsh will allegedly fix many of the xm-related lack-of-resource problems. This should address item (d).

That leaves item (c).
FWIW item (d): https://bugzilla.redhat.com/show_bug.cgi?id=412911
I saw this as well. I am running a 5.2 cluster with four members, and upon running a "ccs_tool update cluster.conf", I saw this in my log:

May 15 15:12:34 tomskerritt ccsd[7265]: Update of cluster.conf complete (version 20 -> 21).
May 15 15:12:45 tomskerritt clurgmgrd[8453]: <notice> Reconfiguring
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Initializing vm:emaitred
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> vm:emaitred was added to the config, but I am not initializing it.
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild02-iscsi is running on 3
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild04 is running on 3
May 15 15:12:47 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:emaitred is running on 5
May 15 15:14:31 tomskerritt clurgmgrd[8453]: <notice> Recovering failed service vm:superdev
May 15 15:14:32 tomskerritt kernel: device vif11.0 entered promiscuous mode
May 15 15:14:32 tomskerritt kernel: ADDRCONF(NETDEV_UP): vif11.0: link is not ready
May 15 15:14:32 tomskerritt clurgmgrd[8453]: <notice> Service vm:superdev started

Needless to say, superdev /was/ running before.
Chris, I think you hit the reconfiguration problem which was addressed in 5.3.z: https://rhn.redhat.com/errata/RHBA-2009-0415.html http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=491654
Is there a way to reproduce item (c) from comment #0? Effectively, without a reproducer (e.g. a method to make the VMs crash), there's not much we can do to fix it. Also, it may be substantially more appropriate to fix the reason the VMs are crashing!
Actually, since we entirely use virsh and xm to manage VMs, item (c) is not very "fixable" without really digging into things.
I agree with comments #16 and #17 here. This needs to be fixed in the virt tools themselves; rgmanager is not causing vnc sessions and qemu-dm processes to be left around when the domU crashes. cc'ing bburns from the virt team to look at this. Since item (c) is the only item left open in the original list of issues, and item (c) is really a Xen issue, I'm inclined to change the component to xen. Also, in order to figure out item (c) we need reproducers or logs from when this happens. Without that there's not much we can do. Bill, what are your thoughts?
Yes, I agree leaving qemu-dm running is something which should be fixed in Xen. Not sure about Xvnc processes as they are not started by Xen. Anyway, as Perry already mentioned, we would need logs and ideally reproducers. And I think it would be a great idea to make a new BZ for this issue rather than using this one as it covers several issues and is full of irrelevant data.
The rgmanager-specific issues have been addressed for some time (5.4 time frame). The remaining issue seems to be Xen-specific. I've copied the relevant bits into this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=610208