Bug 479737
| Field | Value |
|---|---|
| Summary | Difficult to reproduce rgmanager issues when using RHCS w/ Xen |
| Product | Red Hat Enterprise Linux 5 |
| Component | xen |
| Version | 5.3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | low |
| Reporter | Zak Berrie <zbrown> |
| Assignee | Xen Maintainance List <xen-maint> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | bburns, cchen, clalance, cluster-maint, dave.costakos, grimme, iannis, jmh, nenad, rbinkhor, schlegel, sghosh, xen-maint |
| Doc Type | Bug Fix |
| Bug Depends On | 412911 |
| Cloned To | 610208 (view as bug list) |
| Last Closed | 2010-07-01 18:41:30 UTC |
Description (Zak Berrie, 2009-01-12 20:01:15 UTC)
Here are some logs that illustrate item "a" from the bug report. In this case I have a 2-node cluster where I added 1 new VM yesterday (named 'mfpkg3-7d4c'). When I added this, it caused 2 other VMs (named 'buchanon' and 'mfpkg3-7d4b') to restart.

Machine #1 logs from rgmanager:

    Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <notice> Reconfiguring
    Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <info> Loading Service Data
    Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <err> Error storing vm: Duplicate
    Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <info> Stopping changed resources.
    Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:oracle-lt1
    Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:buchanon
    Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is recovering
    Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is recovering
    Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
    Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is stopped
    Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
    Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is stopped
    Jan 14 17:40:54 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:mfpkg3-7d4b
    Jan 14 17:41:00 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is recovering
    Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is now running on member 2
    Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Restarting changed resources.
    Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Starting changed resources.
    Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> Initializing vm:mfpkg3-7d4c
    Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
    Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:oracle-lt2 is running on 2
    Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4b is running on 2
    Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:oracle-lt1
    Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4c is running on 2
    Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:buchanon
    Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon started
    Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 started
    Jan 14 17:44:47 xen-test01 clurgmgrd[20356]: <notice> Recovering failed service vm:mfpkg3-7d4c
    Jan 14 17:44:50 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4c started

Machine #2 logs from rgmanager:

    Jan 13 00:00:01 test-sun-4600 newsyslog[25442]: logfile turned over
    Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <notice> Reconfiguring
    Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <info> Loading Service Data
    Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Stopping changed resources.
    Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Restarting changed resources.
    Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Starting changed resources.
    Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> Initializing vm:mfpkg3-7d4c
    Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
    Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:lsf-rhel5 is running on 1
    Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:qcaetest32-rh4 is running on 1
    Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-3 is running on 1
    Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-4 is running on 1
    Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:mfpkg3-7d4a is running on 1
    Jan 14 17:39:53 test-sun-4600 clurgmgrd[736]: <notice> Starting stopped service vm:mfpkg3-7d4c
    Jan 14 17:39:57 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c started
    Jan 14 17:40:46 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
    Jan 14 17:40:47 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
    Jan 14 17:41:00 test-sun-4600 clurgmgrd[736]: <notice> Recovering failed service vm:mfpkg3-7d4b
    Jan 14 17:41:02 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4b started
    Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> status on vm "mfpkg3-7d4c" returned 1 (generic error)
    Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> Stopping service vm:mfpkg3-7d4c
    Jan 14 17:44:46 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is recovering
    Jan 14 17:44:50 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is now running on member 1
    Jan 15 00:00:01 test-sun-4600 newsyslog[19165]: logfile turned over
On item B: apparently there are "max_restarts" and "threshold" (not sure of the attribute names) config items for VM virtual services. However, using luci-0.12.0-7.el5, I do not see those options in the interface, and I do not know where to look for their detailed documentation. As such, I can't implement that yet, and I'm not sure in what version of rgmanager those items will be made available.

For item b:

    max_restarts="X"         - Keep track of this many restarts.
    restart_expire_time="Y"  - A restart is "forgotten" after this many seconds.

    <vm name="vm1" max_restarts="X" restart_expire_time="Y" ... />
    <service name="service1" max_restarts="X" restart_expire_time="Y" ... />

Tolerate X restarts in Y seconds. After this tolerance is exceeded (i.e. on the (X+1)th failure within Y seconds), relocate the failed service or virtual machine to another node in the cluster. (A fuller configuration sketch appears after the plan summary below.)

Just a short summary of the current plan after yesterday's call:

- Upgrade the test cluster to 5.3 to verify whether the cluster.conf update issue (restart of VM services) has indeed been resolved (item a)
- Once the fix is verified, either develop a plan for a rolling upgrade of 8-node clusters or evaluate a possible backport of the RHEL 5.3 fixes to RHEL 5.2 CS
- Verify the (non)existence of the max_restarts parameter in Conga/Luci for RHEL 5.2 and RHEL 5.3 (see update #2; it looks like it is not propagated into the Luci UI)
- Look into short-term options for a resource check as a pre-check for VM service migration (available memory on the target host to accommodate the VM)
- Not related to this BZ, but usability/operation issues to be researched:
  - Better integration of guest provisioning with clustering (virt-install): add a newly created VM automatically to the cluster framework
  - Improve cloning capabilities (virt-clone): change/modify guest configuration during/after cloning (automated)

- Jan
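To make the attribute placement concrete, the following is a minimal cluster.conf sketch, not taken from this cluster: the failover domain, the recovery policy, and the values (tolerate 3 restarts per hour, then relocate) are illustrative assumptions.

    <rm>
      <failoverdomains>
        <failoverdomain name="xen-domain" ordered="0" restricted="0">
          <failoverdomainnode name="xen-test01" priority="1"/>
          <failoverdomainnode name="test-sun-4600" priority="1"/>
        </failoverdomain>
      </failoverdomains>
      <!-- Assumed values: tolerate 3 restarts in 3600 seconds; on the
           4th failure inside that window the VM is relocated to
           another node instead of being restarted in place. -->
      <vm name="mfpkg3-7d4c" domain="xen-domain" recovery="restart"
          max_restarts="3" restart_expire_time="3600"/>
    </rm>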
A "cman_tool -d join" generates and openais error message: aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed. A full "yum update" to RHEL 5.3 from RHEL 5.2 hangs indefinitely on upgrading the "tog-pegasus" package. Had to manually remove tog-pegasus (and cluster-cim) and then updates occur. I've confirmed that the differing openais packages causes this problem. I did a full update on a single system in this cluster (including the kernel and openais) and encountered this. If I downgraded openais back to openais-0.80.3-15.el5 x86_64 (was 0.80.3-22.el5 x86_64). Then all is well. I believe there is chatter on the RHEL 5 lists about this problem as well, so I know I'm not the only one encountering this issue. Quick update. Initial testing on rgmanager in RHEL 5.3 seems good for "problem a". Testing on max_restarts and restart_expire_time also seems good. I'll need a couple more days to verify everything on rgmanager in RHEL 5.3, but kudos so far is due. Update: The "max_restarts" and "restart_expire_time" functionally work as expected. However, applying these 2 attributes to an existing VM causes rgmanager to restart the VM in both RHEL 5.3 and the RHEL 5.2 versions. Not really ideal since we have to take a downtime for every existing VM that needs this fix applied. Ok, thanks for the update. Just wanted to confirm that after 3 weeks of testing, I haven't seen the VM restart issue occur when updating /etc/cluster/cluster.conf as of yet. Thanks for all the updates. So there is another relevant bugzilla you should be aware of: Another VM restart issue (not related to the one fixed in 5.3): https://bugzilla.redhat.com/show_bug.cgi?id=490449 https://bugzilla.redhat.com/show_bug.cgi?id=491654 This is part of item (a), and has been verified to be fixed. Migration issue: https://bugzilla.redhat.com/show_bug.cgi?id=303111 Porting to virsh will allegedly fix many of the xm-related lack-of-resource problems. This should address item (d). This leaves item (c). FWIW item (d): https://bugzilla.redhat.com/show_bug.cgi?id=412911 I saw this as well. I am running a 5.2 cluster with four members, and upon running a ccs_tool update cluster.conf, I saw this in my log: May 15 15:12:34 tomskerritt ccsd[7265]: Update of cluster.conf complete (version 20 -> 21). May 15 15:12:45 tomskerritt clurgmgrd[8453]: <notice> Reconfiguring May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Initializing vm:emaitred May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> vm:emaitred was added to the config, but I am not initializing it. May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild02-i scsi is running on 3 May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild04 is running on 3 May 15 15:12:47 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:emaitred is running on 5 May 15 15:14:31 tomskerritt clurgmgrd[8453]: <notice> Recovering failed service vm:superdev May 15 15:14:32 tomskerritt kernel: device vif11.0 entered promiscuous mode May 15 15:14:32 tomskerritt kernel: ADDRCONF(NETDEV_UP): vif11.0: link is not ready May 15 15:14:32 tomskerritt clurgmgrd[8453]: <notice> Service vm:superdev started Needless to say, superdev /was/ running before. Chris, I think you hit the reconfiguration problem which was addressed in 5.3.z: https://rhn.redhat.com/errata/RHBA-2009-0415.html http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=491654 Is there a way to reproduce item (c) from comment #0 ? 
Is there a way to reproduce item (c) from comment #0? Effectively, without a reproducer (e.g. a method to make the VMs crash), there's not much we can do to fix it. Also, it may be substantially more appropriate to fix the reason the VMs are crashing!

Actually, since we entirely use virsh and xm to manage VMs, item (c) is not very "fixable" without really digging into things.

I agree with comments #16 and #17 here. This needs to be fixed in the virt tools themselves. rgmanager is not what causes vnc sessions and qemu-dm processes to be left around when the domU crashes. cc'ing bburns from the virt team to look at this.

Since item (c) is the only item left open in the original list of issues, and item (c) is really a Xen issue, I'm inclined to change the component to xen. Also, in order to figure out item (c) we need reproducers or logs from when this happens. Without those there's not much we can do. Bill, what are your thoughts?

Yes, I agree that leaving qemu-dm running is something which should be fixed in Xen. Not sure about the Xvnc processes, as they are not started by Xen. Anyway, as Perry already mentioned, we would need logs and ideally reproducers. And I think it would be a great idea to open a new BZ for this issue rather than using this one, as it covers several issues and is full of irrelevant data.

The rgmanager-specific issues have been addressed for some time (the 5.4 time frame). The remaining issue seems to be Xen-specific. I've copied the relevant bits into this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=610208
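As a practical aside on item (c): below is a rough, unsupported sketch for spotting the leftover processes discussed above. The commands are standard Xen/dom0 tooling, but the match-by-eye approach and the assumption that every healthy qemu-dm or Xvnc corresponds to a domain still visible to the toolstack are mine, not from this bug:

    # Domains the Xen toolstack still knows about:
    xm list

    # Device-model and VNC processes actually running in dom0
    # (the [q]/[X] bracket trick keeps grep from matching itself):
    ps -ef | grep '[q]emu-dm'
    ps -ef | grep '[X]vnc'

    # Any qemu-dm or Xvnc process whose domain no longer appears in
    # "xm list" is likely debris from a crashed domU; review it and
    # kill it by hand.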