Bug 479737

Summary:	Difficult to reproduce rgmanager issues when using RHCS w/ Xen
Product:	Red Hat Enterprise Linux 5	Reporter:	Zak Berrie <zbrown>
Component:	xen	Assignee:	Xen Maintainance List <xen-maint>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	low
Version:	5.3	CC:	bburns, cchen, clalance, cluster-maint, dave.costakos, grimme, iannis, jmh, nenad, rbinkhor, schlegel, sghosh, xen-maint
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	610208 (view as bug list)		Environment:
Last Closed:	2010-07-01 18:41:30 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	412911
Bug Blocks:

Description Zak Berrie 2009-01-12 20:01:15 UTC

Description of problem:
 
As reported by Qualcomm:

  a. Updates to cluster.conf causes existing Virtual Services to reboot for no apparent reason.  This seems to happen sometimes when I either add a service or remove a service from clustering then update the cluster.conf file.  I've seen this happen on manual edits where I run ccs_tool update afterward, but also on updates from luci.

  b. If a VM is in a bad state and can't startup, there appears to be no limit to the number of times rgmanager will try to start it.  The result is that a VM many be restarted hundreds of times and always list as "recovering" in its state.  Depending on the failure policy (disable, restart or relocate), this can have different behavior.  "Restart" seems to be the worst because a single Xen machine tries over and over again to start a VM and gets into a "spinning" state.  I've seen this crash xend before.  Disable is safe, but then it hampers normal reboot operations on the VM and forces people to do stuff like that from the hypervisor itself (via clusvcadm or a luci interface) -- both are really unacceptable to me.

 c. When RGmanager VMs crash, it sometimes leaves qemu-dm and Xvnc processes hanging around on other machines in a cluster.  This is bad because it seems to cause VM instability when the VM starts somewhere else.  The processes live on, but since you can't see them in virsh list or xm list, it's tough to spot.  usually the VMs end of up with corrupted /var or / partitions when this happens which (as you'll see from #4) is a pain.

 d. Migrations do not take into account resources on the machines before executing.  this is a problem we've seen when we try to migrate a VM live to another host that doesn't have enough memory for it.  The result is that the guest gets stopped (downtime) and then HA restarts it somewhere else (hopefully).  RGmanager seems to lack any type of load or free memory information -- or at least it doesn't use it at all.

Comment 1 Zak Berrie 2009-01-15 16:45:46 UTC

Here are some logs that illustrate item "a" from the bug report.  In this case I have a 2 node cluster where I added 1 new VM yesterday (named 'mfpkg3-7d4c').  when I added this, it caused 2 other VMs (named 'buchanon' and 'mfpkg3-7d4b' to restart).

Machine #1 logs from rgmanager:

Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <notice> Reconfiguring
Jan 14 17:39:50 xen-test01 clurgmgrd[20356]: <info> Loading Service Data
Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <err> Error storing vm: Duplicate
Jan 14 17:39:51 xen-test01 clurgmgrd[20356]: <info> Stopping changed resources.
Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:oracle-lt1
Jan 14 17:40:34 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:buchanon
Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is recovering
Jan 14 17:40:40 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is recovering
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 is stopped
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <err> #58: Failed opening connection to member #2
Jan 14 17:40:43 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon is stopped
Jan 14 17:40:54 xen-test01 clurgmgrd[20356]: <notice> Stopping service vm:mfpkg3-7d4b
Jan 14 17:41:00 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is recovering
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4b is now running on member 2
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Restarting changed resources.
Jan 14 17:41:02 xen-test01 clurgmgrd[20356]: <info> Starting changed resources.
Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> Initializing vm:mfpkg3-7d4c
Jan 14 17:41:03 xen-test01 clurgmgrd[20356]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:oracle-lt2 is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4b is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:oracle-lt1
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Migration: vm:mfpkg3-7d4c is running on 2
Jan 14 17:41:05 xen-test01 clurgmgrd[20356]: <notice> Starting stopped service vm:buchanon
Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:buchanon started
Jan 14 17:41:11 xen-test01 clurgmgrd[20356]: <notice> Service vm:oracle-lt1 started
Jan 14 17:44:47 xen-test01 clurgmgrd[20356]: <notice> Recovering failed service vm:mfpkg3-7d4c
Jan 14 17:44:50 xen-test01 clurgmgrd[20356]: <notice> Service vm:mfpkg3-7d4c started

Machine #2 logs from rgmanager:

Jan 13 00:00:01 test-sun-4600 newsyslog[25442]: logfile turned over
Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <notice> Reconfiguring
Jan 14 17:39:49 test-sun-4600 clurgmgrd[736]: <info> Loading Service Data
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Stopping changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Restarting changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <info> Starting changed resources.
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> Initializing vm:mfpkg3-7d4c
Jan 14 17:39:51 test-sun-4600 clurgmgrd[736]: <notice> vm:mfpkg3-7d4c was added to the config, but I am not initializing it.
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:lsf-rhel5 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:qcaetest32-rh4 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-3 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:atg-android-4 is running on 1
Jan 14 17:39:52 test-sun-4600 clurgmgrd[736]: <notice> Migration: vm:mfpkg3-7d4a is running on 1
Jan 14 17:39:53 test-sun-4600 clurgmgrd[736]: <notice> Starting stopped service vm:mfpkg3-7d4c
Jan 14 17:39:57 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c started
Jan 14 17:40:46 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
Jan 14 17:40:47 test-sun-4600 clurgmgrd[736]: <err> #37: Error receiving header from 1 sz=0 CTX 0x5d964a0
Jan 14 17:41:00 test-sun-4600 clurgmgrd[736]: <notice> Recovering failed service vm:mfpkg3-7d4b
Jan 14 17:41:02 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4b started
Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> status on vm "mfpkg3-7d4c" returned 1 (generic error)
Jan 14 17:44:40 test-sun-4600 clurgmgrd[736]: <notice> Stopping service vm:mfpkg3-7d4c
Jan 14 17:44:46 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is recovering
Jan 14 17:44:50 test-sun-4600 clurgmgrd[736]: <notice> Service vm:mfpkg3-7d4c is now running on member 1
Jan 15 00:00:01 test-sun-4600 newsyslog[19165]: logfile turned over

Comment 2 Dave 2009-01-21 21:21:54 UTC

On item B apparently there is a "max_restarts" and "threshold" (not sure of the attribute names) config item for VM virtual services.

However, using luci-0.12.0-7.el5, I do not see those options in the interface and I do not know where to look for their detailed documentation.  As such, I can't implement that yet and I'm not sure in what version of rgmanager that those items will be made available.

Comment 3 Lon Hohberger 2009-01-21 22:07:00 UTC

For item b:

max_restarts="X" - Keep track of this many restarts.

restart_expire_time="Y" - A restart is "forgotten" after this many seconds.

  <vm name="vm1" max_restarts="X" restart_expire_time="Y" ... />
  <service name="service1" max_restarts="X" restart_expire_time="Y" ... />

Tolerate X restarts in Y seconds.  After this tolerance is exceeded (i.e. on the X+1 failure in Y seconds), relocate the failed service or virtual machine to another node in the cluster.

Comment 4 Jan Mark Holzer 2009-01-22 13:47:20 UTC

Just a short summary of the current plan after yesterday's call

- Upgrade the test cluster to 5.3 to verify whether the
  cluster.conf update issues (restart of VM services)
  has indeed been resolved (item a)
     - Once verified the problem has been fixed either 
       develop a plan for rolling upgrade of 8 node clusters
       or evaluate possible backport of RHEL5.3 fixes to RHEL5.2 CS

- Verify (non)existence of max_restarts parameter in Conga/Luci (see update #2
  it looks like it's not propagated into the Luci UI) for RHEL5.2
  and RHEL5.3

- Look into short term options for resource check as pre-check for
  VM service migration (available memory on target host to accomodate
  VM)

- Not related to this BZ but usability/operation issues to be researched
   - Better integration of guest provisioning with clustering (virt-install)
      - add newly created VM automatically to cluster framework
   - Improve cloning capabilities (virt-clone)
      - change/modify guest configuration during/after cloning (automated)


- Jan

Comment 5 Dave 2009-01-27 22:44:46 UTC

Update: I took a RHEL 5.2 cluster and upgraded just the openais, cman and rgmanager RPMs and now CMAN will not start on the upgraded machine.

A "cman_tool -d join" generates and openais error message:

aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed.

A full "yum update" to RHEL 5.3 from RHEL 5.2 hangs indefinitely on upgrading the "tog-pegasus" package.  Had to manually remove tog-pegasus (and cluster-cim) and then updates occur.

Comment 6 Dave 2009-01-28 17:12:21 UTC

I've confirmed that the differing openais packages causes this problem. 

I did a full update on a single system in this cluster (including the kernel and openais) and encountered this.  If I downgraded openais back to openais-0.80.3-15.el5 x86_64 (was 0.80.3-22.el5 x86_64).  Then all is well.

I believe there is chatter on the RHEL 5 lists about this problem as well, so I know I'm not the only one encountering this issue.

Comment 8 Dave 2009-01-29 23:56:08 UTC

Quick update.  Initial testing on rgmanager in RHEL 5.3 seems good for "problem a".

Testing on max_restarts and restart_expire_time also seems good.

I'll need a couple more days to verify everything on rgmanager in RHEL 5.3, but kudos so far is due.

Comment 9 Dave 2009-01-30 00:33:21 UTC

Update:

The "max_restarts" and "restart_expire_time" functionally work as expected.  However, applying these 2 attributes to an existing VM causes rgmanager to restart the VM in both RHEL 5.3 and the RHEL 5.2 versions.  Not really ideal since we have to take a downtime for every existing VM that needs this fix applied.

Comment 10 Lon Hohberger 2009-01-30 20:51:19 UTC

Ok, thanks for the update.

Comment 11 Dave 2009-02-23 22:26:57 UTC

Just wanted to confirm that after 3 weeks of testing, I haven't seen the VM restart issue occur when updating /etc/cluster/cluster.conf as of yet.

Thanks for all the updates.

Comment 12 Lon Hohberger 2009-04-02 16:57:55 UTC

So there is another relevant bugzilla you should be aware of:

Another VM restart issue (not related to the one fixed in 5.3):

https://bugzilla.redhat.com/show_bug.cgi?id=490449
https://bugzilla.redhat.com/show_bug.cgi?id=491654

This is part of item (a), and has been verified to be fixed.

Migration issue:

https://bugzilla.redhat.com/show_bug.cgi?id=303111

Porting to virsh will allegedly fix many of the xm-related lack-of-resource problems.  This should address item (d).


This leaves item (c).

Comment 13 Lon Hohberger 2009-04-02 20:24:58 UTC

FWIW item (d):

https://bugzilla.redhat.com/show_bug.cgi?id=412911

Comment 14 Chris Chen 2009-05-15 23:09:46 UTC

I saw this as well. I am running a 5.2 cluster with four members, and upon running a ccs_tool update cluster.conf, I saw this in my log:

May 15 15:12:34 tomskerritt ccsd[7265]: Update of cluster.conf complete (version
 20 -> 21). 
May 15 15:12:45 tomskerritt clurgmgrd[8453]: <notice> Reconfiguring 
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Initializing vm:emaitred 
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> vm:emaitred was added to the config, but I am not initializing it. 
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild02-i
scsi is running on 3 
May 15 15:12:46 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:radbuild04 is running on 3 
May 15 15:12:47 tomskerritt clurgmgrd[8453]: <notice> Migration: vm:emaitred is 
running on 5 
May 15 15:14:31 tomskerritt clurgmgrd[8453]: <notice> Recovering failed service 
vm:superdev 
May 15 15:14:32 tomskerritt kernel: device vif11.0 entered promiscuous mode
May 15 15:14:32 tomskerritt kernel: ADDRCONF(NETDEV_UP): vif11.0: link is not ready
May 15 15:14:32 tomskerritt clurgmgrd[8453]: <notice> Service vm:superdev started

Needless to say, superdev /was/ running before.

Comment 15 Lon Hohberger 2009-05-20 14:27:19 UTC

Chris, I think you hit the reconfiguration problem which was addressed in 5.3.z:

https://rhn.redhat.com/errata/RHBA-2009-0415.html
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=491654

Comment 16 Lon Hohberger 2009-09-22 15:52:40 UTC

Is there a way to reproduce item (c) from comment #0 ?  Effectively, without a reproducer (e.g. a method to make the VMs crash), there's not much we can do to fix it.

Also, it may be substantially more appropriate to fix the reason the VMs are crashing!

Comment 17 Lon Hohberger 2009-09-29 19:12:24 UTC

Actually, since we entirely use virsh and xm to manage VMs, item (c) is not very "fixable" without really digging in to things.

Comment 18 Perry Myers 2009-10-22 02:21:32 UTC

I agree with comment #16 and #17 here.  This needs to be fixed in the virt tools themselves.  rgmanager is not causing vnc sessions and qemu-dm processes to be left around when the domU crashes.  cc'ing bburns from the virt team to look at this.  Since item c) is the only item left open in the original list of issues, and item c) is really a Xen issue I'm inclined to change the component to xen.

Also, in order to figure out item c) we need reproducers or logs from when this happens.  Without that there's not much we can do.

Bill, what are your thoughts?

Comment 20 Jiri Denemark 2009-10-23 07:27:59 UTC

Yes, I agree leaving qemu-dm running is something which should be fixed in Xen. Not sure about Xvnc processes as they are not started by Xen. Anyway, as Perry already mentioned, we would need logs and ideally reproducers. And I think it would be a great idea to make a new BZ for this issue rather than using this one as it covers several issues and is full of irrelevant data.

Comment 23 Lon Hohberger 2010-07-01 18:41:30 UTC

The rgmanager specific issues have been addressed for some time (5.4 time frame).  The remaining issue seems to be Xen specific.

I've copied the relevant bits in to this bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=610208