Bug 490863

Summary:	Failback fails, kills rgmanager
Product:	Red Hat Enterprise Linux 5	Reporter:	Janne Peltonen <janne.peltonen>
Component:	rgmanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	low
Version:	5.2	CC:	cluster-maint, edamato
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-03-18 14:42:52 UTC	Type:	---

Description Janne Peltonen 2009-03-18 11:36:22 UTC

Description of problem:

If I upgrade the rgmanager package to version 2.0.38-2, failback fails and kills rgmanager:

--clip--
Mar 18 11:51:08 scn1 clurgmgrd[11825]: <notice> Resource Group Manager Starting 
Mar 18 11:51:40 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15390a00 
Mar 18 11:51:41 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15391b20 
Mar 18 11:51:42 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x15391b20 
Mar 18 11:51:43 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x153928d0 
Mar 18 11:51:44 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153928d0 
Mar 18 11:51:45 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153947f0 
Mar 18 11:52:13 scn1 rgmanager: [17741]: <notice> Shutting down Cluster Service Manager... 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutting down 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutting down 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutdown complete, exiting 
Mar 18 11:52:13 scn1 rgmanager: [17741]: <notice> Cluster Service Manager is stopped. 
--clip--


Version-Release number of selected component (if applicable):

2.0.38-2

How reproducible:

Steps to Reproduce:
1. Create some services in a prioritized failover domain in RHEL 5.1 (rgmanager 2.0.31-1)

2. Start the services

3. Update the rgmanager package to version 2.0.38-2 on the highest-priority node

4. Stop rgmanager, (clvmd,) and cman, then restart them
  
Actual results:

cman and clvmd start as expected, and rgmanager starts too, but when the services try to fail back to this node, which is now the preferred one, they fail and rgmanager gets killed (see the log above)

Expected results:

The services should fail back nicely. They do, if I downgrade rgmanager back to version 2.0.31-1.

Additional info:

I'm actually using CentOS; I haven't tried this with a real RHEL yet, since all our licenses are in use on production servers.

Comment 1 Lon Hohberger 2009-03-18 14:42:52 UTC

Well, that sounds like a bug, so it really doesn't matter whether you're using Debian, Fedora, RHEL, CentOS, or otherwise.

What I think you hit is related to rgmanager running out of descriptors:

https://bugzilla.redhat.com/show_bug.cgi?id=461956

This patch was included in the RHEL5.3 release:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=50dc172c12f728ebb5916e2059b01404d94dd066

Basically, after an event (joining/leaving the cluster, starting/stopping rgmanager, etc.), rgmanager could run itself out of connection descriptors because it would go to sleep while holding a lock when it shouldn't have.

This produced a wide range of strange behaviors, and easily could have produced the problem you're seeing.
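
As an illustration of that failure mode, here is a toy Python sketch. It is not rgmanager code and does not reproduce the actual patch; the names (refused_connections, state_lock, limit, incoming) and the numbers are all made up for the example. It only shows why sleeping while holding a lock can exhaust a fixed pool of descriptors: every accepted connection keeps its descriptor open while it waits for the lock, so later connections get refused.

```python
import threading

def refused_connections(sleep_holding_lock, incoming=6, limit=4):
    """Count refused connections while an event handler is asleep.

    Hypothetical model: 'limit' descriptors exist, and each accepted
    connection keeps one open until it has taken 'state_lock' once.
    """
    state_lock = threading.Lock()    # stands in for a shared-state lock
    ready = threading.Event()        # set once the handler starts sleeping
    wake = threading.Event()         # ends the handler's simulated sleep
    counts = {"open": 0, "refused": 0}
    counts_guard = threading.Lock()  # protects the counters only

    def event_handler():
        if sleep_holding_lock:
            with state_lock:         # problematic: sleeps with the lock held
                ready.set()
                wake.wait()
        else:
            ready.set()
            wake.wait()              # sleeps first, takes the lock afterwards
            with state_lock:
                pass

    def handle_request():
        with state_lock:             # every request needs the shared state
            pass
        with counts_guard:
            counts["open"] -= 1      # descriptor closed once the request is done

    handler = threading.Thread(target=event_handler)
    handler.start()
    ready.wait()

    workers = []
    for _ in range(incoming):
        with counts_guard:
            if counts["open"] >= limit:
                counts["refused"] += 1   # out of descriptors
                continue
            counts["open"] += 1          # connection accepted
        worker = threading.Thread(target=handle_request)
        worker.start()
        worker.join(timeout=0.05)        # give it a moment, then move on
        workers.append(worker)

    wake.set()                           # the handler wakes up
    handler.join()
    for worker in workers:
        worker.join()
    return counts["refused"]

if __name__ == "__main__":
    print("lock held while sleeping:", refused_connections(True))
    print("lock released before sleeping:", refused_connections(False))
```

With the defaults above, the first call should report 2 refused connections (four descriptors stay pinned behind the lock) and the second should report 0.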

I would check the 5.3 release of rgmanager (2.0.46?); I'm pretty sure this is fixed there.