Bug 490863 - Failback fails, kills rgmanager
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2009-03-18 07:36 EDT by Janne Peltonen
Modified: 2009-04-16 18:56 EDT

Doc Type: Bug Fix
Last Closed: 2009-03-18 10:42:52 EDT
Description Janne Peltonen 2009-03-18 07:36:22 EDT
Description of problem:

If I upgrade the rgmanager package to version 2.0.38-2, failback fails and kills rgmanager:

--clip--
Mar 18 11:51:08 scn1 clurgmgrd[11825]: <notice> Resource Group Manager Starting 
Mar 18 11:51:40 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15390a00 
Mar 18 11:51:41 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15391b20 
Mar 18 11:51:42 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x15391b20 
Mar 18 11:51:43 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x153928d0 
Mar 18 11:51:44 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153928d0 
Mar 18 11:51:45 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153947f0 
Mar 18 11:52:13 scn1 rgmanager: [17741]: <notice> Shutting down Cluster Service Manager... 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutting down 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutting down 
Mar 18 11:52:13 scn1 clurgmgrd[11825]: <notice> Shutdown complete, exiting 
Mar 18 11:52:13 scn1 rgmanager: [17741]: <notice> Cluster Service Manager is stopped. 
--clip--
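To summarize the "#37" errors in a log like the one above, a small script can count them per sender. This is only a sketch: the regular expression is written against the lines quoted here, and the meaning of the number after "from" (presumably a node ID) is an assumption, as are the names `HEADER_ERR`, `count_header_errors`, and `SAMPLE`.

```python
import re
from collections import Counter

# Matches the "#37" lines quoted in this report. The group names are this
# sketch's own labels, not rgmanager terminology.
HEADER_ERR = re.compile(
    r"clurgmgrd\[(?P<pid>\d+)\]: <err> #37: Error receiving header from "
    r"(?P<source>\d+) sz=(?P<size>\d+) CTX (?P<ctx>0x[0-9a-fA-F]+)"
)


def count_header_errors(lines):
    """Count '#37' header errors per value of the number after 'from'."""
    counts = Counter()
    for line in lines:
        match = HEADER_ERR.search(line)
        if match:
            counts[int(match.group("source"))] += 1
    return counts


# The six error lines from the log above, copied verbatim.
SAMPLE = [
    "Mar 18 11:51:40 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15390a00",
    "Mar 18 11:51:41 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 3 sz=0 CTX 0x15391b20",
    "Mar 18 11:51:42 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x15391b20",
    "Mar 18 11:51:43 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 2 sz=0 CTX 0x153928d0",
    "Mar 18 11:51:44 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153928d0",
    "Mar 18 11:51:45 scn1 clurgmgrd[11825]: <err> #37: Error receiving header from 4 sz=0 CTX 0x153947f0",
]

if __name__ == "__main__":
    for source, count in sorted(count_header_errors(SAMPLE).items()):
        print(f"source {source}: {count} header errors")
```

On the quoted log this reports two errors each for sources 2, 3, and 4, i.e. every other sender is affected equally.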


Version-Release number of selected component (if applicable):

2.0.38-2

How reproducible:

Steps to Reproduce:
1. Create some services in a prioritized failover domain in RHEL 5.1 (rgmanager 2.0.31-1)

2. Start the services

3. Update the rgmanager package to version 2.0.38-2 on the highest-priority node

4. Stop rgmanager, (clvmd,) and cman, then restart them
  
Actual results:

cman and clvmd start as expected, and rgmanager also starts, but when the services try to fail back to this higher-priority node, they fail and rgmanager gets killed (see the log above).

Expected results:

The services should fail back nicely. They do, if I downgrade rgmanager back to version 2.0.31-1.

Additional info:

I'm actually using CentOS; I haven't tried this with a real RHEL yet, since all our licenses are in use on production servers.
Comment 1 Lon Hohberger 2009-03-18 10:42:52 EDT
Well, that sounds like a bug, so it really doesn't matter whether you're using Debian, Fedora, RHEL, CentOS, or otherwise.

What I think you hit is related to rgmanager running out of descriptors:

https://bugzilla.redhat.com/show_bug.cgi?id=461956

This patch was included in the RHEL5.3 release:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=50dc172c12f728ebb5916e2059b01404d94dd066

Basically, after an event (joining/leaving the cluster, starting/stopping rgmanager, etc.), rgmanager could run itself out of connection descriptors because it would go to sleep while holding a lock when it shouldn't have.
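The failure mode described here can be illustrated with a toy model. This is not rgmanager's code: it is a single-threaded Python sketch under the assumption that each incoming request takes one descriptor from a fixed pool and needs a shared lock before it can finish and hand the descriptor back. The names `simulate`, `state_lock`, and the pool size are all hypothetical.

```python
import threading


def simulate(sleep_with_lock, descriptors=4, requests=8):
    """Toy model: return (served, refused) counts for incoming requests.

    All requests arrive while an event handler is asleep. Each request
    takes a descriptor, then needs state_lock to finish and return it.
    """
    state_lock = threading.Lock()
    free = descriptors
    pending = 0            # requests holding a descriptor, stuck on the lock
    served = refused = 0

    # The event handler takes the lock to update shared state ...
    state_lock.acquire()
    if not sleep_with_lock:
        state_lock.release()   # ... and, when behaving well, drops it before sleeping.

    for _ in range(requests):  # the handler is asleep for this whole loop
        if free == 0:
            refused += 1       # pool exhausted: the new connection fails
            continue
        free -= 1
        if state_lock.acquire(blocking=False):
            state_lock.release()
            free += 1          # request finished, descriptor returned
            served += 1
        else:
            pending += 1       # stalled behind the sleeping lock holder

    if sleep_with_lock:
        state_lock.release()   # the handler finally wakes up
    served += pending          # stalled requests can complete now
    return served, refused


if __name__ == "__main__":
    print("sleeping with the lock held:  ", simulate(True))
    print("lock released before sleeping:", simulate(False))
```

With the lock held across the sleep, the first four requests pin every descriptor and the remaining four are refused; with the lock released first, all eight are served.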

This produced a wide range of strange behaviors, and could easily have produced the problem you're seeing.

I would check the 5.3 release of rgmanager (2.0.46?); I'm pretty sure this is fixed there.
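A quick way to compare an installed version against that release is sketched below. It assumes version-release strings of the form seen in this report (such as 2.0.38-2), compares only the numeric version part, and is not a substitute for RPM's own version comparison. The threshold 2.0.46 is the tentative figure from this comment, not a confirmed fix version, and the function names are hypothetical.

```python
def upstream_version(version_release):
    """Turn '2.0.38-2' into (2, 0, 38); the release after '-' is ignored."""
    version = version_release.split("-", 1)[0]
    return tuple(int(part) for part in version.split("."))


# Assumption: tentative version from the comment above, marked "(?)" there.
TENTATIVE_FIXED = upstream_version("2.0.46")


def at_least_tentative_fix(version_release):
    """True if the version part is at or above the tentative threshold."""
    return upstream_version(version_release) >= TENTATIVE_FIXED


if __name__ == "__main__":
    for candidate in ("2.0.31-1", "2.0.38-2", "2.0.46"):
        print(candidate, at_least_tentative_fix(candidate))
```

The installed string can be obtained with `rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' rgmanager`.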
