Bug 462679 - rgmanager fails to connect to other members when using recovery=restart policy
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Duplicates: 481133
Reported: 2008-09-18 04:48 EDT by Herbert L. Plankl
Modified: 2010-10-23 00:36 EDT
Doc Type: Bug Fix
Last Closed: 2009-01-20 15:56:43 EST
Description Herbert L. Plankl 2008-09-18 04:48:37 EDT
Description of problem:
rgmanager tries to restart a failed service, but the service's start script returns exit code 1. With recovery=restart, the cluster should then relocate the service to another node, but that doesn't work:

Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 


Version-Release number of selected component (if applicable):
EL5.2 x86_64 + latest updates

How reproducible:
* 2-node cluster
* the following failover domain
		<failoverdomains>
			<failoverdomain name="tom" nofailback="1" ordered="1" restricted="0">
				<failoverdomainnode name="tom" priority="1"/>
				<failoverdomainnode name="jerry" priority="2"/>
			</failoverdomain>
		</failoverdomains>
* the following service
		<service autostart="1" domain="tom" exclusive="0" name="samba" recovery="restart">
			<script file="/etc/init.d/smb" name="samba"/>
		</service>
* once the service was running, I changed /etc/init.d/smb on one member so that calling "/etc/init.d/smb start" exits with return value 1

Steps to Reproduce:
1. Start the cluster and services (service samba is running on the first node, in my case on tom)
2. Change the init script smb on the first member (on tom) so that start always fails (I did it with "exit 1" in the "start" branch of the case statement)
3. Stop service smb
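
A minimal sketch of the kind of change described in step 2, for anyone who wants to reproduce this without editing the real Samba init script. The function name smb_init is hypothetical and stands in for /etc/init.d/smb; the real script has more actions and does actual work. The status return value of 3 is an assumption chosen to match the "returned 3" in the log and the LSB convention for "program is not running".

```shell
#!/bin/sh
# Hypothetical stand-in for the modified /etc/init.d/smb (not the real script).
smb_init() {
    case "$1" in
        start)
            # Injected failure: start always fails with return value 1,
            # which is what makes the recovery=restart attempt fail.
            return 1
            ;;
        stop)
            # Stop succeeds, so the resource manager can clean up.
            return 0
            ;;
        status)
            # 3 = "program is not running" (LSB convention, assumed here).
            return 3
            ;;
        *)
            echo "Usage: smb_init {start|stop|status}" >&2
            return 2
            ;;
    esac
}
```

Calling smb_init start and then checking $? shows the return value 1 that the cluster's restart attempt would see.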
  
Actual results:
* rgmanager detects the failed service and tries to restart it (recovery=restart)
* the restart fails and rgmanager tries to relocate the service
* the relocation fails with these messages:

Sep 18 10:23:11 jerry clurgmgrd: [3879]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> status on script "samba" returned 1 (generic error) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Recovering failed service service:samba 
Sep 18 10:23:12 jerry clurgmgrd: [3879]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <notice> Service service:samba is stopped 


Expected results:
* restart fails, so the cluster does a relocate
* relocate succeeds

Additional info: An ordinary relocate (clusvcadm -r samba) works.
Comment 1 Herbert L. Plankl 2008-09-18 05:14:24 EDT
Oops, small mistake: the messages in the description (comment #0) are from member jerry, from a second test (I tested it in both directions with the same results).

messages from the first test (tom -> jerry) from tom:

Sep 18 10:07:09 tom clurgmgrd: [5316]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Recovering failed service service:samba 
Sep 18 10:07:10 tom clurgmgrd: [5316]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:07:12 tom clurgmgrd[5316]: <err> #58: Failed opening connection to member #1 
Sep 18 10:07:12 tom clurgmgrd[5316]: <notice> Service service:samba is stopped
Comment 2 Lon Hohberger 2008-09-18 09:52:14 EDT
There's an event processing bug which causes rgmanager to sleep while holding a lock - this causes timeouts if multiple events come in simultaneously, which sounds like what you're hitting.
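
To illustrate the general pattern only (this is a toy model, not rgmanager's code or the actual patch): if a handler sleeps while holding a lock, a second event that needs the same lock can time out. In the sketch below a directory stands in for the lock, and every name (acquire, release, handle_event_buggy, handle_event_fixed, second_event, run) and every timing value is made up for illustration.

```shell
#!/bin/sh
# Toy model: a directory acts as the lock, since mkdir is atomic.
LOCK="${TMPDIR:-/tmp}/toy_event_lock.$$"

# Try to take the lock, retrying once per second for up to $1 attempts.
acquire() {
    tries=$1
    while ! mkdir "$LOCK" 2>/dev/null; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1    # gave up: the caller sees a timeout
        fi
        sleep 1
    done
    return 0
}

release() { rmdir "$LOCK"; }

# Buggy pattern: the slow step (sleep) runs while the lock is still held.
handle_event_buggy() { acquire 10 && sleep 3 && release; }

# Fixed pattern: the lock is dropped before the slow step.
handle_event_fixed() { acquire 10 && release && sleep 3; }

# A second event that needs the same lock and gives up after about 2 seconds.
second_event() {
    if acquire 2; then
        release
        echo "connected"
    else
        echo "timeout"
    fi
}

# Run one handler in the background, then fire the second event 1 second later.
run() {
    "$1" &
    sleep 1
    second_event
    wait
}
```

With these made-up timings, run handle_event_buggy ends with the second event timing out, while run handle_event_fixed lets it through.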

I can patch 5.2 rgmanager with the 5.3 patch if you would like to test it.

The patch looks like this:

http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/rg_event.c;h=d2c7cd331c71aeef70f7d8ec9505da6fd81af08b;hp=08137de3772aeb51914be593c03d144fcc910474;hb=50dc172c12f728ebb5916e2059b01404d94dd066;hpb=1cc68885904a1e393e2dcd6d788ae5099ef7124d
Comment 3 Herbert L. Plankl 2008-09-18 11:13:18 EDT
That was fast :)
Yes, I'd like to test it. Is it possible to get an updated rpm?
Comment 4 Lon Hohberger 2008-09-18 14:31:07 EDT
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.src.rpm
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.x86_64.rpm

Let me know if you need a different architecture.
Comment 5 Herbert L. Plankl 2008-09-19 05:19:49 EDT
Thanks - architecture ok.
Tested it -> works well; bug seems to be fixed. Thank you!

BTW: I assume official rpms will be available with EL5.3? When will EL5.3 be released?
Comment 6 Lon Hohberger 2008-09-19 13:32:25 EDT
Correct; it will be fixed in 5.3.  I don't have current information for GA date of 5.3, but beta should be pretty soon (month or so, I think).
Comment 8 Lon Hohberger 2008-09-19 13:36:13 EDT
RHEL4 has this bug too, it's:

https://bugzilla.redhat.com/show_bug.cgi?id=461956
Comment 12 Lon Hohberger 2009-01-16 11:55:06 EST
5.3 is coming, I promise >:)
Comment 13 errata-xmlrpc 2009-01-20 15:56:43 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0101.html
Comment 14 Lon Hohberger 2009-01-22 09:43:28 EST
*** Bug 481133 has been marked as a duplicate of this bug. ***
Comment 15 Lon Hohberger 2009-02-04 13:09:27 EST
This bug has the same cause as bug #461956, but is a different symptom of it.
