Bug 462679

Summary: rgmanager fails to connect to other members when using recovery=restart policy
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Herbert L. Plankl <h.plankl>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: clasohm, cluster-maint, cmarthal, edamato, godrimator, tao
Doc Type: Bug Fix
Last Closed: 2009-01-20 20:56:43 UTC

Description Herbert L. Plankl 2008-09-18 08:48:37 UTC
Description of problem:
rgmanager tries to restart a failed service whose start action exits with status 1. With recovery=restart, the cluster should then relocate the service to another node, but the relocation fails:

Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 


Version-Release number of selected component (if applicable):
EL5.2 x86_64 + latest updates

How reproducible:
* 2-node cluster
* the following failover domain:
		<failoverdomains>
			<failoverdomain name="tom" nofailback="1" ordered="1" restricted="0">
				<failoverdomainnode name="tom" priority="1"/>
				<failoverdomainnode name="jerry" priority="2"/>
			</failoverdomain>
		</failoverdomains>
* the following service:
		<service autostart="1" domain="tom" exclusive="0" name="samba" recovery="restart">
			<script file="/etc/init.d/smb" name="samba"/>
		</service>
* once the service was running, I changed /etc/init.d/smb on one member so that "/etc/init.d/smb start" exits with return value 1

Steps to Reproduce:
1. start cluster and services (service samba is running on first node - in my case on tom)
2. change the smb init script on the first member (tom) so that start always fails (I added "exit 1" to the "start" branch of the case clause)
3. stop service smb
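
The change in step 2 can be sketched roughly as follows. This is a hypothetical stand-in, not the real /etc/init.d/smb: the function name smb_action and the stubbed stop/status branches are assumptions made for illustration. The status value 3 matches the "returned 3" seen in the logs and is the LSB code for "program is not running".

```shell
#!/bin/sh
# Hypothetical stand-in for the modified init script (not the real smb script).
# Only the "start" branch reflects the change described in step 2.
smb_action() {
    case "$1" in
        start)
            # Forced failure: start always reports a generic error.
            echo "start: forced failure for testing" >&2
            return 1
            ;;
        stop)
            # Stubbed: pretend the daemon stopped cleanly.
            return 0
            ;;
        status)
            # 3 = "program is not running" in the LSB init script convention.
            return 3
            ;;
        *)
            echo "Usage: {start|stop|status}" >&2
            return 2
            ;;
    esac
}
```

A real init script would finish by calling the dispatcher with its first argument and exiting with that status; that line is left out here so the function can be sourced and tried on its own.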
  
Actual results:
* rgmanager recognizes the failed service and tries to restart it (recovery=restart)
* restart fails and rgmanager tries to relocate the service
* relocate fails and messages are:

Sep 18 10:23:11 jerry clurgmgrd: [3879]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> status on script "samba" returned 1 (generic error) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Recovering failed service service:samba 
Sep 18 10:23:12 jerry clurgmgrd: [3879]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <notice> Service service:samba is stopped 


Expected results:
* restart fails, so the cluster relocates the service
* relocate succeeds

Additional info: An ordinary relocate (clusvcadm -r samba) is working.

Comment 1 Herbert L. Plankl 2008-09-18 09:14:24 UTC
Oops, a small mistake: the messages in the description (comment #0) are from member jerry in a second test (I tested in both directions with the same results).

Messages from the first test (tom -> jerry), taken from tom:

Sep 18 10:07:09 tom clurgmgrd: [5316]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Recovering failed service service:samba 
Sep 18 10:07:10 tom clurgmgrd: [5316]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:07:12 tom clurgmgrd[5316]: <err> #58: Failed opening connection to member #1 
Sep 18 10:07:12 tom clurgmgrd[5316]: <notice> Service service:samba is stopped

Comment 2 Lon Hohberger 2008-09-18 13:52:14 UTC
There's an event processing bug which causes rgmanager to sleep while holding a lock. This causes timeouts if multiple events come in simultaneously, which sounds like what you're hitting.
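
To illustrate the general failure mode only (this is not the rgmanager code or the patch), here is a small analogy using flock(1) from util-linux. A file lock stands in for rgmanager's internal lock; the function name, lock file, and durations are all made up for the example.

```shell
#!/bin/sh
# Analogy only: a file lock stands in for an internal lock in a daemon.
# Function name, lock file, and timings are hypothetical.
demo_sleep_with_lock() {
    lock=$(mktemp) || return 2

    # "Event 1": takes the lock, then sleeps 3 seconds while holding it.
    flock "$lock" sleep 3 &
    holder=$!

    # Give the holder a moment to acquire the lock.
    sleep 1

    # "Event 2": arrives in the meantime and waits at most 1 second.
    if flock -w 1 "$lock" true; then
        result="acquired"
    else
        result="timed out"
    fi

    # Clean up once the holder has released the lock.
    wait "$holder"
    rm -f "$lock"
    echo "$result"
}
```

Calling demo_sleep_with_lock should report that the second request gave up, because the holder was still asleep with the lock when the one-second wait expired. Releasing the lock before sleeping would let the second request through.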

I can patch 5.2 rgmanager with the 5.3 patch if you would like to test it.

The patch looks like this:

http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/rg_event.c;h=d2c7cd331c71aeef70f7d8ec9505da6fd81af08b;hp=08137de3772aeb51914be593c03d144fcc910474;hb=50dc172c12f728ebb5916e2059b01404d94dd066;hpb=1cc68885904a1e393e2dcd6d788ae5099ef7124d

Comment 3 Herbert L. Plankl 2008-09-18 15:13:18 UTC
That was fast :)
Yes, I'd like to test it. Is it possible to get an updated rpm?

Comment 4 Lon Hohberger 2008-09-18 18:31:07 UTC
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.src.rpm
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.x86_64.rpm

Let me know if you need a different architecture.

Comment 5 Herbert L. Plankl 2008-09-19 09:19:49 UTC
Thanks - architecture ok.
Tested it -> works well; bug seems to be fixed. Thank you!

BTW: I assume official rpms will be available with EL5.3? When will EL5.3 be released?

Comment 6 Lon Hohberger 2008-09-19 17:32:25 UTC
Correct; it will be fixed in 5.3.  I don't have current information for GA date of 5.3, but beta should be pretty soon (month or so, I think).

Comment 8 Lon Hohberger 2008-09-19 17:36:13 UTC
RHEL4 has this bug too; it is tracked as:

https://bugzilla.redhat.com/show_bug.cgi?id=461956

Comment 12 Lon Hohberger 2009-01-16 16:55:06 UTC
5.3 is coming, I promise >:)

Comment 13 errata-xmlrpc 2009-01-20 20:56:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0101.html

Comment 14 Lon Hohberger 2009-01-22 14:43:28 UTC
*** Bug 481133 has been marked as a duplicate of this bug. ***

Comment 15 Lon Hohberger 2009-02-04 18:09:27 UTC
This bug has the same cause as (but is a different symptom of) bug #461956.