Bug 462679

Summary:	rgmanager fails to connect to other members when using recovery=restart policy
Product:	Red Hat Enterprise Linux 5	Reporter:	Herbert L. Plankl <h.plankl>
Component:	rgmanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5.2	CC:	clasohm, cluster-maint, cmarthal, edamato, godrimator, tao
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-01-20 20:56:43 UTC	Type:	---

Description Herbert L. Plankl 2008-09-18 08:48:37 UTC

Description of problem:
rgmanager tries to restart a failed service, but the service's start script returns exit code 1. The cluster should then relocate the service to another node (the service uses recovery=restart), but the relocation fails:

Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 


Version-Release number of selected component (if applicable):
EL5.2 x86_64 + latest updates

How reproducible:
* 2-node-cluster
* following domain
		<failoverdomains>
			<failoverdomain name="tom" nofailback="1" ordered="1" restricted="0">
				<failoverdomainnode name="tom" priority="1"/>
				<failoverdomainnode name="jerry" priority="2"/>
			</failoverdomain>
		</failoverdomains>
* following service
		<service autostart="1" domain="tom" exclusive="0" name="samba" recovery="restart">
			<script file="/etc/init.d/smb" name="samba"/>
		</service>
* after the service was running, I changed /etc/init.d/smb on one member so that "/etc/init.d/smb start" exits with return value 1

Steps to Reproduce:
1. start the cluster and services (service samba is running on the first node, in my case on tom)
2. change the init script smb on the first member (on tom) so that start always fails (I added "exit 1" to the "start" branch of the case statement)
3. stop service smb
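
The change from step 2 might look roughly like the sketch below. This is not the real /etc/init.d/smb; the function name smb_init and the echoed messages are hypothetical stand-ins so the snippet can be run on its own, and only the forced failure in the "start" branch mirrors what was actually changed.

```shell
#!/bin/sh
# Hypothetical stand-in for the modified /etc/init.d/smb on node "tom".
# Only the forced failure in the "start" branch reflects step 2;
# the other branches are simplified placeholders (assumptions).
smb_init() {
    case "$1" in
        start)
            echo "Starting SMB services: forced failure for testing"
            return 1    # the real script used "exit 1" at this point
            ;;
        stop)
            echo "Shutting down SMB services"
            return 0
            ;;
        status)
            # 3 is the LSB status code for "program is not running",
            # which matches "returned 3" in the log excerpts.
            return 3
            ;;
        *)
            echo "Usage: smb {start|stop|status}"
            return 2
            ;;
    esac
}
```

With this in place, every start attempt on that node reports failure, so a local restart by rgmanager cannot succeed and the service has to be relocated.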
  
Actual results:
* rgmanager recognizes the failed service and tries to restart it (recovery=restart)
* restart fails and rgmanager tries to relocate the service
* relocate fails and messages are:

Sep 18 10:23:11 jerry clurgmgrd: [3879]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> status on script "samba" returned 1 (generic error) 
Sep 18 10:23:11 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Recovering failed service service:samba 
Sep 18 10:23:12 jerry clurgmgrd: [3879]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Stopping service service:samba 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <notice> Service service:samba is recovering 
Sep 18 10:23:12 jerry clurgmgrd[3879]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <err> #58: Failed opening connection to member #2 
Sep 18 10:23:14 jerry clurgmgrd[3879]: <notice> Service service:samba is stopped 


Expected results:
* restart fails, so the cluster does a relocate
* relocate succeeds
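
The expected sequence can be pictured with the small simulation below. It is only a sketch of the behaviour described in this report (try a local restart first, then move on to the next node), not rgmanager code; start_on, recover_service and BROKEN_NODE are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Simplified simulation of the expected recovery flow; NOT rgmanager code.
# start_on, recover_service and BROKEN_NODE are hypothetical.

# Pretend to start the service on a node.
# Fails only on the node named in BROKEN_NODE.
start_on() {
    [ "$1" != "$BROKEN_NODE" ]
}

# Try a restart on the local node first, then try the remaining nodes in order.
# Prints the node that ends up running the service, or "stopped".
recover_service() {
    local_node=$1
    shift
    if start_on "$local_node"; then
        echo "$local_node"
        return 0
    fi
    for node in "$@"; do
        if start_on "$node"; then
            echo "$node"
            return 0
        fi
    done
    echo "stopped"
    return 1
}

BROKEN_NODE=tom
recover_service tom jerry   # prints "jerry"
```

In the failing case reported here, the outcome corresponds to "stopped" even though jerry was able to run the service.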

Additional info: An ordinary relocate (clusvcadm -r samba) works.

Comment 1 Herbert L. Plankl 2008-09-18 09:14:24 UTC

Oops, a small mistake: the messages in the description are from member jerry in a second test (I tested in both directions with the same results).

Messages from the first test (tom -> jerry), taken from tom:

Sep 18 10:07:09 tom clurgmgrd: [5316]: <err> script:samba: status of /etc/init.d/smb failed (returned 3) 
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:09 tom clurgmgrd[5316]: <notice> Recovering failed service service:samba 
Sep 18 10:07:10 tom clurgmgrd: [5316]: <err> script:samba: start of /etc/init.d/smb failed (returned 1) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> start on script "samba" returned 1 (generic error) 
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #68: Failed to start service:samba; return value: 1 
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Stopping service service:samba
Sep 18 10:07:10 tom clurgmgrd[5316]: <notice> Service service:samba is recovering
Sep 18 10:07:10 tom clurgmgrd[5316]: <warning> #71: Relocating failed service service:samba 
Sep 18 10:07:12 tom clurgmgrd[5316]: <err> #58: Failed opening connection to member #1 
Sep 18 10:07:12 tom clurgmgrd[5316]: <notice> Service service:samba is stopped

Comment 2 Lon Hohberger 2008-09-18 13:52:14 UTC

There's an event processing bug which causes rgmanager to sleep while holding a lock. This causes timeouts if multiple events come in simultaneously, which sounds like what you're hitting.

I can patch 5.2 rgmanager with the 5.3 patch if you would like to test it.

The patch looks like this:

http://git.fedorahosted.org/git/?p=cluster.git;a=blobdiff;f=rgmanager/src/daemons/rg_event.c;h=d2c7cd331c71aeef70f7d8ec9505da6fd81af08b;hp=08137de3772aeb51914be593c03d144fcc910474;hb=50dc172c12f728ebb5916e2059b01404d94dd066;hpb=1cc68885904a1e393e2dcd6d788ae5099ef7124d

Comment 3 Herbert L. Plankl 2008-09-18 15:13:18 UTC

That was fast :)
Yes, I'd like to test it. Is it possible to get an updated rpm?

Comment 4 Lon Hohberger 2008-09-18 18:31:07 UTC

http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.src.rpm
http://people.redhat.com/lhh/rgmanager-2.0.38-2.2.462679.x86_64.rpm

Let me know if you need a different architecture.

Comment 5 Herbert L. Plankl 2008-09-19 09:19:49 UTC

Thanks - architecture ok.
Tested it -> works well; bug seems to be fixed. Thank you!

BTW: I assume official rpms will be available with EL5.3? When will EL5.3 be released?

Comment 6 Lon Hohberger 2008-09-19 17:32:25 UTC

Correct; it will be fixed in 5.3.  I don't have current information for GA date of 5.3, but beta should be pretty soon (month or so, I think).

Comment 8 Lon Hohberger 2008-09-19 17:36:13 UTC

RHEL4 has this bug too, it's:

https://bugzilla.redhat.com/show_bug.cgi?id=461956

Comment 12 Lon Hohberger 2009-01-16 16:55:06 UTC

5.3 is coming, I promise >:)

Comment 13 errata-xmlrpc 2009-01-20 20:56:43 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0101.html

Comment 14 Lon Hohberger 2009-01-22 14:43:28 UTC

*** Bug 481133 has been marked as a duplicate of this bug. ***

Comment 15 Lon Hohberger 2009-02-04 18:09:27 UTC

This bug has the same cause as (but is a different symptom of) bug #461956.