Description of problem:
On a cluster configured with a large number of services, one or more services may fail to be enabled when rgmanager is started. The affected service also cannot be started with clusvcadm -e.

Version-Release number of selected component (if applicable):
1.9.43

How reproducible:
Very often

Steps to Reproduce:
1. Configure a large number of services in cluster.conf
2. Start rgmanager
3. Run clustat

Actual results:
clustat shows one service in the stopped state.

Expected results:
clustat shows all services started.

Additional info:
If you run clusvcadm -d and then clusvcadm -e on the failed service, there is no message in /var/log/messages showing that clurgmgrd tried to stop or start the service. In the attached cluster.conf, the ip service 10.251.2.82 is the only service that could not be started in the most recent test. The same problem has previously been seen with the script service nvbackup. Usually only one service fails, but it is not necessarily the same service every time.
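For reference, the command sequence described above looks like the following; the service name is taken from the attached cluster.conf, and the commands are run as root on a cluster member:

  # Check service states across the cluster
  clustat

  # Disable, then try to re-enable, the stuck service
  clusvcadm -d 10.251.2.82
  clusvcadm -e 10.251.2.82

  # In the failure case, no stop/start messages from clurgmgrd appear here
  tail /var/log/messages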
Created attachment 122970 [details]
Cluster.conf for two node cluster with 56 services
Created attachment 122971 [details]
Cluster.conf for two node cluster with 56 services

Ignore the previous attachment; this is the correct cluster.conf.
Henry, could you clarify one point for me -- Is the service in the 'stopped' state or the 'failed' state before issuing the -d/-e commands?
In the latest case where I've seen this, service 10.251.2.82 is in the "stopping" state. I have seen other cases with fewer services (16 instead of 56) where more than one service is in the "failed" state. However, in the earlier cases I don't believe we had the latest dlm kernel changes; in the latest case we do. I currently have the condition with 10.251.2.82 showing "stopping" if you want to take a look, or you can tell me what to look at.
Ok, if it's in the 'stopping' state, one of two things is going on:

(a) There should be one of the following errors in the log: #53, #54, or #55, which would indicate an error, or
(b) There is *no* error in the log, in which case there is likely a thread interlocking issue. If that is the case, it should be very easy to fix.

The simplest thing to do is:
- Install the rgmanager-debuginfo rpm corresponding to 1.9.43.
- On the machine where the problem occurs, run 'gdb `which clurgmgrd` `pidof clurgmgrd`'.
- Grab the output of 'thr a a bt'.

I have not seen this problem yet, but will try again to reproduce it tomorrow morning.
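The same capture can be done non-interactively if the installed gdb supports -batch/-ex ('thr a a bt' is shorthand for 'thread apply all bt'); otherwise run the interactive steps above:

  # Attach to the running daemon and dump all thread backtraces to a file
  gdb -batch -ex 'thread apply all bt' `which clurgmgrd` `pidof clurgmgrd` > clurgmgrd-backtrace.txt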
This problem has not been seen for a while. I do not recall seeing any error messages in the log previously. If we see it again, we will try to capture the info you requested.
Actually, I have a patch which I think will both prevent the problem you saw and speed up node transition handling by an order of magnitude.
Forgot to note -- the performance increase is only seen with large service counts. I tested with 83 services.
Typically we have seen this with large service counts. When can we get the patch?
I intended to have it this morning, but it requires other patches which are going into the next release, so it will be this afternoon or tomorrow morning.
Created attachment 124064 [details]
Source RPM with patch

This SRPM should fix:
* Issues with many services
* Issues where rgmanager blocks for long periods of time (which is related)
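A typical rebuild/install sequence for the SRPM would be something like the following; the exact file names and the RPMS output path are illustrative and depend on the build host:

  # Rebuild binary packages from the source RPM
  rpmbuild --rebuild rgmanager-*.src.rpm

  # Install/upgrade the resulting package (default output path shown; arch varies)
  rpm -Uvh /usr/src/redhat/RPMS/*/rgmanager-*.rpm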
*** Bug 175099 has been marked as a duplicate of this bug. ***