Bug 177340
Summary: | Clurgmgrd not enabling all services on start up
---|---
Product: | [Retired] Red Hat Cluster Suite
Reporter: | Henry Harris <henry.harris>
Component: | rgmanager
Assignee: | Lon Hohberger <lhh>
Status: | CLOSED ERRATA
QA Contact: | Cluster QE <mspqa-list>
Severity: | medium
Priority: | medium
Version: | 4
CC: | cluster-maint, kanderso
Hardware: | x86_64
OS: | Linux
Fixed In Version: | RHBA-2007:0149
Doc Type: | Bug Fix
Last Closed: | 2007-06-21 13:58:45 UTC
Description
Henry Harris
2006-01-09 21:03:26 UTC
Created attachment 122970 [details]
Cluster.conf for two node cluster with 56 services
Created attachment 122971 [details]
Cluster.conf for two node cluster with 56 services
Ignore previous attachment; this is the correct cluster.conf
Henry, could you clarify one point for me -- Is the service in the 'stopped' state or the 'failed' state before issuing the -d/-e commands?

In the latest case where I've seen this, service 10.251.2.82 is in the "stopping" state. I have seen other cases with fewer services (16 instead of 56) where more than one service is in the "failed" state. However, in the earlier cases, I don't believe we had the latest dlm kernel changes. In the latest case we do have them. I have the condition with 10.251.2.82 showing "stopping" right now if you want to take a look, or you can tell me what to look at.

Ok, if it's in the 'stopping' state, one of two things is going on:
(a) There should be one of the following errors in the log: #53, #54, #55, which would indicate an error, or
(b) There is *no* error in the log, in which case there's likely a thread interlocking issue. If this is the case, it should be very easy to fix.
The simplest thing to do is:
- Install the rgmanager-debuginfo rpm corresponding to 1.9.43.
- On the machine where there's a problem, run 'gdb `which clurgmgrd` `pidof clurgmgrd`'.
- Grab the output of: 'thr a a bt'

I have not seen this problem yet, but will try again to reproduce it tomorrow morning.

This problem has not been seen for a while. I do not recall seeing any error messages in the log previously. If we see it again, we will try to capture the info you requested.

Actually, I have a patch which I think will both prevent the problem you saw and increase the speed of node transition handling by an order of magnitude.

Forgot to note -- the performance increase is only seen with large service counts. I tested with 83 services.

Typically we have seen this with large service counts. When can we get the patch?

I intended to have it this morning, but it requires other patches which are going into the next release, so it will be this afternoon or tomorrow morning.

Created attachment 124064 [details]
Source RPM with patch
This SRPM should fix:
* Issues with many services
* Issues where rgmanager blocks for long periods of time (which is related)
*** Bug 175099 has been marked as a duplicate of this bug. ***
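
For anyone who hits the 'stopping' state again, the gdb steps requested earlier in this thread can be run non-interactively, which makes the output easier to attach to a bug. The following is only a sketch under some assumptions: `capture_backtraces` is a hypothetical helper name (not part of rgmanager), it assumes `gdb` and `pidof` are installed, that the matching rgmanager-debuginfo rpm is present so the backtraces carry symbols, and that it is run as root. It loops over every pid that `pidof` reports, in case more than one clurgmgrd process is running.

```shell
#!/bin/sh
# Sketch: dump a backtrace of every thread of a running daemon using gdb.
# capture_backtraces is a hypothetical helper, not something shipped with rgmanager.
capture_backtraces() {
    daemon="${1:-clurgmgrd}"

    # pidof prints nothing and returns non-zero when no such process is running.
    pids="$(pidof "$daemon")" || {
        echo "no running $daemon process found" >&2
        return 1
    }

    for pid in $pids; do
        echo "=== $daemon pid $pid ==="
        # -batch makes gdb exit after running the given commands;
        # "thread apply all bt" is the long form of "thr a a bt".
        gdb -batch -p "$pid" -ex "thread apply all bt" 2>&1
    done
}
```

After sourcing the function, running `capture_backtraces clurgmgrd > clurgmgrd-bt.txt` on the affected node would write the backtraces to a file; the output file name here is just an example.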