Bug 149410

Summary:	rgmanager doesn't reliably work with SM magma plugin when >2 nodes
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Lon Hohberger <lhh>
Component:	rgmanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED UPSTREAM	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4	CC:	cluster-maint, teigland
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-02-28 23:06:04 UTC	Type:	---

Description Lon Hohberger 2005-02-22 21:07:44 UTC

Description of problem:

When more than 2 nodes are booted and join the rgmanager service group (using a
CMAN cluster), some of the nodes appear not to detect or receive the OOB
service-join message on the cluster socket.  The node joining the service group,
as well as all other nodes which did receive the message, go into 'D' state,
waiting for the nodes which "seem" to have missed the join request.

* This doesn't happen with other magma applications while using the sm plugin.
For instance, circleping works fine regardless of the node count, or so it
seems.
* This doesn't happen when using the cman plugin (instead of the sm plugin).
* This doesn't happen when using the gulm plugin.
* This doesn't happen all the time.  Running with debug logging enabled seems to
make rgmanager join the service group correctly, which doesn't make sense.  If
it were a timing issue (preventing the receipt of the OOB messages), slowing
rgmanager down should make the problem worse, not better.
* This doesn't happen with multithreaded applications which access the service
manager directly (e.g. clvmd).

Still gathering more information.

Comment 2 Lon Hohberger 2005-02-22 22:39:36 UTC

Having a dedicated thread waiting for *only* cluster events seems to fix this. 
Will do slightly more testing and commit changes.
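
For illustration only, a minimal sketch of what "a dedicated thread waiting for
only cluster events" could look like.  This is not the rgmanager code: the
struct, the function names, and the use of a pipe as a stand-in for the cluster
socket are all assumptions made for the example.  The thread blocks in
select(2) on a single descriptor and does nothing else.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical context: fd stands in for the cluster event socket. */
struct event_watcher {
    int fd;           /* descriptor that delivers cluster events */
    int events_seen;  /* one byte read == one event; read after join */
};

/* Thread body: wait on the event descriptor only, then consume one event. */
static void *event_thread(void *arg)
{
    struct event_watcher *w = arg;
    fd_set rfds;
    char byte;

    for (;;) {
        FD_ZERO(&rfds);
        FD_SET(w->fd, &rfds);
        int rv = select(w->fd + 1, &rfds, NULL, NULL, NULL);
        if (rv == -1 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (rv <= 0)
            break;                  /* unexpected error: stop */
        if (read(w->fd, &byte, 1) <= 0)
            break;                  /* EOF (writer closed) or error */
        w->events_seen++;
    }
    return NULL;
}

/* Demo: send nevents one-byte "events" through a pipe to the thread and
 * return how many it consumed, or -1 if setup failed. */
int run_event_thread_demo(int nevents)
{
    struct event_watcher w;
    pthread_t tid;
    int pfd[2];
    int i;

    if (pipe(pfd) == -1)
        return -1;
    w.fd = pfd[0];
    w.events_seen = 0;
    if (pthread_create(&tid, NULL, event_thread, &w) != 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    for (i = 0; i < nevents; i++) {
        if (write(pfd[1], "e", 1) != 1)
            break;
    }
    close(pfd[1]);                  /* EOF tells the thread to exit */
    pthread_join(tid, NULL);
    close(pfd[0]);
    return w.events_seen;
}
```

Because the event thread never touches the listen sockets, a blocking call
elsewhere cannot delay event delivery, which would be consistent with the
symptom going away.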

Comment 3 Lon Hohberger 2005-02-28 23:06:04 UTC

Turns out it only "appeared" on the SM plugin for some unknown reason.  In
reality, the control thread was entering accept(2) and blocking there, because
O_NONBLOCK had not been correctly set on the listen file descriptors.
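
A minimal sketch of the kind of fix this implies, assuming plain TCP listen
sockets.  The helper names (set_nonblocking, open_listener, accept_would_block)
are hypothetical and not taken from rgmanager.  With O_NONBLOCK set on the
listening descriptor, accept(2) returns -1 with EAGAIN or EWOULDBLOCK when no
connection is pending, instead of blocking the calling thread.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Add O_NONBLOCK to fd's status flags, preserving the existing flags.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: listening TCP socket on loopback, ephemeral port.
 * Returns the descriptor, or -1 on error. */
int open_listener(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(fd, 5) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Call accept(2) once on a non-blocking listener with no client pending.
 * Returns 1 if accept returned immediately with EAGAIN/EWOULDBLOCK,
 * 0 if it behaved otherwise, -1 if setup failed. */
int accept_would_block(void)
{
    int fd = open_listener();
    int rv, saved_errno;

    if (fd == -1)
        return -1;
    if (set_nonblocking(fd) == -1) {
        close(fd);
        return -1;
    }
    rv = accept(fd, NULL, NULL);
    saved_errno = errno;            /* close() below may change errno */
    if (rv != -1)
        close(rv);
    close(fd);
    return (rv == -1 &&
            (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK)) ? 1 : 0;
}
```

Without the set_nonblocking() call, the same accept(2) would sleep until a
client connects, which matches the blocked control thread described above.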