149410 – rgmanager doesn't reliably work with SM magma plugin when >2 nodes

Summary: rgmanager doesn't reliably work with SM magma plugin when >2 nodes

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-02-22 21:07 UTC by Lon Hohberger
Modified:	2009-04-16 20:16 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-02-28 23:06:04 UTC
Embargoed:

Description Lon Hohberger 2005-02-22 21:07:44 UTC

Description of problem:

When more than 2 nodes are booted and join the rgmanager service group (using a
CMAN cluster), some of the nodes seem not to detect and/or receive the
out-of-band (OOB) service-join message on the cluster socket.  The node joining
the service group, as well as every other node which receives the message, goes
into 'D' state (uninterruptible sleep) -- waiting for the nodes which "seem" to
have missed the join-request.

* This doesn't happen with other magma applications while using the sm plugin.
For instance, circleping works fine regardless of the node count, or so it seems.
* This doesn't happen when using the cman plugin (instead of the sm plugin).
* This doesn't happen when using the gulm plugin.
* This doesn't happen all the time.  Running with debug logging enabled seems to
make rgmanager join the service group correctly, which doesn't make sense.  If
it were a timing issue (preventing the receipt of the OOB messages), slowing
rgmanager down should make the problem worse, not better.
* This doesn't happen with multithreaded applications which access the service
manager directly (e.g. clvmd).

Still gathering more information.

Comment 2 Lon Hohberger 2005-02-22 22:39:36 UTC

Having a dedicated thread waiting for *only* cluster events seems to fix this. 
Will do slightly more testing and commit changes.
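
For illustration only, a minimal sketch of what "a dedicated thread waiting for
*only* cluster events" could look like.  This is not the rgmanager change
itself: struct event_watcher, event_watcher_main(), start_event_watcher() and
the stop pipe are hypothetical names invented for this sketch, and cluster_fd
merely stands in for the cluster socket.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical state shared with the watcher thread. */
struct event_watcher {
    int cluster_fd;   /* stand-in for the cluster socket (assumption) */
    int stop_fd;      /* read end of a pipe; readable means "shut down" */
    int events_seen;  /* written by the thread; read it only after joining */
};

/* Thread body: sleep in poll() until the cluster fd (or the stop pipe)
 * becomes readable.  No other work is multiplexed onto this thread. */
void *event_watcher_main(void *arg)
{
    struct event_watcher *w = arg;
    struct pollfd fds[2];
    char buf[256];

    fds[0].fd = w->cluster_fd;
    fds[0].events = POLLIN;
    fds[1].fd = w->stop_fd;
    fds[1].events = POLLIN;

    for (;;) {
        fds[0].revents = 0;
        fds[1].revents = 0;

        if (poll(fds, 2, -1) == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            break;
        }
        if (fds[0].revents & POLLIN) {
            /* A real handler would decode the event; here we only drain it. */
            if (read(w->cluster_fd, buf, sizeof(buf)) <= 0)
                break;
            w->events_seen++;
        }
        if (fds[1].revents & POLLIN)
            break;                 /* shutdown requested */
        if ((fds[0].revents | fds[1].revents) & (POLLERR | POLLHUP | POLLNVAL))
            break;                 /* an fd went away */
    }
    return NULL;
}

/* Start the watcher; returns 0 on success or a pthread error number. */
int start_event_watcher(pthread_t *tid, struct event_watcher *w)
{
    assert(tid != NULL && w != NULL);
    return pthread_create(tid, NULL, event_watcher_main, w);
}
```

To stop the thread in this sketch, write one byte to the write end of the stop
pipe and then pthread_join() it.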

Comment 3 Lon Hohberger 2005-02-28 23:06:04 UTC

Turns out it only "appeared" to be specific to the SM plugin, for some unknown
reason.  In reality, it was caused by the control thread entering accept(2) and
blocking there, because the listen file descriptors were not correctly set to
O_NONBLOCK.
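
As a sketch of the failure mode described above (not the actual rgmanager
patch): with O_NONBLOCK set on a listening descriptor, accept(2) fails with
EAGAIN or EWOULDBLOCK when no connection is pending instead of putting the
calling thread to sleep.  The helper names set_nonblocking() and try_accept(),
and the ACCEPT_NONE_PENDING sentinel, are hypothetical and exist only in this
example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>

enum { ACCEPT_NONE_PENDING = -2 };  /* hypothetical sentinel for this sketch */

/* Set O_NONBLOCK on fd while preserving its other file status flags.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);  /* a negative fd is a caller bug */

    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;     /* already non-blocking */
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}

/* Accept one connection without blocking, provided listen_fd has
 * O_NONBLOCK set.  Returns the new descriptor, ACCEPT_NONE_PENDING if
 * nothing is waiting, or -1 on any other error. */
int try_accept(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);

    if (fd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return ACCEPT_NONE_PENDING;
    return fd;
}
```

Calling set_nonblocking() on each listen descriptor right after listen(2) is
one way to keep a control thread from getting stuck in accept(2) when
select(2) or poll(2) reported readiness but the connection is gone by the time
accept runs.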
