Bug 149410 - rgmanager doesn't reliably work with SM magma plugin when >2 nodes
Status: CLOSED UPSTREAM
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-02-22 16:07 EST by Lon Hohberger
Modified: 2009-04-16 16:16 EDT (History)
Doc Type: Bug Fix
Last Closed: 2005-02-28 18:06:04 EST

Description Lon Hohberger 2005-02-22 16:07:44 EST
Description of problem:

When more than 2 nodes are booted and join the rgmanager service group (using a
CMAN cluster), some of the nodes seem to not detect and/or receive the OOB
service-join message on the cluster socket.  The node joining the service group
as well as all other nodes which receive the message go into 'D' state --
waiting for the nodes which "seem" to have missed the join-request.

* This doesn't happen with other magma applications while using the sm plugin.
For instance, circleping seems to work fine regardless of the node count.
* This doesn't happen when using the cman plugin (instead of the sm plugin).
* This doesn't happen when using the gulm plugin.
* This doesn't happen all the time.  Running with debug logging enabled seems to
make rgmanager join the service group correctly, which doesn't make sense: if
this were a timing issue (preventing the receipt of the OOB messages), slowing
rgmanager down should make the problem worse, not better.
* This doesn't happen with multithreaded applications which access the service
manager directly (e.g. clvmd).

Still gathering more information.
Comment 2 Lon Hohberger 2005-02-22 17:39:36 EST
Having a dedicated thread waiting for *only* cluster events seems to fix this. 
Will do slightly more testing and commit changes.
Comment 3 Lon Hohberger 2005-02-28 18:06:04 EST
Turns out it only "appeared" on the SM plugin for some unknown reason.  In
reality, it was caused by the control thread entering accept(2) and being
blocked because the listen file descriptors were not correctly set O_NONBLOCK.
