Bug 168518 - xdr_send() is broken
Status: CLOSED WONTFIX
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gulm
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Chris Feist
QA Contact: Cluster QE
Reported: 2005-09-16 16:12 EDT by Adam "mantis" Manthei
Modified: 2009-04-16 16:02 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-05-04 16:55:45 EDT


Description Adam "mantis" Manthei 2005-09-16 16:12:55 EDT
Description of problem:
The xdr_send() function assumes that sock_sendmsg() will report an error if
the receiving node has crashed.  gulm relies heavily on this assumption.  For
example, when xdr_send() returns successfully, callbacks are assumed to have
been delivered and to be in processing on the remote machine.  This is not
always the case, and as a result it is possible to deadlock a cluster.

Under certain failure conditions, the xdr_send() function will erroneously
report that a message was sent successfully.  One such case is when nodes are
crashed using `reboot -fn`.
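
The gap between "send returned success" and "the peer received the message"
can be illustrated in user space.  The following is a minimal sketch, not gulm
code: it uses the POSIX send() call as a stand-in for the kernel's
sock_sendmsg(), and the helpers send_only() and send_with_ack(), the one-byte
acknowledgement, and the timeout are all assumptions made for illustration.
A successful send() only means the local kernel accepted the bytes, so the
sender cannot tell a busy peer from a dead one unless it waits for a reply.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: hand a message to the kernel and return what send()
 * reported.  Success only means the bytes were queued locally; it says
 * nothing about whether the peer is alive or has read them. */
static ssize_t send_only(int fd, const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    /* MSG_NOSIGNAL: return an error instead of raising SIGPIPE. */
    return send(fd, buf, len, MSG_NOSIGNAL);
}

/* Hypothetical helper: send a message, then wait up to timeout_ms for a
 * one-byte acknowledgement from the peer.
 * Returns 0 if the peer acknowledged, -1 on send failure, timeout or error. */
static int send_with_ack(int fd, const void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char ack;

    /* A short write is treated as a failure to keep the sketch simple. */
    if (send_only(fd, buf, len) != (ssize_t)len)
        return -1;

    /* No reply within the timeout: the peer is silent, possibly dead. */
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;

    return recv(fd, &ack, 1, 0) == 1 ? 0 : -1;
}
```

With a peer that never reads its socket, send_only() would still be expected
to return the full length, while send_with_ack() returns -1 once the timeout
expires.  The cost is an extra round trip per message.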

Version-Release number of selected component (if applicable):
GFS-6.0.2.27-0

How reproducible:
Always

Steps to Reproduce:
1. This can quickly be verified by having more than two nodes log into
lock_gulmd at the same time.

2. Kill one of the nodes, wait a few seconds, then kill another node (using
`reboot -fn` in both cases).  

3. Monitor /var/log/messages.  lock_gulmd_core will first send an update message
about the first node failing to all the nodes in the cluster, and will report
successfully sending that message to the second node, which has already failed.

If xdr_send() were working correctly, the send to the second node would report
an error.
  
Actual results:
The message is reported as sent, without any error.

Expected results:
An error should be propagated through xdr_send().

Additional info:
Bug #160494 is affected by this problem.  The workaround for the problem in bug
#160494 was to add an additional queue to make sure that the callbacks were
properly sent.  Other problems may still exist where messages are being lost.
An audit of the lock_gulmd code should be done to identify other areas where
messages are not being sent properly and are lost as a result.
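
As an illustration of the general idea of tracking messages until they are
confirmed, here is a minimal sketch of a pending-message table.  It is not the
actual change made for bug #160494; the names pending_queue, pq_track(),
pq_ack() and pq_outstanding(), and the fixed capacity, are all hypothetical.
Each sent message stays in the table until the peer acknowledges it, so
anything still listed when a node is declared dead is known to need resending
or cleanup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PENDING_MAX 16              /* hypothetical fixed capacity */

struct pending_msg {
    unsigned id;                    /* sequence number, 0 is never used */
    int in_use;                     /* nonzero while awaiting an ack */
};

struct pending_queue {
    struct pending_msg slots[PENDING_MAX];
    unsigned next_id;
};

static void pq_init(struct pending_queue *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof(*q));
    q->next_id = 1;
}

/* Record a message as sent but unacknowledged.
 * Returns its id, or 0 if the table is full. */
static unsigned pq_track(struct pending_queue *q)
{
    for (size_t i = 0; i < PENDING_MAX; i++) {
        if (!q->slots[i].in_use) {
            q->slots[i].in_use = 1;
            q->slots[i].id = q->next_id++;
            return q->slots[i].id;
        }
    }
    return 0;
}

/* Drop an entry once the peer has acknowledged it.
 * Returns 0 if the id was pending, -1 if it is unknown or already acked. */
static int pq_ack(struct pending_queue *q, unsigned id)
{
    for (size_t i = 0; id != 0 && i < PENDING_MAX; i++) {
        if (q->slots[i].in_use && q->slots[i].id == id) {
            q->slots[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Count messages that were sent but never acknowledged. */
static size_t pq_outstanding(const struct pending_queue *q)
{
    size_t n = 0;
    for (size_t i = 0; i < PENDING_MAX; i++)
        n += q->slots[i].in_use ? 1 : 0;
    return n;
}
```

A sender would call pq_track() before each send and pq_ack() when the reply
arrives.  The sketch ignores sequence-number wraparound and locking, both of
which a real messaging layer would have to handle.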

Note: fixing this bug may cause a severe performance hit as it might require
adding further robustness to the messaging layer of gulm.
Comment 1 Chris Feist 2006-05-04 16:55:45 EDT
Closing this unless it becomes a huge issue in the future.
