Bug 168518

Summary: xdr_send() is broken
Product: [Retired] Red Hat Cluster Suite
Reporter: Adam "mantis" Manthei <amanthei>
Component: gulm
Assignee: Chris Feist <cfeist>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 3
CC: cluster-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-05-04 20:55:45 UTC

Description Adam "mantis" Manthei 2005-09-16 20:12:55 UTC
Description of problem:
The xdr_send() function assumes that sock_sendmsg() will report an error if
the receiving node has crashed.  gulm relies heavily on this assumption.  For
example, when xdr_send() returns, callbacks are assumed to have been sent
successfully and to be in processing on the remote machine.  This is not the
case, however, and as a result it is possible to deadlock a cluster.

Under certain failure conditions, the xdr_send() function will erroneously
report that a message was sent successfully.  One such example is when nodes
are crashed using `reboot -fn`.
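
The behaviour described above can be illustrated in user space. The following
is a minimal Python sketch, not the gulm kernel code: it assumes a local
socket pair stands in for the connection between two nodes, and the function
name send_to_idle_peer is hypothetical. It shows that a send call reports
success as soon as the data has been queued locally, even though the peer
never reads it.

```python
import socket


def send_to_idle_peer(payload: bytes) -> int:
    """Send payload to a peer that never reads it.

    Returns the number of bytes the local socket layer accepted.
    The "peer" here is just the other end of a local socket pair
    (an assumption made for illustration only).
    """
    sender, idle_peer = socket.socketpair()
    try:
        # send() returns once the data is queued in local buffers.
        # It says nothing about whether the peer received or processed it.
        return sender.send(payload)
    finally:
        sender.close()
        idle_peer.close()


if __name__ == "__main__":
    accepted = send_to_idle_peer(b"callback")
    print("bytes accepted by the local socket layer:", accepted)
```

A successful return here only means the payload was accepted for delivery,
which is the same kind of "success" that xdr_send() passes up to its callers.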

Version-Release number of selected component (if applicable):
GFS-6.0.2.27-0

How reproducible:
Always

Steps to Reproduce:
1. This can be verified quickly by having more than two nodes logged into
lock_gulmd at the same time.

2. Kill one of the nodes, wait a few seconds, then kill another node (using
`reboot -fn` in both cases).  

3. Monitor /var/log/messages.  lock_gulmd_core will first send an update
message about the first node's failure to all the nodes in the cluster, and
the send to the second failed node is reported as successful.

If xdr_send() is working correctly, this should report an error.
  
Actual results:
The message is sent without reporting an error.

Expected results:
An error should be propagated through xdr_send().

Additional info:
Bug #160494 is affected by this problem.  The workaround for the problem in
bug #160494 was to add an additional queue to make sure that the callbacks
were properly sent.  Other problems may still exist where messages are being
lost.  An audit of the lock_gulmd code should be done to identify other
potential problem areas where messages are not being sent properly and are
lost as a result.

Note: fixing this bug may cause a severe performance hit as it might require
adding further robustness to the messaging layer of gulm.
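
One form the extra robustness mentioned in the note could take is an explicit
application-level acknowledgement. The sketch below is a user-space Python
illustration under stated assumptions, not the workaround from bug #160494 and
not gulm code: the name send_with_ack, the one-byte ACK value, and the timeout
are all hypothetical.

```python
import socket

ACK = b"\x06"  # hypothetical one-byte acknowledgement, for illustration only


def send_with_ack(sock: socket.socket, payload: bytes,
                  timeout: float = 0.5) -> bool:
    """Return True only if the peer acknowledges within `timeout` seconds.

    A plain send that succeeds is not treated as delivery; the caller
    gets False if no acknowledgement arrives or the socket fails.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(payload)
        # Block until the peer answers, the timeout expires,
        # or the connection is closed (recv returns b"").
        return sock.recv(1) == ACK
    except OSError:
        # Covers timeouts as well as connection errors.
        return False


if __name__ == "__main__":
    local, silent_peer = socket.socketpair()
    # The peer never answers, so the send is reported as failed.
    print(send_with_ack(local, b"callback", timeout=0.2))
    local.close()
    silent_peer.close()
```

Waiting for an acknowledgement adds at least one round trip per message, which
is the kind of performance cost the note warns about. A real protocol would
also need to match acknowledgements to individual messages, which this sketch
does not do.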

Comment 1 Chris Feist 2006-05-04 20:55:45 UTC
Closing this unless it becomes a huge issue in the future.