168518 – xdr_send() is broken


Summary: xdr_send() is broken

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gulm
Sub Component:
Version:	3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Chris Feist
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-09-16 20:12 UTC by Adam "mantis" Manthei
Modified:	2009-04-16 20:02 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-05-04 20:55:45 UTC
Embargoed:


Description Adam "mantis" Manthei 2005-09-16 20:12:55 UTC

Description of problem:
The xdr_send() function assumes that sock_sendmsg() will report an error if
the peer node has crashed.  gulm relies heavily on this assumption.  For
example, when xdr_send() returns, a callback is assumed to have been sent
successfully and the remote machine is assumed to be processing it.  This is
not the case, however, and as a result it is possible to deadlock a cluster.

Under certain failure conditions, the xdr_send() function will erroneously
report that a message was sent successfully.  One such example is when nodes
are crashed using `reboot -fn`.
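
The underlying behaviour can be illustrated outside of gulm.  A successful
return from a socket send generally means only that the data was accepted
into a socket buffer, not that the peer has read or processed it.  The
user-space C sketch below is an analogy, not gulm code: it uses a local
socket pair rather than a TCP connection to a crashed node, and
send_to_silent_peer() is a hypothetical name introduced for illustration.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of gulm): send msg to a peer that never
 * reads it, and return whatever send() reported.
 */
ssize_t send_to_silent_peer(const char *msg, size_t len)
{
    int fds[2];
    ssize_t sent;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* fds[1] plays the peer: it stays open but never calls recv(). */
    sent = send(fds[0], msg, len, 0);

    close(fds[0]);
    close(fds[1]);
    return sent;
}
```

For a short message, send() reports the full length even though nothing on
the other end ever looks at the data, so the return value alone cannot tell
the caller that a callback is being processed.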

Version-Release number of selected component (if applicable):
GFS-6.0.2.27-0

How reproducible:
Always

Steps to Reproduce:
1. This can be verified quickly by having more than two nodes log into
lock_gulmd at the same time.

2. Kill one of the nodes, wait a few seconds, then kill another node (using
`reboot -fn` in both cases).  

3. Monitor /var/log/messages.  lock_gulmd_core will first send an update
message about the failure of the first node to all the nodes in the cluster,
and the send to the second node, which has also failed, is reported as
successful.

If xdr_send() were working correctly, the send to the failed node would report
an error.
  
Actual results:
The message is sent without an error being reported.

Expected results:
An error should be propagated through xdr_send().

Additional info:
Bug #160494 is affected by this problem.  The workaround for the problem in
bug #160494 was to add an additional queue to make sure that the callbacks
were properly sent.  Other problems may still exist where messages are being
lost.  An audit of the lock_gulmd code should be done to identify other
potential problem areas where messages are not being sent properly and are
being lost as a result.

Note: fixing this bug may cause a severe performance hit as it might require
adding further robustness to the messaging layer of gulm.
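
One general way to make a sender robust against a silent peer is an
application-level acknowledgement with a timeout.  The C sketch below shows
that idea only; it is not the queue added for bug #160494 and not a proposed
gulm patch.  The names send_with_ack() and ack_demo(), the one-byte
acknowledgement, and the 50 ms timeout are all assumptions made for
illustration.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send msg, then wait up to timeout_ms for a one-byte
 * acknowledgement.  Returns 0 if acknowledged; -1 on send failure, timeout,
 * or a closed peer.
 */
int send_with_ack(int fd, const void *msg, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    char ack;

    if (send(fd, msg, len, 0) != (ssize_t)len)
        return -1;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;      /* timeout or poll error: treat as not delivered */

    if (recv(fd, &ack, 1, 0) != 1)
        return -1;      /* peer closed the connection or the read failed */
    return 0;
}

/*
 * Hypothetical single-threaded demo on a local socket pair.  A real peer
 * would reply after reading the message; here the reply is queued up front
 * when peer_acks is non-zero, to avoid needing a second thread.
 */
int ack_demo(int peer_acks)
{
    int fds[2];
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -2;
    if (peer_acks && send(fds[1], "A", 1, 0) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -2;
    }

    rc = send_with_ack(fds[0], "callback", 8, 50);

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

With a silent peer, ack_demo(0) fails after the timeout even though the
underlying send() succeeded; with an acknowledging peer, ack_demo(1)
succeeds.  The extra round trip per message is the kind of cost the note
above warns about.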

Comment 1 Chris Feist 2006-05-04 20:55:45 UTC

Closing this unless it becomes a huge issue in the future.
