Red Hat Bugzilla – Bug 168518
xdr_send() is broken
Last modified: 2009-04-16 16:02:19 EDT
Description of problem:
The xdr_send() function assumes that sock_sendmsg() will report an error if a
node crashes. gulm relies heavily on this assumption. For example, when
xdr_send() returns, callbacks are assumed to have been sent successfully and
the receiving machine is assumed to be processing them. This is not the case,
however, and as a result it is possible to deadlock a cluster.
Under certain failure conditions, the xdr_send() function will erroneously
report that a message was sent successfully. One such example is when nodes are
crashed using `reboot -fn`.
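The general behaviour can be illustrated in user space. The following is a minimal sketch, not gulm code: it assumes a local socketpair() is an acceptable stand-in for the TCP connection, and the function name send_to_silent_peer() is hypothetical. The peer end stays open but never reads, much like a node that has stopped without closing its connection. send() still reports that every byte was sent, because a successful return only means the kernel queued the data.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the value send() reports when the peer
 * never reads the data. The peer end of the socketpair stays open but
 * idle, standing in for a node that stopped without closing its socket.
 */
ssize_t send_to_silent_peer(void)
{
    int fds[2];
    const char msg[] = "callback";   /* 8 characters plus the NUL byte */
    ssize_t sent;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* fds[1] never calls recv(); the kernel queues the bytes anyway. */
    sent = send(fds[0], msg, sizeof(msg), 0);

    close(fds[0]);
    close(fds[1]);
    return sent;
}
```

Calling send_to_silent_peer() returns the full message length rather than an error, even though nothing ever consumed the message.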
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. This can quickly be verified by having more than two nodes log into
lock_gulmd at the same time.
2. Kill one of the nodes, wait a few seconds, then kill another node (using
`reboot -fn` in both cases).
3. Monitor /var/log/messages. lock_gulmd_core first sends an update message
about the first node's failure to all the nodes in the cluster, and the send to
the second failed node is reported as successful.
If xdr_send() were working correctly, this send would report an error.
Actual results:
The message is sent without reporting an error.
Expected results:
An error should be propagated through xdr_send().
Additional info:
Bug #160494 is affected by this problem. The workaround for the problem in bug
#160494 was to add an additional queue to make sure that the callbacks were
properly sent. Other problems may still exist where messages are being lost.
An audit of the lock_gulmd code should be done to identify other potential
problem areas where messages are not being sent properly and as a result being
lost.
Note: fixing this bug may cause a severe performance hit as it might require
adding further robustness to the messaging layer of gulm.
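One common way to add that kind of robustness is an application-level acknowledgement with a timeout. The sketch below is only an illustration under stated assumptions and is not how gulm is implemented: ACK_BYTE, send_with_ack() and ack_demo() are hypothetical names, the one-byte ack protocol is invented for the example, and a socketpair() again stands in for the real connection. It also shows where the performance cost comes from, since every message now waits for a round trip.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define ACK_BYTE 0x06  /* hypothetical one-byte acknowledgement */

/*
 * Send a message, then wait up to timeout_ms for the peer to acknowledge it.
 * Returns 0 if the ack arrived, -1 on a send error, short send, timeout,
 * or closed peer.
 */
int send_with_ack(int fd, const void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    unsigned char ack;

    if (send(fd, buf, len, 0) != (ssize_t)len)
        return -1;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;      /* timeout or poll error: treat the peer as failed */

    if (recv(fd, &ack, 1, 0) != 1 || ack != ACK_BYTE)
        return -1;      /* peer closed or sent something unexpected */

    return 0;
}

/* Demo: returns 1 if a silent peer is rejected and an acking peer accepted. */
int ack_demo(void)
{
    int fds[2];
    const unsigned char ack = ACK_BYTE;
    int silent, acked;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return 0;

    /* Peer never answers: the call fails after the 100 ms timeout. */
    silent = send_with_ack(fds[0], "update", 6, 100);

    /* Simplification: the peer's ack is queued before the second send. */
    send(fds[1], &ack, 1, 0);
    acked = send_with_ack(fds[0], "update", 6, 100);

    close(fds[0]);
    close(fds[1]);
    return silent == -1 && acked == 0;
}
```

With this shape, a node that has crashed without closing its connection shows up as a timeout error to the caller instead of a silent success.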
Closing this unless it becomes a huge issue in the future.