Description of problem:
The xdr_send() function assumes that sock_sendmsg() will report an error if the receiving node has crashed. gulm relies heavily on this assumption. For example, when xdr_send() returns, callbacks are assumed to have been sent successfully and the remote machine is assumed to be processing them. This is not the case, however, and as a result it is possible to deadlock a cluster. Under certain failure conditions, xdr_send() will erroneously report that a message was sent successfully. One such example is when nodes are crashed using `reboot -fn`.

Version-Release number of selected component (if applicable):
GFS-6.0.2.27-0

How reproducible:
Always

Steps to Reproduce:
1. This can be verified quickly by having more than two nodes log into lock_gulmd at the same time.
2. Kill one of the nodes, wait a few seconds, then kill another node (using `reboot -fn` in both cases).
3. Monitor /var/log/messages. lock_gulmd_core will first send an update message about the failure of the first node to all the nodes in the cluster, and the log shows that message being sent successfully to the second node that failed. If xdr_send() were working correctly, this send would report an error.

Actual results:
The message is sent without reporting an error.

Expected results:
An error should be propagated through xdr_send().

Additional info:
Bug #160494 is affected by this problem. The workaround for the problem in bug #160494 was to add an additional queue to make sure that the callbacks were properly sent. Other problems may still exist where messages are being lost. An audit of the lock_gulmd code should be done to identify other areas where messages are not properly sent and are lost as a result.

Note: fixing this bug may cause a severe performance hit, as it might require adding further robustness to the messaging layer of gulm.
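
The behaviour described above can be illustrated outside of gulm. The following is a minimal user-space sketch in Python, not gulm code: a `socket.socketpair()` stands in for the connection between lock_gulmd and a client, and a peer that never reads stands in for a node killed with `reboot -fn`. The names `send_with_ack`, `live_peer`, `ACK` and `MSG` are hypothetical and exist only for this illustration. It shows that a successful send only means the data was accepted into a local buffer, and that an application-level acknowledgement with a timeout is one possible way to detect a peer that is no longer processing messages.

```python
import socket
import threading

ACK = b"\x06"          # hypothetical one-byte acknowledgement
MSG = b"callback"      # hypothetical message payload


def send_with_ack(sock, payload, timeout=0.5):
    """Send payload, then wait up to `timeout` seconds for an ACK byte.

    Returns True only if the peer acknowledged the message.
    """
    sock.sendall(payload)
    sock.settimeout(timeout)
    try:
        return sock.recv(1) == ACK
    except socket.timeout:
        return False
    finally:
        sock.settimeout(None)


def live_peer(sock, nbytes):
    """Simulate a healthy node: read the whole message, then acknowledge."""
    data = b""
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            return
        data += chunk
    sock.sendall(ACK)


def demo():
    # Hung peer: the other end stays open but never reads anything.
    a, hung = socket.socketpair()
    plain_sent = a.send(MSG)            # succeeds: data sits in a buffer
    hung_acked = send_with_ack(a, MSG)  # no ACK arrives before the timeout

    # Healthy peer: a thread reads the message and acknowledges it.
    b, healthy = socket.socketpair()
    t = threading.Thread(target=live_peer, args=(healthy, len(MSG)))
    t.start()
    live_acked = send_with_ack(b, MSG, timeout=2.0)
    t.join()

    for s in (a, hung, b, healthy):
        s.close()
    return plain_sent, hung_acked, live_acked


if __name__ == "__main__":
    print(demo())
```

The extra round trip per message is the kind of cost the performance note in this report refers to; whether an acknowledgement scheme, the additional queue used for bug #160494, or some other mechanism is appropriate for gulm is left open here.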
Closing this unless it becomes a huge issue in the future.