Description of problem:
At the end of a revolver run, our test suite attempted to stop the clvmd service on one node, causing clvmd on the other nodes to hang. link-13 was running 'service clvmd stop', which never completed. All further attempts to run any cluster lvm command hung. I grabbed the stacks and will attach them to this bug.

Dave took a look at the stacks and said, "I think we'll need to have pjc look at this -- it does look like a dlm problem. clvmd trying to shut down dlm_recvd, but dlm_recvd is blocking on a connection semaphore."

Version-Release number of selected component (if applicable):
2.6.9-55.0.3.ELlargesmp
dlm-kernel-2.6.9-46.16.0.5

How reproducible:
Haven't tried.
Created attachment 161006 [details] link-13 sysreq-t
Created attachment 161007 [details] link-15 sysreq-t
Created attachment 161008 [details] link-16 sysreq-t
Oh good grief, that "security bug" really opened a can of worms, didn't it!

I suspect this is down to the 'othercon' structures being freed while the receive thread is waiting to use them. I can't see any other way of getting into the situation described in the sysreq-t dumps (receive_from_sock waiting for a semaphore that no-one seems to be holding). And the fact that all three nodes are showing the same symptoms reinforces that it's not really some odd race condition.

I'm going to move the close connection logic into the receive_from_sock() code so that it always happens on the same thread. The send queue should be fine because we can manage that ourselves, but incoming stuff is a little more unpredictable.
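To illustrate the idea of doing the close on the receive thread, here is a minimal userspace sketch in Python. It is only an analogy and not the dlm-kernel patch: `Connection`, `ReceiveWorker`, `request_close()` and `demo()` are hypothetical names invented for this example. Other threads never tear a connection down themselves; they queue a close request, and the single receive worker performs it, so a connection cannot be freed underneath a receive that is about to use it.

```python
import queue
import threading


class Connection:
    """Hypothetical stand-in for a per-node connection structure."""

    def __init__(self, nodeid):
        self.nodeid = nodeid
        self.open = True
        self.received = []
        self.closed_by = None  # name of the thread that ran the teardown


class ReceiveWorker:
    """Single thread that owns both receive and close for every connection."""

    def __init__(self):
        self.events = queue.Queue()
        self.thread = threading.Thread(target=self._run, name="recv-worker")

    def start(self):
        self.thread.start()

    def deliver(self, con, data):
        # Incoming data is queued for the receive thread.
        self.events.put(("data", con, data))

    def request_close(self, con):
        # Callers only ask for a close; they never touch the connection state.
        self.events.put(("close", con, None))

    def stop(self):
        self.events.put(("stop", None, None))
        self.thread.join()

    def _run(self):
        while True:
            kind, con, data = self.events.get()
            if kind == "stop":
                return
            if kind == "data":
                # Data arriving for an already-closed connection is dropped.
                if con.open:
                    con.received.append(data)
            elif kind == "close":
                # Teardown happens here, on the same thread that receives.
                if con.open:
                    con.open = False
                    con.closed_by = threading.current_thread().name


def demo():
    worker = ReceiveWorker()
    worker.start()
    con = Connection(nodeid=13)
    worker.deliver(con, b"msg1")
    worker.request_close(con)
    worker.deliver(con, b"late")  # arrives after the close request
    worker.stop()
    return con
```

Because receive and close are serialised through one queue, the late message in `demo()` is simply dropped rather than racing with the teardown. The actual change to receive_from_sock() is in the patches attached to bug 238490.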
Making this a duplicate of 238490, as that bug has all the patches and detail.

*** This bug has been marked as a duplicate of 238490 ***