Description of problem:
A process using DLM-style locking stays blocked in pthread_join() when it should terminate once the lock has been cancelled and released.

Version-Release number of selected component (if applicable):
dlm-1.0.3-1-x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. Save the attached program lvb2.c.
2. Compile it:
   gcc -L/usr/lib64 -g -D_REENTRANT -o lvb2 lvb2.c -ldlm -lpthread
3. Start lvb2 in one window:
   $ ./lvb2
   sleeping...
   Lock ID is 103d0
   converting to 5
   convert enq succeeded
4. Start a second lvb2 in another window:
   $ ./lvb2
   sleeping...
   Lock ID is 1020d
   converting to 5
5. Press <enter> in the second window:
   $ ./lvb2
   sleeping...
   Lock ID is 1020d
   converting to 5
   unlocking...
   unlocked.

Actual results:
The second lvb2 does not terminate although the lock was cancelled and released. It is blocked in pthread_join().

Expected results:
The blocked thread of the second lvb2 should be unblocked by dlm_unlock_wait, and the program should then terminate.

Additional info:
Waiting on release of source code from customer before I can upload.
There are two things going on here.

The first is that the customer is using the synchronous call dlm_unlock_wait(CANCEL) to cancel the lock. This is wrong: it should be an asynchronous dlm_unlock(CANCEL), so that the cancel AST is delivered to the waiting process and not to the cancelling one.

The second is that making this change exposes a bug in the DLM: the astparam is overwritten by the value passed to dlm_unlock, whereas it should be preserved. With this bug, the waiting routine is passed a bogus parameter and the process segfaults.

I have checked in a patch to the RHEL4 branch to fix this behaviour. It will also need looking into for RHEL5.

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/device.c,v  <--  device.c
new revision: 1.24.2.11; previous revision: 1.24.2.10
done
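To illustrate the astparam problem described above, here is a minimal sketch that models it with plain pthreads. It does not call libdlm and is not the code from lvb2.c or device.c: `struct sim_lock`, `sim_cancel_param()`, `simulate_cancel()` and the `preserve` flag are all hypothetical names invented for this example. The only assumption carried over from the comment is that the waiter registers an AST routine and an astparam when it requests the lock, and that a cancel must deliver that same astparam back to the waiter.

```c
#include <pthread.h>
#include <stddef.h>

/* Context the waiter registers as its astparam (hypothetical model). */
struct waiter_ctx {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             done;
};

/* Stand-in for the per-lock state kept by the lock manager (not libdlm). */
struct sim_lock {
    void (*ast)(void *astparam);
    void  *astparam;              /* registered when the lock was requested */
};

/* Completion AST: wakes the thread that is waiting on the lock. */
static void completion_ast(void *astparam)
{
    struct waiter_ctx *ctx = astparam;

    pthread_mutex_lock(&ctx->mutex);
    ctx->done = 1;
    pthread_cond_signal(&ctx->cond);
    pthread_mutex_unlock(&ctx->mutex);
}

/* The waiting thread blocks until its AST has been delivered. */
static void *waiter_thread(void *arg)
{
    struct waiter_ctx *ctx = arg;

    pthread_mutex_lock(&ctx->mutex);
    while (!ctx->done)
        pthread_cond_wait(&ctx->cond, &ctx->mutex);
    pthread_mutex_unlock(&ctx->mutex);
    return NULL;
}

/* Which parameter does the cancel AST receive?
 * preserve != 0: the one registered by the waiter (fixed behaviour).
 * preserve == 0: the one passed with the unlock call (modelled bug). */
static void *sim_cancel_param(const struct sim_lock *lk,
                              void *unlock_param, int preserve)
{
    return preserve ? lk->astparam : unlock_param;
}

/* Returns 0 if the waiter's AST got its own astparam, -1 if it would
 * have been handed a bogus one, -2 if the thread could not be started. */
int simulate_cancel(int preserve)
{
    struct waiter_ctx ctx;
    struct sim_lock lk;
    pthread_t tid;
    void *param;
    int result = 0;

    pthread_mutex_init(&ctx.mutex, NULL);
    pthread_cond_init(&ctx.cond, NULL);
    ctx.done = 0;
    lk.ast = completion_ast;
    lk.astparam = &ctx;

    if (pthread_create(&tid, NULL, waiter_thread, &ctx) != 0)
        return -2;

    /* The canceller passes NULL as its own parameter. */
    param = sim_cancel_param(&lk, NULL, preserve);
    if (param != lk.astparam) {
        /* A real AST routine would dereference the bogus pointer here.
         * Record the failure and deliver the right one so the demo
         * can still join the waiter and clean up. */
        result = -1;
        param = lk.astparam;
    }
    lk.ast(param);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&ctx.cond);
    pthread_mutex_destroy(&ctx.mutex);
    return result;
}
```

Calling `simulate_cancel(1)` from a small `main` models the patched behaviour and returns 0. `simulate_cancel(0)` models the overwrite and returns -1, the case in which the real waiting routine would receive a bogus parameter. Build with `-lpthread`, as in the compile line from the report.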
RHEL5 bug cloned as bz#318061
This seems to have found its way into 4.6