
Bug 306391

Summary: Not able to unblock/cancel a thread waiting on a lock from another thread.
Product: [Retired] Red Hat Cluster Suite
Reporter: Norm Murray <nmurray>
Component: dlm-kernel
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 4
CC: ccaulfie, cluster-maint, tao
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-04-24 14:45:58 UTC

Description Wade Mealing 2007-09-26 04:57:43 UTC
Description of problem:

When using DLM locking, the process stays blocked in pthread_join(). It should
terminate once the lock has been cancelled and released.


Version-Release number of selected component (if applicable):

dlm-1.0.3-1-x86_64

How reproducible:

Every time.

Steps to Reproduce:

1. save the attached program lvb2.c

2. compile it:
gcc -L/usr/lib64 -g -D_REENTRANT -o lvb2 lvb2.c -ldlm -lpthread

3. start lvb2 in one window:
$ ./lvb2
sleeping...
Lock ID is 103d0
converting to 5
convert enq succeeded

4. start a second lvb2 in another window:
$ ./lvb2
sleeping...
Lock ID is 1020d
converting to 5

5. press <enter> in the second window:
$ ./lvb2
sleeping...
Lock ID is 1020d
converting to 5

unlocking...
unlocked.
  

Actual results:
The second lvb2 does not terminate although the lock was cancelled and released.
lvb2 is actually blocked on pthread_join().

Expected results:
The blocked thread of the second lvb2 should be unblocked by
dlm_unlock_wait, and the program should then terminate.

Additional info:

Waiting on release of source code from customer before I can upload.

Comment 1 Christine Caulfield 2007-09-26 13:57:02 UTC
There are two things going on here. The first is that the customer is using the
synchronous call dlm_unlock_wait(CANCEL) to cancel the lock. This is wrong; it
should be an asynchronous dlm_unlock(CANCEL), so that the cancel AST is delivered
to the waiting process and not to the cancelling one.
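
To illustrate why the delivery target matters, here is a minimal sketch using plain pthreads only. It does not call libdlm. The names pending_req, completion_ast, cancel_async and run_cancel_demo are hypothetical stand-ins, and ECANCELED is assumed as the status a cancelled request ends with. The point it models: the waiter only wakes up (and pthread_join() only returns) if the completion is delivered against the waiter's own request.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for one pending lock request (not the libdlm lksb). */
struct pending_req {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int done;    /* set once the completion (AST-like) callback has run */
    int status;  /* 0 = granted, ECANCELED = request was cancelled */
};

/* Completion callback for the waiter's request; wakes the waiter. */
static void completion_ast(struct pending_req *req, int status)
{
    pthread_mutex_lock(&req->mu);
    req->status = status;
    req->done = 1;
    pthread_cond_signal(&req->cv);
    pthread_mutex_unlock(&req->mu);
}

/* Waiter thread: blocks until its own request completes. */
static void *waiter(void *arg)
{
    struct pending_req *req = arg;

    pthread_mutex_lock(&req->mu);
    while (!req->done)
        pthread_cond_wait(&req->cv, &req->mu);
    pthread_mutex_unlock(&req->mu);
    return NULL;
}

/* Asynchronous cancel: the canceller does not wait for a completion of
 * its own; the completion is delivered against the waiter's request. */
static void cancel_async(struct pending_req *req)
{
    completion_ast(req, ECANCELED);
}

/* Runs the scenario; returns the status the waiter's request ended
 * with, or -1 if the thread could not be created. */
int run_cancel_demo(void)
{
    struct pending_req req;
    pthread_t tid;
    int rc = -1;

    pthread_mutex_init(&req.mu, NULL);
    pthread_cond_init(&req.cv, NULL);
    req.done = 0;
    req.status = 0;

    if (pthread_create(&tid, NULL, waiter, &req) == 0) {
        cancel_async(&req);
        pthread_join(tid, NULL);  /* returns because the waiter was woken */
        rc = req.status;
    }

    pthread_cond_destroy(&req.cv);
    pthread_mutex_destroy(&req.mu);
    return rc;
}
```

If the completion went to the canceller instead, nothing would ever set done for the waiter, which matches the hang in pthread_join() described in this report.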

This exposes a bug in the DLM where the astparam is overwritten by the value
passed to dlm_unlock, whereas it should be preserved. With this bug, the waiting
routine gets passed a bogus parameter and the process segfaults.
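
The astparam problem can be modelled in a few lines. This is only an illustration and not the device.c change: lock_entry, cancel_lock and param_seen_after_cancel are hypothetical names, and the preserve flag exists purely to contrast the two behaviours.

```c
#include <stddef.h>

/* Hypothetical per-lock record; field names are illustrative only. */
struct lock_entry {
    void (*ast)(void *astparam);  /* completion routine of the waiter */
    void *astparam;               /* supplied when the lock was requested */
};

/* Records the parameter the waiter's AST routine was called with. */
static void *seen_param;

static void waiter_ast(void *astparam)
{
    seen_param = astparam;
}

/* Cancels a pending request and fires its AST.
 * preserve == 0 models the bug: the parameter given to the unlock call
 * replaces the one stored at request time.
 * preserve != 0 models the intended behaviour: the stored one is kept. */
static void cancel_lock(struct lock_entry *lk, void *unlock_param, int preserve)
{
    if (!preserve)
        lk->astparam = unlock_param;
    lk->ast(lk->astparam);
}

/* Returns the pointer the waiter's AST routine actually received. */
void *param_seen_after_cancel(void *waiter_ctx, void *unlock_param, int preserve)
{
    struct lock_entry lk = { waiter_ast, waiter_ctx };

    seen_param = NULL;
    cancel_lock(&lk, unlock_param, preserve);
    return seen_param;
}
```

With preserve set, the waiter gets back the context it registered. Without it, the waiter receives whatever the cancelling thread passed, and dereferencing that as its own context is the kind of access that ends in a segfault.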

I have checked in a patch to the RHEL4 branch to fix this behaviour. It will
also need looking into for RHEL5.

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/device.c,v  <--  device.c
new revision: 1.24.2.11; previous revision: 1.24.2.10
done


Comment 4 Christine Caulfield 2007-10-04 10:59:24 UTC
RHEL5 bug cloned as bz#318061

Comment 5 Christine Caulfield 2007-10-04 12:29:29 UTC
This seems to have found its way into 4.6.