306391 – Not able to unblock/cancel a thread waiting on a lock from another thread.

Summary: Not able to unblock/cancel a thread waiting on a lock from another thread.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm-kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-09-26 04:57 UTC by Norm Murray
Modified:	2018-10-19 22:23 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-04-24 14:45:58 UTC
Embargoed:

Description Wade Mealing 2007-09-26 04:57:43 UTC

Description of problem:

A process using DLM-style locking stays blocked in pthread_join(). It should
terminate once the lock has been cancelled and released.


Version-Release number of selected component (if applicable):

dlm-1.0.3-1-x86_64

How reproducible:

Every time.

Steps to Reproduce:

1. save the attached program lvb2.c

2. compile it:
gcc -L/usr/lib64 -g -D_REENTRANT -o lvb2 lvb2.c -ldlm -lpthread

3. start lvb2 in one window:
$ ./lvb2
sleeping...
Lock ID is 103d0
converting to 5
convert enq succeeded

4. start a second lvb2 in another window:
$ ./lvb2
sleeping...
Lock ID is 1020d
converting to 5

5. press <enter> in the second window:
$ ./lvb2
sleeping...
Lock ID is 1020d
converting to 5

unlocking...
unlocked.
  

Actual results:
The second lvb2 does not terminate even though the lock was cancelled and
released. It remains blocked in pthread_join().

Expected results:
The blocked thread in the second lvb2 should be unblocked by
dlm_unlock_wait, and the program should then terminate.

Additional info:

Waiting on release of source code from customer before I can upload.

Comment 1 Christine Caulfield 2007-09-26 13:57:02 UTC

There are two things going on here. The first is that the customer is using the
synchronous call dlm_unlock_wait(CANCEL) to cancel the lock. This is wrong; it
should be an asynchronous dlm_unlock(CANCEL), so that the cancel AST is delivered
to the waiting process, not to the cancelling one.

Making that change exposes a bug in the DLM: the astparam is overwritten by the
value passed to dlm_unlock, whereas it should be preserved. With this bug, the
waiting routine is passed a bogus parameter and the process segfaults.
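
To illustrate the intended behaviour, here is a rough model using plain
pthreads only. It is not libdlm or kernel code: struct lock_request,
struct waiter, wake_waiter_ast, cancel_request and run_cancel_scenario are all
made-up names for this sketch, and the "AST" here is simply a callback run by
the cancelling thread. What it tries to show is that the cancel returns to the
caller without blocking, and that the callback receives the parameter stored
with the original request rather than the one passed to the cancel call, so
the waiter is woken and pthread_join() returns.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-waiter state; the AST parameter points at this. */
struct waiter {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             done;   /* set to 1 by the completion callback */
};

/* Hypothetical stand-in for a queued lock request (not a libdlm type). */
struct lock_request {
    void (*ast)(void *astparam);  /* callback supplied when the lock was requested */
    void  *astparam;              /* parameter supplied when the lock was requested */
};

/* Completion callback: wakes the thread that is waiting on the request. */
static void wake_waiter_ast(void *astparam)
{
    struct waiter *w = astparam;

    pthread_mutex_lock(&w->mutex);
    w->done = 1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}

/*
 * Model of the cancel. The request's own astparam is preserved;
 * the value passed by the canceller is deliberately NOT used for the
 * waiter's callback (overwriting it is the bug described above).
 */
static void cancel_request(struct lock_request *req, void *unlock_param)
{
    (void)unlock_param;
    req->ast(req->astparam);
}

/* The thread that blocks until its request completes or is cancelled. */
static void *waiting_thread(void *arg)
{
    struct waiter *w = arg;

    pthread_mutex_lock(&w->mutex);
    while (!w->done)
        pthread_cond_wait(&w->cond, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
    return NULL;
}

/* Runs the scenario once; returns 1 if the waiter was woken and joined. */
static int run_cancel_scenario(void)
{
    struct waiter w;
    struct lock_request req = { wake_waiter_ast, &w };
    pthread_t tid;
    int unrelated = 0;  /* what the canceller passes; must not reach the waiter */
    int rc, result;

    pthread_mutex_init(&w.mutex, NULL);
    pthread_cond_init(&w.cond, NULL);
    w.done = 0;

    rc = pthread_create(&tid, NULL, waiting_thread, &w);
    assert(rc == 0);

    cancel_request(&req, &unrelated);  /* returns immediately */
    pthread_join(tid, NULL);           /* no longer blocks forever */

    result = w.done;
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.mutex);
    return result;
}
```

Calling run_cancel_scenario() from a main() and linking with -lpthread should
return 1. If cancel_request instead passed unlock_param to the callback, the
callback would treat an int as a struct waiter, which is roughly the kind of
bogus parameter that leads to the segfault.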

I have checked in a patch to the RHEL4 branch to fix this behaviour. It will
also need looking into for RHEL5.

Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/device.c,v  <--  device.c
new revision: 1.24.2.11; previous revision: 1.24.2.10
done

Comment 4 Christine Caulfield 2007-10-04 10:59:24 UTC

RHEL5 bug cloned as bz#318061

Comment 5 Christine Caulfield 2007-10-04 12:29:29 UTC

This seems to have found its way into 4.6
