Description of problem:
If a user program that uses the DLM for locking exits while a lock is in the process of being granted, then a confusion between the local and master nodes can result in the lock never being released.
Version-Release number of selected component (if applicable):
4.5+, but it has almost certainly been present since 4.1
Steps to Reproduce:
1. Set up a 3 node cluster
2. Get a CR lock in the default lockspace on all nodes (this holds the
lockspace open), eg:
# ./dlmtest -mcr -d99999999999999 holding
3. Create a program that requests a lock and then exits immediately (before the lock is granted). I took the lstest program from dlm/test/usertest/lstest.c and put an exit(0) call at line 287
4. Run this repeatedly on all 3 nodes:
# while [ 1 ]; do ./lstest -d1 -ldefault; sleep 1; done
After a short while, a lock will become jammed, and examining /proc/cluster/dlm_locks will show lots and lots of waiting locks, all stuck behind a lock that was granted to a process that has long since exited.
All locks should be cleared when the program exits.
There will also be lots of messages like this in dmesg:
dlm: default: (6238) dlm_unlock: 10223 busy 1
dlm: default: (6240) dlm_unlock: 10094 busy 1
I have a patch for RHEL4.5 (this bug was encountered at Vodafone) which I will port to 4.8 and post here.
There are really two, possibly three, problems here. The first is the realisation that an unlock can return EINVAL if the lock is in the wrong state for unlocking. The device unlock code doesn't handle that; it assumes that EINVAL means the caller got something wrong, and so gives up on the attempt to clear the lock. This can leave locks lying around when the process exits.
The second problem is that the code that handles returned status from unlocks on a remote node also assumes that unlocks cannot fail; there is even a comment to this effect in the code. So if EINVAL is received from the remote node it gets ignored, and the local copy of the lock is removed from its queue when it shouldn't be.
Thirdly, and this is only debatably a problem: how does a lock get into a state where an unlock can return EINVAL in the first place? This is basically a race where a cancel request and a grant cross on the network, so that the master node thinks the lock is granted while the local node doesn't. It's actually even more complicated than that, but as a description it'll do.
Created attachment 422240 [details]
Patch to fix
The RHEL4.5 equivalent of this patch works for me but I have not yet heard back from the customer.
This patch is for the RHEL4 branch of git; I know it compiles, but I haven't tested it on RHEL4.8+ yet.
Recent related work has been happening on bug 645531; this one should probably be closed as a duplicate of that.
If someone has a problem with this, they should be able to work around it using the option from bug 645531.
*** This bug has been marked as a duplicate of bug 645531 ***