Bug 601703

Summary: User process quitting unexpectedly can leave locks hanging around
Product: [Retired] Red Hat Cluster Suite
Component: dlm-kernel
Version: 4
Reporter: Christine Caulfield <ccaulfie>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: cfeist, cluster-maint, edamato, raud
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-07-14 16:10:48 UTC

Attachments:
Patch to fix

Description Christine Caulfield 2010-06-08 13:45:33 UTC
Description of problem:

If a user program that uses the DLM for locking exits while a lock is in the process of being granted, the local and master nodes can end up disagreeing about the lock's state, with the result that the lock is never released.

Version-Release number of selected component (if applicable):
4.5+, but it has almost certainly been present since 4.1

How reproducible:
Easily

Steps to Reproduce:
1. Set up a 3 node cluster
2. Get a CR lock in the default lockspace on all nodes (this holds the
lockspace open), e.g.:
  # ./dlmtest -mcr -d99999999999999 holding

3. Create a program that requests a lock and then exits immediately (before the lock is granted). I took the lstest program from dlm/test/usertest/lstest.c and put an exit(0) call at line 287 (a rough sketch of such a program follows these steps)

4. Run this repeatedly on all 3 nodes:
  # while [ 1 ]; do ./lstest -d1 -ldefault; sleep 1; done
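
For reference, here is a minimal, hedged sketch of the kind of program step 3 describes. It is not the actual lstest.c; the resource name ("testlock"), the build line, and the use of dlm_open_lockspace on the "default" lockspace are assumptions. The point is simply to submit an asynchronous lock request via libdlm and exit before the completion AST can ever be delivered:

  /* Hypothetical reproducer sketch, not the real lstest.c.
   * Build (assumed): gcc -o lockquit lockquit.c -ldlm -lpthread
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <libdlm.h>

  static void ast_routine(void *arg)
  {
          /* Never runs: the process exits before the grant AST arrives. */
  }

  int main(void)
  {
          struct dlm_lksb lksb;
          dlm_lshandle_t ls;
          int rc;

          memset(&lksb, 0, sizeof(lksb));

          ls = dlm_open_lockspace("default");
          if (!ls) {
                  perror("dlm_open_lockspace");
                  return 1;
          }

          /* Ask for a PR lock asynchronously; note that no AST dispatch
           * thread is set up, since the process exits before any AST
           * could be delivered anyway. */
          rc = dlm_ls_lock(ls, LKM_PRMODE, &lksb, 0,
                           "testlock", strlen("testlock"), 0,
                           ast_routine, &lksb, NULL, NULL);
          if (rc) {
                  perror("dlm_ls_lock");
                  return 1;
          }

          /* ...and quit before the grant arrives, leaving the kernel to
           * clean up an in-flight lock on our behalf. */
          exit(0);
  }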


Actual results:
After a short while, a lock becomes jammed, and examining /proc/cluster/dlm_locks shows lots and lots of waiting locks, all stuck behind a lock that was granted to a process that has long since exited.

Expected results:
All locks should be cleared when the program exits.

Additional info:
There will also be lots of messages like this in dmesg:
dlm: default: (6238) dlm_unlock: 10223 busy 1
dlm: default: (6240) dlm_unlock: 10094 busy 1

I have a patch for RHEL4.5 (this bug was encountered at Vodafone) which I will port to 4.8 and post here.

Comment 1 Christine Caulfield 2010-06-08 14:05:48 UTC
There are really two, possibly three, problems here. The first is the realisation that unlocks can return EINVAL if the lock is in the wrong state for unlocking. The device unlock code doesn't handle that: it assumes that EINVAL means the caller got something wrong and so abandons the attempt to clear the lock. This can leave locks lying around when the process exits.
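
Purely as an illustration (the function and structure names below are invented, not the actual dlm-kernel symbols or the attached patch), the cleanup path needs to treat -EINVAL from an unlock as "the lock is mid-transition, deal with it later" rather than "caller error, stop cleaning up":

  /* Hedged sketch only: orphan cleanup when a process closes the device. */
  static void cleanup_lock_on_close(struct lockspace *ls, struct lock *lkb)
  {
          int error = do_unlock(ls, lkb);     /* may fail with -EINVAL */

          if (error == -EINVAL) {
                  /* The lock is in a transient state (e.g. a grant is in
                   * flight from the master); defer and retry rather than
                   * abandoning it, which is what left locks behind. */
                  defer_unlock(ls, lkb);
                  return;
          }
          if (error)
                  log_error(ls, "unlock on close failed: %d", error);
  }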

The second problem is that the code that handles the returned status from unlocks on a remote node also assumes that unlocks cannot fail; there is even a comment to that effect in the code. So if EINVAL is received from the remote node it gets ignored, and the local copy of the lock is removed from its queue when it shouldn't be.
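
Again as a hedged sketch only (names invented, not the actual reply handler), the reply path should only dequeue the local copy once the master has really released the lock:

  /* Hedged sketch only: handling the status in a remote unlock reply. */
  static void process_unlock_reply(struct lockspace *ls, struct lock *lkb,
                                   int reply_status)
  {
          if (reply_status == -EINVAL) {
                  /* The master says the lock wasn't unlockable (it may just
                   * have been granted there). Keep our local copy queued so
                   * the two nodes stay in sync. */
                  requeue_unlock(ls, lkb);
                  return;
          }
          if (reply_status == 0)
                  dequeue_and_free(ls, lkb);
          else
                  log_error(ls, "unlock reply status %d", reply_status);
  }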

Thirdly, and this is only arguably a problem: how does a lock get into a state where an unlock can return EINVAL in the first place? It's basically a race where a cancel request and a grant cross on the network, so that the master node thinks the lock is granted and the local node doesn't. It's actually even more complicated than that, but as a description it'll do.

Comment 2 Christine Caulfield 2010-06-08 15:05:09 UTC
Created attachment 422240 [details]
Patch to fix

The RHEL4.5 equivalent of this patch works for me, but I have not yet heard back from the customer.

This patch is for the RHEL4 branch of git; I know it compiles, but I haven't tested it on RHEL4.8+ yet.

Comment 3 David Teigland 2010-10-26 18:55:21 UTC
Recent related work has been happening on bug 645531.
This one should probably be closed as a dup of that.

Comment 6 David Teigland 2011-07-14 16:10:48 UTC
If someone has a problem with this, they should be able to work around it using the option from bug 645531.

*** This bug has been marked as a duplicate of bug 645531 ***