Bug 468274

Summary: nfs posix locking broken on gfs
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: adas, cluster-maint, edamato, konradr, staubach, steved, swhiteho
Doc Type: Bug Fix
Last Closed: 2008-10-24 21:31:38 UTC

Description David Teigland 2008-10-23 20:22:46 UTC
Description of problem:

As far as I can tell, NFS posix locking on gfs has never worked, either
upstream or in RHEL (support was added in 5.2 per bz 196318.)  Under simple
tests, posix locking over nfs/gfs may appear to work.

The problem is the "owner" of the lock that nfs gives gfs/dlm.  Without
nfs in the picture, this owner is a unique process identifier.  (The
identity of the process holding a lock is a basic and essential property
of posix locks by definition.)  NFS appears to be giving gfs/dlm an
incrementing request counter as the owner value, instead of something
that actually identifies the lock owner.  Without real owner values, the
gfs/dlm posix locking system fails completely.

(I refer to gfs/dlm above because up through 5.2 the cluster posix locking
code lives in the gfs lock_dlm module, fs/gfs2/locking/dlm/plock.c; beginning
in 5.3 it moves into the dlm module, fs/dlm/plock.c.)


Comment 1 David Teigland 2008-10-23 20:31:19 UTC
A simple test to observe the problematic owner values:

On nfs client, repeat lock/unlock on single file several times:
# ./plockuli /mnt/a
1849 fcntl F_WRLCK ... done
press return to continue ... 
1849 fcntl F_UNLCK ... done
press return to continue ... 
1849 fcntl F_WRLCK ... done
press return to continue ... 
1849 fcntl F_UNLCK ... done
press return to continue ... 
1849 fcntl F_WRLCK ... done
press return to continue ... 
1849 fcntl F_UNLCK ... done
press return to continue ... 
1849 fcntl F_WRLCK ... done
press return to continue ... 
1849 fcntl F_UNLCK ... done
press return to continue ... 

On gfs node exporting to client above:
x/y/z in the following messages are nodeid/pid/owner.

# gfs_controld -DP
1224792871 x read plock 208004 LK WR 0-4 1/0/0 w 0
1224792871 x receive plock 208004 LK WR 0-4 1/0/0 w 0
1224792871 plock result write err 0 errno 9
1224792881 x read plock 208004 UN - 0-4 1/0/0 w 0
1224792881 x receive plock 208004 UN - 0-4 1/0/0 w 0
1224792882 x read plock 208004 LK WR 0-4 1/1/1 w 0
1224792882 x receive plock 208004 LK WR 0-4 1/1/1 w 0
1224792882 plock result write err 0 errno 9
1224792884 x read plock 208004 UN - 0-4 1/1/1 w 0
1224792884 x receive plock 208004 UN - 0-4 1/1/1 w 0
1224792884 x read plock 208004 LK WR 0-4 1/2/2 w 0
1224792884 x receive plock 208004 LK WR 0-4 1/2/2 w 0
1224792884 plock result write err 0 errno 9
1224792886 x read plock 208004 UN - 0-4 1/2/2 w 0
1224792886 x receive plock 208004 UN - 0-4 1/2/2 w 0
1224792886 x read plock 208004 LK WR 0-4 1/3/3 w 0
1224792886 x receive plock 208004 LK WR 0-4 1/3/3 w 0
1224792886 plock result write err 0 errno 9
1224792888 x read plock 208004 UN - 0-4 1/3/3 w 0
1224792888 x receive plock 208004 UN - 0-4 1/3/3 w 0

We see that the owner is 0,1,2,3... for each lock request.
(The error messages here are harmless and have been fixed.)

Comment 2 David Teigland 2008-10-23 20:36:13 UTC
Copying an email I've sent to upstream nfs developers.

Without nfs in the picture, gfs posix locks work like this:

on node 1
 process 1234 does fcntl(SETLK, WR) on inode 5
 gfs calls dlm_posix_lock(5, file, fl)
 dlm sets up the op->info structure to define the operation
 info.pid = fl->fl_pid (1234)
 info.owner = fl->fl_owner (123ABC)
 dlm sends the op to dlm_controld in userspace
 dlm_controld distributes to all nodes via openais

on nodes 1 & 2
 dlm_controld adds to lock state table:
 held locks: inode 5, lock WR, pid 1234, owner 123ABC

on node 2
 process 5678 does fcntl(SETLK, WR) on inode 5
 gfs calls dlm_posix_lock(5, file, fl)
 dlm sets up the op->info structure to define the operation
 info.pid = fl->fl_pid (5678)
 info.owner = fl->fl_owner (567DEF)
 dlm sends the op to dlm_controld in userspace
 dlm_controld distributes to all nodes via openais

on nodes 1 & 2
 dlm_controld compares this operation against held locks,
 finds that this lock op conflicts with the existing held lock
 based on comparing the nodeids (1 vs 2) and the info.owner values
 (123ABC vs 567DEF)
 adds this lock op to the waiting list, updated lock state table:
 held locks:    inode 5, lock WR, pid 1234, owner 123ABC
 waiting locks: inode 5, lock WR, pid 5678, owner 567DEF
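
To make the role of the owner value concrete, here's a sketch of the kind of
conflict test dlm_controld has to apply when it compares a new op against the
held-lock table.  The names are made up for illustration; this is not the
actual dlm_controld code:

#include <stdint.h>

/* One entry in the (illustrative) lock state table, carrying the
   identity that was distributed with the op: the sending nodeid plus
   the info.pid/info.owner values described above. */
struct plock_entry {
        uint64_t start, end;    /* byte range, e.g. 0-4 */
        int      exclusive;     /* WR lock? */
        int      nodeid;        /* node the request came from */
        uint32_t pid;           /* info.pid */
        uint64_t owner;         /* info.owner */
};

/* Two entries conflict only if their ranges overlap, at least one is a
   write lock, and they belong to different owners.  If info.owner does
   not really identify the locking process, the same-owner test below
   gives wrong answers and the whole scheme falls apart. */
static int plocks_conflict(const struct plock_entry *a,
                           const struct plock_entry *b)
{
        if (a->nodeid == b->nodeid && a->owner == b->owner)
                return 0;       /* same owner never conflicts with itself */
        if (a->end < b->start || b->end < a->start)
                return 0;       /* ranges do not overlap */
        return a->exclusive || b->exclusive;
}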

Now, bringing nfs into the picture.  The dlm checks whether it's an
nfs lock request, with if (fl->fl_lmops && fl->fl_lmops->fl_grant), and
if so it does
 info.pid = fl->fl_pid;
 info.owner = fl->fl_pid;
presumably because fl->fl_owner is always lockd, which doesn't help us
distinguish the real lock owners.  The assumption is that fl->fl_pid
does distinguish lock owners.
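
In code form, the owner selection just described looks roughly like this.
It's a simplified sketch of the dlm plock setup path, not the verbatim
source of fs/dlm/plock.c (or fs/gfs2/locking/dlm/plock.c before 5.3):

/* Sketch: filling in op->info for a plock request.  Names follow the
   description above; details of the real code differ. */
if (fl->fl_lmops && fl->fl_lmops->fl_grant) {
        /* Request forwarded by lockd on behalf of an nfs client:
           fl->fl_owner identifies lockd itself, so fall back on
           fl->fl_pid for both fields. */
        op->info.pid   = fl->fl_pid;
        op->info.owner = fl->fl_pid;
} else {
        /* Local process: pid and owner identify the locker directly. */
        op->info.pid   = fl->fl_pid;
        op->info.owner = (__u64)(long) fl->fl_owner;
}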

However, when I start process 9876 on an nfs client mounting gfs and do
the first SETLK, dlm_controld sees info.pid and info.owner of 0, i.e.
fl->fl_pid is 0.  Process 9876 unlocks it and does the same thing again;
this time dlm_controld sees info.pid and info.owner of 1, i.e. fl->fl_pid
is 1.  Each client request results in an incremented fl_pid, which of
course doesn't work as a lock owner at all.  I assume fl->fl_pid would
somehow need to be a combination of a client identifier and the pid on the
client to successfully distinguish locks from two processes on two
clients.

On irc, David pointed me to nlmclnt_locks_init_private() and
__nlm_alloc_pid(), and mentioned that there's a client pid that lockd could
be using instead of the counter.  NLM clearly needs to be able to
distinguish locks between different client processes itself, so the dlm
ought to be able to extend that across nodes.

Comment 3 David Teigland 2008-10-24 21:31:38 UTC
Sorry for the alarm; it looks like I wasn't careful enough.  I've done some
more testing with various combinations of overlapping locks and unlocks,
and nfs posix locking on gfs appears to work after all.  NFS does use an
incrementing counter for these pids, but it's incremented "intelligently"
so that it does work as an owner value, and things behave as intended.

(Note that the test case above is a bit special in the sense that no locks from
the client continue to exist between each iteration, which causes the "owner"
for the same program/pid to change each time.)
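
For what it's worth, here's a small illustration (ordinary userspace C, not
the lockd source) of why a per-owner counter like this still works as an
owner value: the counter only advances when a client-side owner with no
outstanding locks appears, so the value stays stable for as long as that
owner holds any lock, and only churns in a test like the one in comment 1
where everything is unlocked between iterations.

#include <stdint.h>
#include <stdlib.h>

/* Illustration only: a pid handed out per lock owner rather than per
   request.  Distinct owners always get distinct values, and an owner
   keeps the same value for as long as it still holds any lock, which is
   all the gfs/dlm side needs. */
struct owner_pid {
        void             *owner;     /* client-side owner cookie */
        uint32_t          pid;       /* value the server sees as fl_pid */
        int               refcount;  /* live locks held by this owner */
        struct owner_pid *next;
};

static struct owner_pid *owner_list;
static uint32_t pid_counter;

/* Called when taking a lock: reuse the owner's pid if it already holds
   locks, otherwise advance the counter for a fresh owner. */
uint32_t owner_pid_get(void *owner)
{
        struct owner_pid *e;

        for (e = owner_list; e; e = e->next) {
                if (e->owner == owner) {
                        e->refcount++;
                        return e->pid;
                }
        }
        e = malloc(sizeof(*e));
        e->owner = owner;
        e->pid = pid_counter++;
        e->refcount = 1;
        e->next = owner_list;
        owner_list = e;
        return e->pid;
}

/* Called when a lock is released: once the owner holds nothing, forget
   it, so the next lock from the same process gets a new pid -- exactly
   the behavior seen in the lock/unlock test in comment 1. */
void owner_pid_put(void *owner)
{
        struct owner_pid **p, *e;

        for (p = &owner_list; (e = *p) != NULL; p = &e->next) {
                if (e->owner == owner) {
                        if (--e->refcount == 0) {
                                *p = e->next;
                                free(e);
                        }
                        return;
                }
        }
}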