Bug 244343

Summary: GFS2 - NFS posix lock issues
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: Wendy Cheng <nobody+wcheng>
Assignee: David Teigland <teigland>
QA Contact: GFS Bugs <gfs-bugs>
CC: cluster-maint, konradr, potluri, rkenna, staubach, steved, swhiteho
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2007-12-12 17:29:36 UTC
Bug Blocks: 204760
Attachments:
draft patch for lock_nolock hang

Description Wendy Cheng 2007-06-15 07:42:28 UTC
Description of problem:
These are currently upstream issues but will affect RHEL 5 eventually. Opening
this bugzilla for tracking purposes.

Issue 1 (lock_nolock):
The implementation hangs NFS lockd in the gfs2_lock -> posix_lock_file_wait
path. This can be recreated by a parent-child pair racing for the same
file lock from the very same NFS client node (e.g., using connectathon
lock test #7). As soon as one of them obtains the lock (and before it
releases it), the next request, carried out by the single-threaded
server-side lockd, hangs in posix_lock_file_wait forever.
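
For reference, the racing pattern boils down to something like this minimal
sketch (run from an NFS client against the exported GFS2 filesystem; the path
below is illustrative - the real reproducer is cthon04 lock test #7):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 9 };
        int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        fcntl(fd, F_SETLKW, &fl);          /* parent obtains the lock */

        if (fork() == 0) {
                /* child blocks on the same range; on the server this
                 * request is serviced by the single-threaded lockd and,
                 * with the lock_nolock bug, never returns from
                 * posix_lock_file_wait() */
                fcntl(fd, F_SETLKW, &fl);
                _exit(0);
        }
        sleep(1);                          /* give the child time to block */
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);           /* parent releases; child should
                                            * now be granted the lock */
        wait(NULL);
        return 0;
}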
 
Issue 2 (dlm_lock):
A test case similar to issue 1 (both parent and child from the same NFS
client racing for the same file lock) ends up corrupting the application
file, because lockd mistakenly believes gfs_controld has granted the second
lock request before the first one is released.

Version-Release number of selected component (if applicable):
2.6.22 rc with Steve's git tree as of June 11

How reproducible:
issue 1 - every time
issue 2 - depends on the fork/execute sequence; easily recreated
          from a Solaris NFS client machine.

Steps to Reproduce:
Run NFS connectathon lock test #7.
  
Actual results:
* Issue 1 hang
* Issue 2 data corruption

Additional info:

Comment 2 Wendy Cheng 2007-06-15 07:53:53 UTC
Created attachment 157068 [details]
draft patch for lock_nolock hang

Comment 3 Wendy Cheng 2007-06-15 08:03:55 UTC
For the dlm_lock data corruption problem, the faulty path can be demonstrated
by the following execution sequences:

1) good run (mostly from RHEL NFS clients) - parent releases the lock before
   child gets the lock:

Test 7 - Test parent/child mutual exclusion.
        Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
        Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
        Parent: Now free child to run, should block on lock.
        Parent: Check data in file to insure child blocked.
        Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
        Parent: 7.1  - COMPARE [             ffc,               7] PASSED.
        Parent: Now unlock region so child will unblock.
        Parent: 7.2  - F_ULOCK [             ffc,               9] PASSED.
        Child:  7.3  - F_LOCK  [             ffc,               9] PASSED.
        Child:  Write child's version of the data and release lock.
        Parent: Now try to regain lock, parent should block.
        Child:  Wrote 'bebebebeb' to testfile [ 4092, 9 ].
        Child:  7.4  - F_ULOCK [             ffc,               9] PASSED.
        Parent: 7.5  - F_LOCK  [             ffc,               9] PASSED.
        Parent: Check data in file to insure child unblocked.
        Parent: Read 'bebebebeb' from testfile [ 4092, 9 ].
        Parent: 7.6  - COMPARE [             ffc,               9] PASSED.
        Parent: 7.7  - F_ULOCK [             ffc,               9] PASSED.

2) Bad run (mostly from Solaris client) - child is dispatched before
   parent releases the lock:

Test 7 - Test parent/child mutual exclusion.
        Parent: 7.0  - F_TLOCK [            1ffc,               9] PASSED.
        Parent: Wrote 'aaaa eh' to testfile [ 8188, 7 ].
        Parent: Now free child to run, should block on lock.
        Child:  7.3  - F_LOCK  [            1ffc,               9] PASSED.
        Child:  Write child's version of the data and release lock.
        Child:  Wrote 'bebebebeb' to testfile [ 8188, 9 ].
        Child:  7.4  - F_ULOCK [            1ffc,               9] PASSED.
        Parent: Check data in file to insure child blocked.
        Parent: Read 'aaaa eh' from testfile [ 8188, 7 ].
        Parent: **** Test expected 'aaaa eh', read 'bebebeb'.
        Parent: 7.1  - COMPARE [            1ffc,               7] FAILED!
        Parent: **** Expected equal, returned unequal...
        Parent: **** Probably implementation error.

For this to happen, my guess is that the checking logic inside dev_write 
is not adequate: 

        /* note: the match is on fsid, number and owner only - nothing
         * here distinguishes two different processes (pids) behind the
         * same NFS client, which share one lockd owner */
        spin_lock(&ops_lock);
        list_for_each_entry(op, &recv_list, list) {
                if (op->info.fsid == info.fsid && op->info.number == info.number &&
                    op->info.owner == info.owner) {
                        list_del_init(&op->list);
                        found = 1;
                        op->done = 1;
                        memcpy(&op->info, &info, sizeof(info));
                        break;
                }
        }
        spin_unlock(&ops_lock);

(will continue tomorrow)

Comment 4 David Teigland 2007-06-15 13:50:12 UTC
We used to have lock_kernel/unlock_kernel around posix_lock_file_wait():
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/nolock/Attic/main.c?rev=1.4.6.2.2.2&content-type=text/x-cvsweb-markup&cvsroot=cluster
It seems this was lost when we moved nolock into the upstream kernel.
I don't know if that would help the current problem with nolock or not.


Comment 5 Wendy Cheng 2007-06-15 15:53:22 UTC
The lock_nolock issue is (I think) mostly due to lockd being single-threaded.
The draft patch from comment #2 works fine in the NFS environment (it passes
the cthon lock test), but I haven't tried it under local (direct) gfs lock
access. Alternatively, gfs2_lock under lock_nolock could follow what other
local filesystems (e.g. ext3) do by calling posix_lock_file() (which implies
gfs2_lock becomes a wrapper around posix_lock_file()). The problem with this
is that it would break gfs2 lock_nolock under NFS V4, since the VFS layer
doesn't pass the "conflock" to gfs2_lock. (Maybe this should be considered a
defect in the upstream code - it should pass the "conflock" pointer?) Another
option is simply to remove gfs2_lock from the file operations when the lock
protocol is lock_nolock at mount time (probably the easiest, quickest, and
most fool-proof way, since we simply hand off the issue and let the VFS take
total control of the locking).
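
Purely to illustrate that last option (a rough sketch, not the posted patch;
the helper and fops names below are made up, not existing symbols):

        /* sketch: pick a file_operations table without a ->lock method
         * when the mount uses lock_nolock, so the default VFS posix
         * locking handles everything */
        if (gfs2_mount_is_lock_nolock(sdp))
                inode->i_fop = &gfs2_file_fops_nolock;  /* no .lock */
        else
                inode->i_fop = &gfs2_file_fops;         /* .lock = gfs2_lock */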

We can discuss this later, after we fully understand what goes wrong in the
dlm_lock case. It is better to have an overall view instead of ad-hoc fixes
here and there.

Looking into dlm_lock problem now.

Comment 6 Wendy Cheng 2007-06-15 20:56:40 UTC
This is one of the issues, though I'm not sure how to fix it (since it is in
user space). I think current RHEL 5 has this issue too (?)

In cluster/group/gfs_controld/plock.c, do_get() checks for conflicts through
is_conflict():

        list_for_each_entry(po, &r->locks, list) {
                if (po->nodeid == in->nodeid && po->owner == in->owner)
                        continue;

This won't work in the NFS lockd case, since the owner field is actually the
"host" structure (representing the very same NFS client), so all lock
inquiries from NFS lockd pass this check. Within the kernel, lockd passes an
"nlmsvc_same_owner" function pointer to the VFS's posix_test_lock() call to
avoid this issue. But gfs_controld is in user space!

I guess we could hand-copy the nlmsvc_same_owner kernel code into the
user-mode plock.c? This is really not nice.
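
For illustration only (not a proposed patch), a stricter "skip our own locks"
test in is_conflict() might also require a matching pid, so two processes on
the same NFS client no longer look like a single owner; whether po->pid and
in->pid carry usable values at this point is an assumption:

        list_for_each_entry(po, &r->locks, list) {
                /* sketch: only skip an entry when nodeid, owner AND pid
                 * all match (the pid comparison is the added part) */
                if (po->nodeid == in->nodeid && po->owner == in->owner &&
                    po->pid == in->pid)
                        continue;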

Comment 7 Wendy Cheng 2007-06-15 21:10:04 UTC
OK, really looking at the two bugs just dug out at this moment - they exist
in current RHEL 5, regardless of whether the NFS cluster locking patch is in
or out. In short, this is now a RHEL 5 bugzilla.

Comment 8 RHEL Program Management 2007-06-19 17:05:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Wendy Cheng 2007-06-22 15:19:50 UTC
Have to do bz #243136 first - without correct nfs-gfs2 filehandle code,
NFS locking over gfs2 is meaningless.


Comment 10 Kiersten (Kerri) Anderson 2007-06-25 15:23:10 UTC
Dependent on NFS fixes for posix locking, moving to 5.2

Comment 11 Wendy Cheng 2007-10-18 01:26:50 UTC
Need to start working on this issue. Last time we knew test #7 would fail.
IBM just sent an email saying test #2 now fails too with the 2.6.18-51.el5
kernel. I will give it a try tonight ...

On Mon, Oct 15, 2007 at 01:26:33PM -0700, Marc Eshel wrote:
> Konrad, can you please send Wendy the information she needs to recreate the
> problem.
>
Sure.

With the 2.6.18-51.el5 kernel we ran a test where a GFS and a GFS2
two-node cluster filesystem were exported via NFS. When connecting a
RHEL3 client to it and running the NFS Connectathon on the mounted
share (using NFS v3 udp and tcp), the tests passed successfully.

However, if the same kernel is used with the locking patches, the RHEL3
client NFS Connectathon tests fail on TEST#2. They pass if you use a
RHEL4 or RHEL5 box. The lock patches are available at http://darnok.org/locks

I've also tested this with the upstream gfs2 git tree - exporting a GFS2
filesystem via NFS and running the NFS Connectathon blows it up at TEST#2.

Comment 12 Konrad Rzeszutek 2007-10-18 13:46:11 UTC
(In reply to comment #11)
> Need to start to work on this issue. Last time we knew test #7 would fail.
> IBM just sent an email over saying #2 now fails too with 2.6.18-51.el5 
> kernel. I will give it a try tonight ... 

What I said is that with 2.6.18-51.el5 and the locking patches it fails. If
I use the stock 2.6.18-51.el5 kernel it works fine. If I use the
2.6.23-rc3-gfs tree it fails too.

Comment 13 Wendy Cheng 2007-10-18 19:08:48 UTC
Borrowed a RHEL3 node from Josef. The first run hangs on test #7 ... it looks
more like an NFS client issue. Then I realized the machine has a customized
kernel: 2.4.21-52.EL.bz249237hugemem. Will re-run. Test #2 passes, though...

Test #7 - Test parent/child mutual exclusion.
        Parent: 7.0  - F_TLOCK [     ffc,       9] PASSED.
        Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
        Parent: Now free child to run, should block on lock.
        Parent: Check data in file to insure child blocked.
        Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
        Parent: 7.1  - COMPARE [     ffc,       7] PASSED.
        Parent: Now unlock region so child will unblock.


Comment 14 Wendy Cheng 2007-10-18 19:44:33 UTC
Yep... it fails as follows on -53.el5 with Konrad's NFS locking patches.
Without the NFS locking patches, it works fine. Starting to debug now ...

Test #2 - Try to lock the whole file.
        Parent: 2.0  - F_TLOCK [       0,  ENDING] FAILED!
        Parent: **** Expected success, returned errno=37...
        Parent: **** Probably implementation error.

** PARENT pass 1 results: 9/9 pass, 0/0 warn, 1/1 fail (pass/total).

**  CHILD pass 1 results: 0/0 pass, 0/0 warn, 0/0 fail (pass/total).
lock tests failed
Tests failed, leaving /mnt/dhcp136.perf.redhat.com mounted


Comment 15 Wendy Cheng 2007-10-18 21:44:47 UTC
Comment #14 was an operator error (mine) - I dispatched the command too
soon (before the NFSD grace period ended). On the other hand, we no longer
need Solaris clients to find problems; a RHEL3 client seems able to generate
similar issues:

1) With the lock_nolock protocol, test #7 hangs as before (see this bz's
   problem statement and comment #1). Will retest the patch and ask for
   internal review tomorrow.
2) With the lock_dlm protocol, I didn't get a chance to run test #7 since the
   test suite dies at #2:

Test #2 - Try to lock the whole file.
        Parent: 2.0  - F_TLOCK [       0,  ENDING] PASSED.
        Child:  2.1  - F_TEST  [       0,       1] FAILED!
        Child:  **** Expected EACCES, returned success...
        Child:  **** Probably implementation error.

Server's /var/log/messages file shows:

Oct 18 17:01:09 dhcp136 mountd[3886]: authenticated mount request from
vm8.gsslab.rdu.redhat.com:917 for /server (/server)
Oct 18 17:02:24 dhcp136 gfs_controld[3646]: plock result write err 0 errno 9
 
Will discuss this with dct first thing tomorrow morning.

Comment 16 Wendy Cheng 2007-11-05 17:03:00 UTC

*** This bug has been marked as a duplicate of 232649 ***

Comment 17 Wendy Cheng 2007-11-30 21:43:19 UTC
Breaking this out of 232649 since we need (at least) one separate bugzilla for
the GFS1 and user-mode daemon builds, which are normally one cycle behind the
base kernel build.

* For GFS1 changes - will check into GFS1 CVS using this bugzilla. 
* For gfs_controld changes - will do a summary of gfs_controld (tentative) 
  changes on Monday after this weekend's planned test runs - then transfer
  the issue to Dave. All changes are subject to Dave's review.    


Comment 18 Wendy Cheng 2007-11-30 22:54:12 UTC
* GFS1-RHEL5 changes have been checked into CVS using this bugzilla.
* The GFS(1)-kernel CVS head is broken right now (I'm using 2.6.24-rc2). Need
  to review Fabio's patches and fix the issues accordingly.
* Starting lock_dlm testing now.

Comment 19 Wendy Cheng 2007-12-03 14:30:29 UTC
Weekend lock_dlm test runs didn't go well. I was able to pass test #2 (which
failed previously) by adding pid checking to a few calls in gfs_controld, but
couldn't pass test #7. There are 15 test cases in total in the NFS cthon04
test suite.

Looking into the failures now.

Comment 20 Wendy Cheng 2007-12-03 15:54:43 UTC
This is the failure scenario:

0. On NFS client machine .....
1. Create Parent and child pair.
2. Parent obtains a file lock and writes something to it.
3. Child tries to get the same file lock; once obtained, it writes something.
4. Parent reads the file (to make sure child doesn't write to it).
5. Parent releases the lock.
6. Child reads the file (to make sure it contains parent's data, followed 
   by child data).

The call fails at step (5), where the unlock should succeed but, in the GFS2
lock_dlm case, instead fails with errno=37 (ENOLCK).

The trace shows gfs_controld does successfully release the lock. The issue
seems to be in the kernel portion of the dlm_lock call, gdlm_punlock() - it
looks like it picks up the child's old entry from op->list instead of its own:

Dec  3 04:34:17 dhcp146 kernel: rv=0
Dec  3 04:34:17 dhcp146 last message repeated 5 times
Dec  3 04:34:18 dhcp146 gfs_controld[6633]: plock result write err 0 errno 9
Dec  3 04:34:18 dhcp146 last message repeated 99 times
Dec  3 04:34:19 dhcp146 gfs_controld[6633]: unlock_internal in->pid=4011
Dec  3 04:34:19 dhcp146 gfs_controld[6633]: unlock_internal po->pid=4011
Dec  3 04:34:19 dhcp146 gfs_controld[6633]: unlock_internal return rv=0
Dec  3 04:34:19 dhcp146 gfs_controld[6633]: plock result write err 0 errno 9
Dec  3 04:34:19 dhcp146 kernel: rv=-11
Dec  3 04:34:19 dhcp146 kernel: gfs2_lm_punlock return -11
Dec  3 04:34:19 dhcp146 kernel: rv=-2


Comment 21 David Teigland 2007-12-03 16:06:04 UTC
In dev_write() we have this test to find the matching op:

op->info.fsid == info.fsid &&
op->info.number == info.number &&
op->info.owner == info.owner

Do we need to add something like this to distinguish between multiple
processes on the same nfs client?:

op->info.pid == info.pid
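
In other words (applying that to the dev_write() excerpt in comment #3), the
match would become something like:

        if (op->info.fsid == info.fsid &&
            op->info.number == info.number &&
            op->info.owner == info.owner &&
            op->info.pid == info.pid) {
                ...
        }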


Comment 22 Wendy Cheng 2007-12-03 16:43:09 UTC
That *was* my first attempt last night (adding a pid check to dev_write()),
and it subsequently hung test #2. Let me get more trace data out to see what
goes wrong.

Comment 23 Wendy Cheng 2007-12-04 22:08:36 UTC
The network driver on my RDU server couldn't come up. After a full day of
trying, I'm giving up on it now. I'm in the middle of moving the tests to
salem (the smoke cluster) - building the new kernel could take a while,
though. The following is the issue I saw yesterday...

After the pid check is added to dev_write() ...
0. On nfs client ...
1. Parent places a lock request, gdlm_plock returns -EINPROGRESS.
2. Child tries to get a lock (via test lock), lockd comes in as:
   dhcp146 kernel:  [<ffffffff8851ec7a>] :gfs2:gfs2_lock+0x94/0x110
   dhcp146 kernel:  [<ffffffff88484fde>] :lockd:nlmsvc_testlock+0x22d/0x319
   dhcp146 kernel:  [<ffffffff88488bf0>] :lockd:nlm4svc_proc_test+0x84/0xce
   dhcp146 kernel:  [<ffffffff883ec741>] :sunrpc:svc_process+0x3da/0x712
   dhcp146 kernel:  [<ffffffff8848411c>] :lockd:lockd+0x0/0x274
   dhcp146 kernel:  [<ffffffff884842a6>] :lockd:lockd+0x18a/0x274
   ... [snip]
 
   The dev_write shows the following at this moment:
   Dec  4 09:10:04 dhcp146 kernel: dev_write info.pid = 2349
   Dec  4 09:10:04 dhcp146 kernel: dev_write in list
   Dec  4 09:10:04 dhcp146 kernel: pid not match op=2350 info=2349
   Dec  4 09:10:04 dhcp146 kernel: gdlm dev_write no op 30002 102d7

   The parent pid is 2349, the child pid is 2350. So the child tries to lock
   with op->pid = 2350 (itself), but whoever writes to dev_write uses
   info.pid = 2349 ... This is where I get lost. Dave?
   

Comment 24 Wendy Cheng 2007-12-04 22:15:38 UTC
As a result of the issue described in comment #23, the test program hangs....

Comment 25 Wendy Cheng 2007-12-06 05:15:10 UTC
Worked with Dave today and got the cthon04 lock test *completed*. Dave has
nicely revised the patch into a single lock_dlm kernel patch. Will hand this
issue over to him.

Now we just have to make sure the new patch also works well with Samba and
other direct GFS1/2 plock access.

Comment 26 David Teigland 2007-12-06 15:45:05 UTC
Here's my upstream patch that makes the nfs tests work on gfs:
https://www.redhat.com/archives/cluster-devel/2007-December/msg00027.html

Konrad, I think it would be easiest if you folded that patch into yours
for bz 196318.


Comment 27 Konrad Rzeszutek 2007-12-06 16:18:59 UTC
(In reply to comment #26)
> Here's my upstream patch that makes the nfs tests work on gfs:
> https://www.redhat.com/archives/cluster-devel/2007-December/msg00027.html
> 
> Konrad, I think it would be easiest if you folded that patch into yours
> for bz 196318.
> 

I concur. I will work on that tomorrow (and do the testing tomorrow as well)

Comment 28 Konrad Rzeszutek 2007-12-10 20:51:38 UTC
So far I've tested GFS2 (lock_dlm) exported via NFS, with RHEL3U9, RHEL4U6,
RHEL5U1, Sol9, and Sol10 clients running the NFS Connectathon successfully.
The next stage will be GFS2 (nolock), GFS, and ext3.

Comment 29 Wendy Cheng 2007-12-11 17:05:58 UTC
Cross post to bugzilla 232649:

To support Konrad's bugzilla 196318, I believe the GFS team has completed the
work from our end. There are two patches that have not been posted to
rhkernel-list yet:

1. Patch in comment #19 from this bugzilla for gfs2 lock_nolock protocol.
2. Patch in https://bugzilla.redhat.com/show_bug.cgi?id=244343#c27 for 
   gfs2 lock_dlm protocol.

Comment 30 Wendy Cheng 2007-12-11 17:06:58 UTC
Sorry... (1) should be "Patch in comment #19 of bugzilla 232649".

Comment 31 Peter Staubach 2007-12-11 18:07:02 UTC
Sorry, what does "GFS2 (nolock)" and/or "gfs2 lock_nolock_protocol" mean?

Comment 32 Wendy Cheng 2007-12-11 18:58:56 UTC
When GFS1/2 is used as a single-node (non-clustered) filesystem, it uses the
lock_nolock protocol. That is, gfs1/2 can be used in a non-clustered
environment. Since it doesn't have the cluster locking overhead, it tends to
run faster.

Comment 33 Konrad Rzeszutek 2007-12-11 20:11:09 UTC
I've tested with an NFS-exported share which underneath has a GFS2 (lock_dlm),
GFS2 (nolock), GFS (nolock), GFS (dlm_lock), or ext3 filesystem. That NFS
share was then mounted by RHEL3, RHEL4, RHEL5, Sol9, and Sol10 NFS client
machines and the NFS Connectathon test suite was run. The tests passed without
regression.

I've also tested with the patched server acting as the NFS client, against
NFS targets exported by RHEL3, RHEL4, RHEL5, Sol9, Sol10, and NetApp, with
success.

Comment 34 Konrad Rzeszutek 2007-12-11 20:13:55 UTC
I forgot to mention that the tests were carried out with all of the patches
that Wendy listed in comment #29 and comment #30 of this BZ.

Comment 35 David Teigland 2007-12-12 17:29:36 UTC

*** This bug has been marked as a duplicate of 196318 ***