Bug 800300

Summary: locktests fail in "READ LOCK THE WHOLE FILE BYTE BY BYTE" test case.
Product: [Community] GlusterFS
Reporter: Anush Shetty <ashetty>
Component: locks
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: amarts, gluster-bugs, nsathyan, rabhat, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-11-21 09:52:51 UTC
Type: ---
Regression: ---
Mount Type: fuse
Attachments:
Description: The Python script helps in testing "last unlock based unref'ing" in NLM.
Flags: none

Description Anush Shetty 2012-03-06 09:20:40 UTC
Description of problem: While running locktests, trying to set a write lock on a write lock returned EAGAIN. This was on a distributed-replicate setup with 4 bricks.


How reproducible: Consistently


Steps to Reproduce:
1. ./locktests -n 5 -f /mnt/gluster/file
  
Actual results:

TEST : TRY TO WRITE ON A READ  LOCK:=====
TEST : TRY TO WRITE ON A WRITE LOCK:=====
TEST : TRY TO READ  ON A READ  LOCK:=====
TEST : TRY TO READ  ON A WRITE LOCK:=====
TEST : TRY TO SET A READ  LOCK ON A READ  LOCK:=====
TEST : TRY TO SET A WRITE LOCK ON A WRITE LOCK:Master: can't set lock
: Resource temporarily unavailable
Echec
: Resource temporarily unavailable
Init
process initalization


Expected results:

All tests should pass


Additional info:

Client log-
[2012-03-06 14:38:48.594050] D [afr-transaction.c:1002:afr_post_nonblocking_inodelk_cbk] 0-test2-replicate-1: Non blocking inodelks failed. Proceeding to blocking
[2012-03-06 14:38:48.594110] D [afr-lk-common.c:406:transaction_lk_op] 0-test2-replicate-1: lk op is for a transaction
[2012-03-06 14:38:48.594134] D [afr-lk-common.c:607:afr_unlock_inodelk] 0-test2-replicate-1: attempting data unlock range 0 0 by 280967ffff7f0000
[2012-03-06 14:38:48.594152] D [afr-transaction.c:1002:afr_post_nonblocking_inodelk_cbk] 0-test2-replicate-1: Non blocking inodelks failed. Proceeding to blocking
[2012-03-06 14:38:48.594418] D [afr-lk-common.c:1427:afr_nonblocking_inodelk] 0-test2-replicate-1: attempting data lock range 0 62 by 244267ffff7f0000
[2012-03-06 14:38:48.594597] D [afr-lk-common.c:406:transaction_lk_op] 0-test2-replicate-1: lk op is for a transaction
[2012-03-06 14:38:48.594625] D [afr-lk-common.c:607:afr_unlock_inodelk] 0-test2-replicate-1: attempting data unlock range 0 0 by ff3167ffff7f0000
[2012-03-06 14:38:48.633362] D [fuse-bridge.c:3157:fuse_setlk_cbk] 0-glusterfs-fuse: Returning EAGAIN Flock: start=0, len=0, pid=7926, lk-owner=10ffff320eff775e
[2012-03-06 14:38:48.633650] D [afr-lk-common.c:406:transaction_lk_op] 0-test2-replicate-1: lk op is for a transaction
[2012-03-06 14:38:48.633681] D [afr-lk-common.c:607:afr_unlock_inodelk] 0-test2-replicate-1: attempting data unlock range 0 62 by 244267ffff7f0000
[2012-03-06 14:38:48.633702] D [afr-transaction.c:1002:afr_post_nonblocking_inodelk_cbk] 0-test2-replicate-1: Non blocking inodelks failed. Proceeding to blocking
[2012-03-06 14:38:48.634019] D [afr-lk-common.c:1013:afr_lock_blocking] 0-test2-replicate-1: we're done locking
[2012-03-06 14:38:48.634051] D [afr-transaction.c:982:afr_post_blocking_inodelk_cbk] 0-test2-replicate-1: Blocking inodelks done. Proceeding to FOP

Comment 1 Raghavendra Bhat 2012-03-06 09:28:47 UTC
I think it's expected behavior. Can you turn flush-behind off and try again? It should not give EAGAIN with flush-behind off.

Comment 2 Anush Shetty 2012-03-06 09:39:44 UTC
I tried with flush-behind turned off using volume set, but I still see the same problem.

Comment 3 Vijay Bellur 2012-04-04 07:45:53 UTC
Anush, can you please confirm behavior on qa33?

Comment 4 Anush Shetty 2012-04-04 09:03:23 UTC
Hi Vijay, I still see this issue on qa33.

Comment 5 krishnan parthasarathi 2012-05-10 10:31:14 UTC
locktests doesn't fail while attempting to set a WRITE LOCK on a WRITE LOCK on 17b0814243b4ccd56c0ce570b7f42d5e572e1e71 (git commit-id). It 'fails' in "READ LOCK THE WHOLE FILE BYTE BY BYTE" and its WRITE counterpart. I am changing the synopsis to reflect this. The commit log message explains why the issue is observed.
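
For readers unfamiliar with the test name, the following is a minimal sketch of what a byte-by-byte read-lock pass over a file could look like. It is not the locktests implementation; the function name read_lock_byte_by_byte is hypothetical, and the sketch assumes a POSIX system where Python's fcntl.lockf maps to fcntl() record locks.

```python
import fcntl
import os


def read_lock_byte_by_byte(path):
    """Take a 1-byte shared (read) lock on every byte of the file, then release them.

    Hypothetical illustration only; locktests itself may do this differently.
    Returns the number of bytes that were locked.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(size):
            # LOCK_SH requests a read lock; LOCK_NB makes a conflict raise
            # an OSError (EAGAIN/EACCES) instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB, 1, offset, os.SEEK_SET)
        for offset in range(size):
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, offset, os.SEEK_SET)
    finally:
        os.close(fd)
    return size
```

A pass like this issues one lock request per byte, so it generates many small lock and unlock calls against the same fd.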

Comment 6 Anand Avati 2012-05-10 22:26:45 UTC
CHANGE: http://review.gluster.com/3306 (locks: Set flock.l_type on successful F_UNLCK to signal last unlock on fd.) merged in master by Anand Avati (avati)

Comment 7 krishnan parthasarathi 2012-05-11 09:33:33 UTC
Created attachment 583779
The Python script helps in testing "last unlock based unref'ing" in NLM.

./simplepy /path/to/file

The script takes two write locks (fcntl) on 'adjacent' regions of a single fd and then unlocks them. To verify that flock.l_type is set to F_UNLCK on the last unlock on the fd, and to F_RDLCK otherwise, one must attach gdb to one of the 'lock servers' (bricks) and set a breakpoint on pl_lk.
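
The attachment itself is not reproduced here. As a rough sketch of the sequence described above, assuming Python's fcntl.lockf on a POSIX system, a script of this kind might look like the following. The function name lock_adjacent_and_unlock and the region length are hypothetical and are not taken from the attached script.

```python
import fcntl
import os
import sys


def lock_adjacent_and_unlock(path, region_len=10):
    """Take two write locks on adjacent byte ranges of one fd, then unlock both.

    Hypothetical sketch; the attached script may differ in detail.
    Returns the list of (operation, start, length) steps performed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    steps = []
    try:
        # Write lock on bytes [0, region_len).
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, region_len, 0, os.SEEK_SET)
        steps.append(("lock", 0, region_len))
        # Write lock on the adjacent range [region_len, 2 * region_len).
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, region_len, region_len, os.SEEK_SET)
        steps.append(("lock", region_len, region_len))
        # First unlock: the second range is still held by this fd.
        fcntl.lockf(fd, fcntl.LOCK_UN, region_len, 0, os.SEEK_SET)
        steps.append(("unlock", 0, region_len))
        # Second unlock: no locks remain on the fd afterwards, so this is
        # the unlock the locks xlator would be expected to treat as the last one.
        fcntl.lockf(fd, fcntl.LOCK_UN, region_len, region_len, os.SEEK_SET)
        steps.append(("unlock", region_len, region_len))
    finally:
        os.close(fd)
    return steps


if __name__ == "__main__" and len(sys.argv) > 1:
    for step in lock_adjacent_and_unlock(sys.argv[1]):
        print(*step)
```

The script only drives the client side. Whether flock.l_type carries F_UNLCK or F_RDLCK still has to be checked on the brick with gdb, as described above.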

Comment 8 Anush Shetty 2012-05-18 13:31:19 UTC
This is still seen on a fuse mount (latest commit id: 0cdb1d147afd644153855f6557bf7e809e5444f0). I had filed this bug for the fuse mount.

Comment 9 krishnan parthasarathi 2012-05-20 05:36:31 UTC
(In reply to comment #8)
> This is still seen on a fuse mount (latest commit id:
> 0cdb1d147afd644153855f6557bf7e809e5444f0). I had filed this bug for the fuse
> mount.

Anush,
The fix is agnostic to the type of mount. The mention of NLM in the bug and commit log is because of what NLM expects of the locks xlator for its smooth functioning.
I am unable to recreate the issue on the commit-id you've mentioned. It would be better if you could provide access to the setup where you are seeing this issue. It would also help if you could attach the output of locktests, which shows which test(s) in the collection failed.

Comment 10 krishnan parthasarathi 2012-05-29 10:16:29 UTC
Unable to recreate this issue in a couple of different setups. Removing target milestone 3.3.0, since it is not reproducible consistently.

Comment 11 Amar Tumballi 2012-11-21 09:52:51 UTC
Closing it as WORKSFORME, since we could not hit it again in any of our unit testing. If it is seen again, please re-open.