Bug 763263 (GLUSTER-1531)

Summary: FINODELK bailing out
Product: [Community] GlusterFS Reporter: Anush Shetty <anush>
Component: rdmaAssignee: Raghavendra G <raghavendra>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: low    
Version: mainlineCC: gluster-bugs, raghavendra, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTNR Mount Type: fuse
Documentation: DNR CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Attachments:
Description Flags
process state dump none

Description Anush Shetty 2010-09-06 02:14:17 EDT
Untaring of kernel source hung due to frame bail out. This was on a 2x3 DHT+AFR setup. 

[2010-09-06 06:09:12.788157] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.791385] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.792721] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.797044] D [dht-diskusage.c:257:dht_free_disk_available_subvol] 0532_06_09-dht: no subvolume has enough free space to create
[2010-09-06 06:09:12.797284] D [afr-transaction.c:960:afr_post_nonblocking_entrylk_cbk] 0532_06_09-replicate-0: Non blocking entrylks done. Proceeding to FOP
[2010-09-06 06:09:12.798712] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.799193] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.800034] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.800287] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.810507] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810769] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810787] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:12.811115] D [read-ahead.c:463:ra_readv] 0532_06_09-read-ahead: unexpected offset (0 != 262144) resetting
[2010-09-06 06:09:12.811466] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854320] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854385] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854447] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854748] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854767] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:15.149463] E [rpc-clnt.c:196:call_bail] 0532_06_09-client-2: bailing out frame type(GlusterFS 3.1) op(FINODELK(30)) sent = 2010-09-06 05:39:07. timeout = 1800
[2010-09-06 06:09:15.149532] D [afr-lk-common.c:983:afr_lock_blocking] 0532_06_09-replicate-0: we're done locking
[2010-09-06 06:09:15.149550] D [afr-transaction.c:887:afr_post_blocking_inodelk_cbk] 0532_06_09-replicate-0: Blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:15.149573] D [client3_1-fops.c:3505:client3_1_flush] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.149949] D [client3_1-fops.c:4148:client3_1_fxattrop] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.150122] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:15.150143] D [client3_1-fops.c:4429:client3_1_finodelk] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
Comment 1 Anush Shetty 2010-09-06 02:55:50 EDT
Created attachment 304
Comment 2 Raghavendra G 2010-09-07 01:28:51 EDT
*** Bug 1462 has been marked as a duplicate of this bug. ***
Comment 3 shishir gowda 2010-09-13 03:36:49 EDT
Not able to reproduce the issue.

State dump also does not indicate anything being blocked.
Comment 4 Raghavendra G 2010-10-04 23:54:37 EDT
*** Bug 1811 has been marked as a duplicate of this bug. ***
Comment 5 Raghavendra G 2010-10-04 23:58:04 EDT
* bug is reproducible even on a two node afr setup.
* looking at the logs, on client side, afr is not sending unlock to one of the locks granted to it and strangely it is sending unlock call to one of its children but not the other. This was verified by putting trace in between client and afr. Server side logs also confirm that the unlock call has not been received.
Comment 6 Raghavendra G 2010-10-04 23:58:54 EDT
*** Bug 1559 has been marked as a duplicate of this bug. ***
Comment 7 Raghavendra G 2010-10-19 23:25:07 EDT
*** Bug 1881 has been marked as a duplicate of this bug. ***
Comment 8 Raghavendra G 2010-11-08 02:03:02 EST
Another symptom of this bug is ping timer getting expired on rdma setup (bugs 2060 and 2043). rdma has a quota mechanism for managing the receive buffers. In order to send a request, there should be non-zero quota available and this quota is returned to pool once a reply has come back. Since finodelk requests are blocked on server side, ping request could not be transmitted to server thereby resulting in expiry of ping timer. On looking through logs, I found that afr did not send UNLOCK request to one of its children. Below are the logs
Observe the lock requests on ino=32623548.

[2010-11-08 00:54:03.954115] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954151] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954228] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954252] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954269] I [trace.c:1364:trace_finodelk] trace-0: 9483: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744073670823160)
[2010-11-08 00:54:03.954351] I [trace.c:1183:trace_fxattrop_cbk] trace-0: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954425] I [trace.c:1183:trace_fxattrop_cbk] trace-1: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954441] I [trace.c:1364:trace_finodelk] trace-0: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954468] I [trace.c:1364:trace_finodelk] trace-1: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954502] I [trace.c:840:trace_flush_cbk] trace-1: 9458: (op_ret=0, op_errno=0)
[2010-11-08 00:54:03.954521] I [trace.c:1364:trace_finodelk] trace-1: 9458: volume=rdma-replicate-0, (fd =0x2aaaae18103c, ino=32623972}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072292244719)
[2010-11-08 00:54:03.954553] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9485: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954575] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954590] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=0)
[2010-11-08 00:54:03.954621] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954640] I [trace.c:1364:trace_finodelk] trace-1: 9483: volume=rdma-replicate-0, (fd =0x2aaaae1
81090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=13935544)
[2010-11-08 00:54:03.954710] I [trace.c:1424:trace_lookup] trace-0: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954755] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954790] I [trace.c:1424:trace_lookup] trace-1: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954882] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9459: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954902] I [trace.c:1822:trace_flush] trace-1: 9459: (*fd=0x2aaaae18103c)
[2010-11-08 00:54:03.955049] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9458: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955123] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955139] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744072283501362)

As can be seen from logs, a SETLK request was sent on both trace-0 and trace-1 followed by an UNLOCK request only on trace-1.
Comment 9 Raghavendra G 2010-11-08 04:23:53 EST
Ignore my previous comment related to this bug similar to 2060 and 2043. I was wrong in my analysis. Also ignore the comment related to afr not sending UNLOCK to server.
Comment 10 Raghavendra G 2011-02-22 22:33:31 EST
The bug is no longer reproducible on latest git pull (26cedae57d5b7cb8). Hence marking this bug as resolved.