Untaring of kernel source hung due to frame bail out. This was on a 2x3 DHT+AFR setup. [2010-09-06 06:09:12.788157] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.791385] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP [2010-09-06 06:09:12.792721] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.797044] D [dht-diskusage.c:257:dht_free_disk_available_subvol] 0532_06_09-dht: no subvolume has enough free space to create [2010-09-06 06:09:12.797284] D [afr-transaction.c:960:afr_post_nonblocking_entrylk_cbk] 0532_06_09-replicate-0: Non blocking entrylks done. Proceeding to FOP [2010-09-06 06:09:12.798712] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.799193] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP [2010-09-06 06:09:12.800034] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.800287] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP [2010-09-06 06:09:12.810507] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.810769] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.810787] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking [2010-09-06 06:09:12.811115] D [read-ahead.c:463:ra_readv] 0532_06_09-read-ahead: unexpected offset (0 != 262144) resetting [2010-09-06 06:09:12.811466] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.854320] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl. [2010-09-06 06:09:12.854385] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl. [2010-09-06 06:09:12.854447] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl. [2010-09-06 06:09:12.854748] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:12.854767] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking [2010-09-06 06:09:15.149463] E [rpc-clnt.c:196:call_bail] 0532_06_09-client-2: bailing out frame type(GlusterFS 3.1) op(FINODELK(30)) sent = 2010-09-06 05:39:07. timeout = 1800 [2010-09-06 06:09:15.149532] D [afr-lk-common.c:983:afr_lock_blocking] 0532_06_09-replicate-0: we're done locking [2010-09-06 06:09:15.149550] D [afr-transaction.c:887:afr_post_blocking_inodelk_cbk] 0532_06_09-replicate-0: Blocking inodelks done. Proceeding to FOP [2010-09-06 06:09:15.149573] D [client3_1-fops.c:3505:client3_1_flush] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD [2010-09-06 06:09:15.149949] D [client3_1-fops.c:4148:client3_1_fxattrop] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD [2010-09-06 06:09:15.150122] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction [2010-09-06 06:09:15.150143] D [client3_1-fops.c:4429:client3_1_finodelk] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
Created attachment 304
*** Bug 1462 has been marked as a duplicate of this bug. ***
Not able to reproduce the issue. State dump also does not indicate anything being blocked.
*** Bug 1811 has been marked as a duplicate of this bug. ***
* bug is reproducible even on a two node afr setup. * looking at the logs, on client side, afr is not sending unlock to one of the locks granted to it and strangely it is sending unlock call to one of its children but not the other. This was verified by putting trace in between client and afr. Server side logs also confirm that the unlock call has not been received.
*** Bug 1559 has been marked as a duplicate of this bug. ***
*** Bug 1881 has been marked as a duplicate of this bug. ***
Another symptom of this bug is ping timer getting expired on rdma setup (bugs 2060 and 2043). rdma has a quota mechanism for managing the receive buffers. In order to send a request, there should be non-zero quota available and this quota is returned to pool once a reply has come back. Since finodelk requests are blocked on server side, ping request could not be transmitted to server thereby resulting in expiry of ping timer. On looking through logs, I found that afr did not send UNLOCK request to one of its children. Below are the logs Observe the lock requests on ino=32623548. [2010-11-08 00:54:03.954115] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992) [2010-11-08 00:54:03.954151] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992) [2010-11-08 00:54:03.954228] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=-1, op_errno=11 [2010-11-08 00:54:03.954252] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9483: op_ret=-1, op_errno=11 [2010-11-08 00:54:03.954269] I [trace.c:1364:trace_finodelk] trace-0: 9483: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744073670823160) [2010-11-08 00:54:03.954351] I [trace.c:1183:trace_fxattrop_cbk] trace-0: 9482: (op_ret=0, op_errno=22) [2010-11-08 00:54:03.954425] I [trace.c:1183:trace_fxattrop_cbk] trace-1: 9482: (op_ret=0, op_errno=22) [2010-11-08 00:54:03.954441] I [trace.c:1364:trace_finodelk] trace-0: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008) [2010-11-08 00:54:03.954468] I [trace.c:1364:trace_finodelk] trace-1: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008) [2010-11-08 00:54:03.954502] I [trace.c:840:trace_flush_cbk] trace-1: 9458: (op_ret=0, op_errno=0) [2010-11-08 00:54:03.954521] I [trace.c:1364:trace_finodelk] trace-1: 9458: volume=rdma-replicate-0, (fd =0x2aaaae18103c, ino=32623972}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072292244719) [2010-11-08 00:54:03.954553] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9485: op_ret=-1, op_errno=11 [2010-11-08 00:54:03.954575] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954590] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=0) [2010-11-08 00:54:03.954621] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954640] I [trace.c:1364:trace_finodelk] trace-1: 9483: volume=rdma-replicate-0, (fd =0x2aaaae1 81090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=13935544) [2010-11-08 00:54:03.954710] I [trace.c:1424:trace_lookup] trace-0: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322}) [2010-11-08 00:54:03.954755] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9482: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954790] I [trace.c:1424:trace_lookup] trace-1: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322}) [2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954882] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9459: op_ret=0, op_errno=0 [2010-11-08 00:54:03.954902] I [trace.c:1822:trace_flush] trace-1: 9459: (*fd=0x2aaaae18103c) [2010-11-08 00:54:03.955049] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9458: op_ret=0, op_errno=0 [2010-11-08 00:54:03.955123] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0 [2010-11-08 00:54:03.955139] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744072283501362) As can be seen from logs, a SETLK request was sent on both trace-0 and trace-1 followed by an UNLOCK request only on trace-1.
Ignore my previous comment related to this bug similar to 2060 and 2043. I was wrong in my analysis. Also ignore the comment related to afr not sending UNLOCK to server.
The bug is no longer reproducible on latest git pull (26cedae57d5b7cb8). Hence marking this bug as resolved.