Bug 763263 (GLUSTER-1531) - FINODELK bailing out
Summary: FINODELK bailing out
Keywords:
Status: CLOSED WORKSFORME
Alias: GLUSTER-1531
Product: GlusterFS
Classification: Community
Component: rdma
Version: mainline
Hardware: All
OS: Linux
low
high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
: 763194 GLUSTER-1559 763543 763613 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-09-06 06:14 UTC by Anush Shetty
Modified: 2015-12-01 16:45 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTNR
Mount Type: fuse
Documentation: DNR
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
process state dump (96.26 KB, application/octet-stream)
2010-09-06 06:55 UTC, Anush Shetty
no flags Details

Description Anush Shetty 2010-09-06 06:14:17 UTC
Untaring of kernel source hung due to frame bail out. This was on a 2x3 DHT+AFR setup. 

[2010-09-06 06:09:12.788157] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.791385] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.792721] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.797044] D [dht-diskusage.c:257:dht_free_disk_available_subvol] 0532_06_09-dht: no subvolume has enough free space to create
[2010-09-06 06:09:12.797284] D [afr-transaction.c:960:afr_post_nonblocking_entrylk_cbk] 0532_06_09-replicate-0: Non blocking entrylks done. Proceeding to FOP
[2010-09-06 06:09:12.798712] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.799193] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.800034] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.800287] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.810507] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810769] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810787] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:12.811115] D [read-ahead.c:463:ra_readv] 0532_06_09-read-ahead: unexpected offset (0 != 262144) resetting
[2010-09-06 06:09:12.811466] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854320] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854385] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854447] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854748] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854767] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:15.149463] E [rpc-clnt.c:196:call_bail] 0532_06_09-client-2: bailing out frame type(GlusterFS 3.1) op(FINODELK(30)) sent = 2010-09-06 05:39:07. timeout = 1800
[2010-09-06 06:09:15.149532] D [afr-lk-common.c:983:afr_lock_blocking] 0532_06_09-replicate-0: we're done locking
[2010-09-06 06:09:15.149550] D [afr-transaction.c:887:afr_post_blocking_inodelk_cbk] 0532_06_09-replicate-0: Blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:15.149573] D [client3_1-fops.c:3505:client3_1_flush] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.149949] D [client3_1-fops.c:4148:client3_1_fxattrop] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.150122] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:15.150143] D [client3_1-fops.c:4429:client3_1_finodelk] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD

Comment 1 Anush Shetty 2010-09-06 06:55:50 UTC
Created attachment 304

Comment 2 Raghavendra G 2010-09-07 05:28:51 UTC
*** Bug 1462 has been marked as a duplicate of this bug. ***

Comment 3 shishir gowda 2010-09-13 07:36:49 UTC
Not able to reproduce the issue.

State dump also does not indicate anything being blocked.

Comment 4 Raghavendra G 2010-10-05 03:54:37 UTC
*** Bug 1811 has been marked as a duplicate of this bug. ***

Comment 5 Raghavendra G 2010-10-05 03:58:04 UTC
* bug is reproducible even on a two node afr setup.
* looking at the logs, on client side, afr is not sending unlock to one of the locks granted to it and strangely it is sending unlock call to one of its children but not the other. This was verified by putting trace in between client and afr. Server side logs also confirm that the unlock call has not been received.

Comment 6 Raghavendra G 2010-10-05 03:58:54 UTC
*** Bug 1559 has been marked as a duplicate of this bug. ***

Comment 7 Raghavendra G 2010-10-20 03:25:07 UTC
*** Bug 1881 has been marked as a duplicate of this bug. ***

Comment 8 Raghavendra G 2010-11-08 07:03:02 UTC
Another symptom of this bug is ping timer getting expired on rdma setup (bugs 2060 and 2043). rdma has a quota mechanism for managing the receive buffers. In order to send a request, there should be non-zero quota available and this quota is returned to pool once a reply has come back. Since finodelk requests are blocked on server side, ping request could not be transmitted to server thereby resulting in expiry of ping timer. On looking through logs, I found that afr did not send UNLOCK request to one of its children. Below are the logs
Observe the lock requests on ino=32623548.

[2010-11-08 00:54:03.954115] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954151] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954228] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954252] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954269] I [trace.c:1364:trace_finodelk] trace-0: 9483: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744073670823160)
[2010-11-08 00:54:03.954351] I [trace.c:1183:trace_fxattrop_cbk] trace-0: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954425] I [trace.c:1183:trace_fxattrop_cbk] trace-1: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954441] I [trace.c:1364:trace_finodelk] trace-0: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954468] I [trace.c:1364:trace_finodelk] trace-1: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954502] I [trace.c:840:trace_flush_cbk] trace-1: 9458: (op_ret=0, op_errno=0)
[2010-11-08 00:54:03.954521] I [trace.c:1364:trace_finodelk] trace-1: 9458: volume=rdma-replicate-0, (fd =0x2aaaae18103c, ino=32623972}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072292244719)
[2010-11-08 00:54:03.954553] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9485: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954575] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954590] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=0)
[2010-11-08 00:54:03.954621] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954640] I [trace.c:1364:trace_finodelk] trace-1: 9483: volume=rdma-replicate-0, (fd =0x2aaaae1
81090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=13935544)
[2010-11-08 00:54:03.954710] I [trace.c:1424:trace_lookup] trace-0: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954755] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954790] I [trace.c:1424:trace_lookup] trace-1: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954882] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9459: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954902] I [trace.c:1822:trace_flush] trace-1: 9459: (*fd=0x2aaaae18103c)
[2010-11-08 00:54:03.955049] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9458: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955123] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955139] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744072283501362)

As can be seen from logs, a SETLK request was sent on both trace-0 and trace-1 followed by an UNLOCK request only on trace-1.

Comment 9 Raghavendra G 2010-11-08 09:23:53 UTC
Ignore my previous comment related to this bug similar to 2060 and 2043. I was wrong in my analysis. Also ignore the comment related to afr not sending UNLOCK to server.

Comment 10 Raghavendra G 2011-02-23 03:33:31 UTC
The bug is no longer reproducible on latest git pull (26cedae57d5b7cb8). Hence marking this bug as resolved.


Note You need to log in before you can comment on or make changes to this bug.