Bug 763263 - (GLUSTER-1531) FINODELK bailing out
FINODELK bailing out
Status: CLOSED WORKSFORME
Product: GlusterFS
Classification: Community
Component: rdma (Show other bugs)
mainline
All Linux
low Severity high
: ---
: ---
Assigned To: Raghavendra G
:
: 763194 GLUSTER-1559 763543 763613 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-09-06 02:14 EDT by Anush Shetty
Modified: 2015-12-01 11:45 EST (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: RTNR
Mount Type: fuse
Documentation: DNR
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
process state dump (96.26 KB, application/octet-stream)
2010-09-06 02:55 EDT, Anush Shetty
no flags Details

  None (edit)
Description Anush Shetty 2010-09-06 02:14:17 EDT
Untaring of kernel source hung due to frame bail out. This was on a 2x3 DHT+AFR setup. 

[2010-09-06 06:09:12.788157] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.791385] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.792721] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.797044] D [dht-diskusage.c:257:dht_free_disk_available_subvol] 0532_06_09-dht: no subvolume has enough free space to create
[2010-09-06 06:09:12.797284] D [afr-transaction.c:960:afr_post_nonblocking_entrylk_cbk] 0532_06_09-replicate-0: Non blocking entrylks done. Proceeding to FOP
[2010-09-06 06:09:12.798712] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.799193] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.800034] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.800287] D [afr-transaction.c:912:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:12.810507] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810769] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.810787] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:12.811115] D [read-ahead.c:463:ra_readv] 0532_06_09-read-ahead: unexpected offset (0 != 262144) resetting
[2010-09-06 06:09:12.811466] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854320] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854385] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854447] D [afr-common.c:554:afr_lookup_collect_xattr] 0532_06_09-replicate-0: data self-heal is pending for /linux-2.6.36-rc3/Documentation/DocBook/genericirq.tmpl.
[2010-09-06 06:09:12.854748] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:12.854767] D [afr-transaction.c:906:afr_post_nonblocking_inodelk_cbk] 0532_06_09-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2010-09-06 06:09:15.149463] E [rpc-clnt.c:196:call_bail] 0532_06_09-client-2: bailing out frame type(GlusterFS 3.1) op(FINODELK(30)) sent = 2010-09-06 05:39:07. timeout = 1800
[2010-09-06 06:09:15.149532] D [afr-lk-common.c:983:afr_lock_blocking] 0532_06_09-replicate-0: we're done locking
[2010-09-06 06:09:15.149550] D [afr-transaction.c:887:afr_post_blocking_inodelk_cbk] 0532_06_09-replicate-0: Blocking inodelks done. Proceeding to FOP
[2010-09-06 06:09:15.149573] D [client3_1-fops.c:3505:client3_1_flush] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.149949] D [client3_1-fops.c:4148:client3_1_fxattrop] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
[2010-09-06 06:09:15.150122] D [afr-lk-common.c:410:transaction_lk_op] 0532_06_09-replicate-0: lk op is for a transaction
[2010-09-06 06:09:15.150143] D [client3_1-fops.c:4429:client3_1_finodelk] 0532_06_09-client-0: (547358544): failed to get fd ctx. EBADFD
Comment 1 Anush Shetty 2010-09-06 02:55:50 EDT
Created attachment 304
Comment 2 Raghavendra G 2010-09-07 01:28:51 EDT
*** Bug 1462 has been marked as a duplicate of this bug. ***
Comment 3 shishir gowda 2010-09-13 03:36:49 EDT
Not able to reproduce the issue.

State dump also does not indicate anything being blocked.
Comment 4 Raghavendra G 2010-10-04 23:54:37 EDT
*** Bug 1811 has been marked as a duplicate of this bug. ***
Comment 5 Raghavendra G 2010-10-04 23:58:04 EDT
* bug is reproducible even on a two node afr setup.
* looking at the logs, on client side, afr is not sending unlock to one of the locks granted to it and strangely it is sending unlock call to one of its children but not the other. This was verified by putting trace in between client and afr. Server side logs also confirm that the unlock call has not been received.
Comment 6 Raghavendra G 2010-10-04 23:58:54 EDT
*** Bug 1559 has been marked as a duplicate of this bug. ***
Comment 7 Raghavendra G 2010-10-19 23:25:07 EDT
*** Bug 1881 has been marked as a duplicate of this bug. ***
Comment 8 Raghavendra G 2010-11-08 02:03:02 EST
Another symptom of this bug is ping timer getting expired on rdma setup (bugs 2060 and 2043). rdma has a quota mechanism for managing the receive buffers. In order to send a request, there should be non-zero quota available and this quota is returned to pool once a reply has come back. Since finodelk requests are blocked on server side, ping request could not be transmitted to server thereby resulting in expiry of ping timer. On looking through logs, I found that afr did not send UNLOCK request to one of its children. Below are the logs
Observe the lock requests on ino=32623548.

[2010-11-08 00:54:03.954115] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954151] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=WRITE, start=0, len=0, pid=18446744073606080992)
[2010-11-08 00:54:03.954228] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954252] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9483: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954269] I [trace.c:1364:trace_finodelk] trace-0: 9483: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744073670823160)
[2010-11-08 00:54:03.954351] I [trace.c:1183:trace_fxattrop_cbk] trace-0: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954425] I [trace.c:1183:trace_fxattrop_cbk] trace-1: 9482: (op_ret=0, op_errno=22)
[2010-11-08 00:54:03.954441] I [trace.c:1364:trace_finodelk] trace-0: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954468] I [trace.c:1364:trace_finodelk] trace-1: 9482: volume=rdma-replicate-0, (fd =0x2aaaae181090, ino=46006322}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072335397008)
[2010-11-08 00:54:03.954502] I [trace.c:840:trace_flush_cbk] trace-1: 9458: (op_ret=0, op_errno=0)
[2010-11-08 00:54:03.954521] I [trace.c:1364:trace_finodelk] trace-1: 9458: volume=rdma-replicate-0, (fd =0x2aaaae18103c, ino=32623972}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=18446744072292244719)
[2010-11-08 00:54:03.954553] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9485: op_ret=-1, op_errno=11
[2010-11-08 00:54:03.954575] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954590] I [trace.c:1364:trace_finodelk] trace-1: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLK, type=UNLOCK, start=0, len=0, pid=0)
[2010-11-08 00:54:03.954621] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9483: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954640] I [trace.c:1364:trace_finodelk] trace-1: 9483: volume=rdma-replicate-0, (fd =0x2aaaae1
81090, ino=46006322}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=13935544)
[2010-11-08 00:54:03.954710] I [trace.c:1424:trace_lookup] trace-0: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954755] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954790] I [trace.c:1424:trace_lookup] trace-1: 9486: (loc {path=/glusterfs/build/conf1702.sh, ino=46006322})
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954832] I [trace.c:1301:trace_finodelk_cbk] trace-0: 9482: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954882] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9459: op_ret=0, op_errno=0
[2010-11-08 00:54:03.954902] I [trace.c:1822:trace_flush] trace-1: 9459: (*fd=0x2aaaae18103c)
[2010-11-08 00:54:03.955049] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9458: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955123] I [trace.c:1301:trace_finodelk_cbk] trace-1: 9485: op_ret=0, op_errno=0
[2010-11-08 00:54:03.955139] I [trace.c:1364:trace_finodelk] trace-0: 9485: volume=rdma-replicate-0, (fd =0x2aaaae181138, ino=32623548}, cmd=SETLKW, type=WRITE, start=0, len=0, pid=18446744072283501362)

As can be seen from logs, a SETLK request was sent on both trace-0 and trace-1 followed by an UNLOCK request only on trace-1.
Comment 9 Raghavendra G 2010-11-08 04:23:53 EST
Ignore my previous comment related to this bug similar to 2060 and 2043. I was wrong in my analysis. Also ignore the comment related to afr not sending UNLOCK to server.
Comment 10 Raghavendra G 2011-02-22 22:33:31 EST
The bug is no longer reproducible on latest git pull (26cedae57d5b7cb8). Hence marking this bug as resolved.

Note You need to log in before you can comment on or make changes to this bug.