Bug 783973

Summary: [4e92c58d27b5cea9d7346d6dd88be9b3479c9e3b]: application on nfs client hung
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: replicateAssignee: Pranith Kumar K <pkarampu>
Status: CLOSED WORKSFORME QA Contact: Raghavendra Bhat <rabhat>
Severity: high Docs Contact:
Priority: high    
Version: mainlineCC: gluster-bugs, rajesh, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-13 09:54:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
statedump of nfs server process
none
program of fs-perf-test tool which caused this issue none

Description Raghavendra Bhat 2012-01-23 12:11:20 UTC
Description of problem:
On a replicate volume with replica count 2, had 1 fuse and 1 nfs client. On nfs client running fs-perf-test with 1024 fds as argument. After some time the tests were hung.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
nfs server log messages:

[2012-01-23 16:41:54.936212] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936225] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936309] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/
usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_
1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed
[2012-01-23 16:41:54.936328] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936341] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936446] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/
usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_
1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed
[2012-01-23 16:41:54.936466] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936479] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936560] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/
usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_
1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed
[2012-01-23 16:41:54.936580] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936592] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936675] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/
usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_
1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed
[2012-01-23 16:41:54.936704] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936718] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936771] W [client-lk.c:384:delete_granted_locks_owner] 0-mirror-client-0: fdctx not valid
[2012-01-23 16:41:54.936834] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/
usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_
1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed
[2012-01-23 16:41:54.936853] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed
[2012-01-23 16:41:54.936866] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.936941] I [client-handshake.c:917:client_setvolume_cbk] 0-mirror-client-1: Connected to 127.0.0.1:24010, attached to remo
te volume '/mnt/sda8/export2'.
[2012-01-23 16:41:54.936961] I [afr-common.c:3473:afr_notify] 0-mirror-replicate-0: subvol 1 came up, start crawl
[2012-01-23 16:41:54.937017] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.937034] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid 
argument
[2012-01-23 16:41:54.937104] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.937120] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid 
argument
[2012-01-23 16:41:54.937170] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.937187] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid 
argument
[2012-01-23 16:41:54.937264] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
[2012-01-23 16:41:54.937282] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid 
argument
[2012-01-23 16:41:54.937337] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument
:


Expected results:


Additional info:

dmesg says this:
[25283.748598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25283.748601] glusterfsd      D 0000000000000001     0  5971      1 0x00000000
[25283.748606]  ffff8800af3d5468 0000000000000086 ffff8800af3d5fd8 ffff8800af3d4000
[25283.748610]  0000000000013d00 ffff880094aa5f38 ffff8800af3d5fd8 0000000000013d00
[25283.748614]  ffff880137b38000 ffff880094aa5b80 ffff8800bac876e0 ffff8800baa53d00
[25283.748618] Call Trace:
[25283.748634]  [<ffffffffa05fe6b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[25283.748639]  [<ffffffff815c0980>] io_schedule+0x70/0xc0
[25283.748647]  [<ffffffffa05fe6be>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
[25283.748651]  [<ffffffff815c12ff>] __wait_on_bit+0x5f/0x90
[25283.748658]  [<ffffffffa05fe6b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[25283.748662]  [<ffffffff815c13ac>] out_of_line_wait_on_bit+0x7c/0x90
[25283.748666]  [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50
[25283.748670]  [<ffffffff81124507>] ? vma_prio_tree_next+0x47/0x70
[25283.748678]  [<ffffffffa05fea66>] nfs_wait_on_request+0x36/0x40 [nfs]
[25283.748686]  [<ffffffffa0603038>] nfs_find_and_lock_request+0x88/0xb0 [nfs]
[25283.748694]  [<ffffffffa0604e51>] nfs_migrate_page+0x41/0x100 [nfs]
[25283.748699]  [<ffffffff8115957a>] move_to_new_page+0x8a/0x120
[25283.748703]  [<ffffffff81159aa6>] unmap_and_move+0x2f6/0x350
[25283.748707]  [<ffffffff8114e7b0>] ? compaction_alloc+0x0/0x70
[25283.748710]  [<ffffffff81159cf2>] migrate_pages+0xa2/0x150
[25283.748714]  [<ffffffff8114e984>] compact_zone+0xa4/0x220
[25283.748717]  [<ffffffff8114eccc>] compact_zone_order+0xac/0xf0
[25283.748721]  [<ffffffff8111344e>] ? prep_new_page+0x18e/0x1b0
[25283.748725]  [<ffffffff8114eddd>] try_to_compact_pages+0xcd/0x100
[25283.748728]  [<ffffffff811140ff>] __alloc_pages_direct_compact+0xcf/0x190
[25283.748732]  [<ffffffff811145fd>] __alloc_pages_nodemask+0x43d/0x840
[25283.748736]  [<ffffffff81149f45>] alloc_pages_current+0xa5/0x110
[25283.748739]  [<ffffffff81153695>] new_slab+0x1f5/0x290
[25283.748742]  [<ffffffff81155132>] __slab_alloc+0x1b2/0x390
[25283.748747]  [<ffffffff81513321>] ? sk_stream_alloc_skb+0x41/0x130
[25283.748750]  [<ffffffff811584eb>] __kmalloc_node_track_caller+0x9b/0x1a0
[25283.748754]  [<ffffffff81513321>] ? sk_stream_alloc_skb+0x41/0x130
[25283.748758]  [<ffffffff814c5473>] __alloc_skb+0x83/0x170
[25283.748762]  [<ffffffff81513321>] sk_stream_alloc_skb+0x41/0x130
[25283.748766]  [<ffffffff815137ec>] tcp_sendmsg+0x3dc/0xc80
[25283.748770]  [<ffffffff81057310>] ? reweight_entity+0x150/0x160
[25283.748774]  [<ffffffff81538d04>] inet_sendmsg+0x64/0xb0
[25283.748779]  [<ffffffff812ad947>] ? apparmor_socket_sendmsg+0x17/0x20
[25283.748782]  [<ffffffff814bab4e>] sock_aio_write+0x14e/0x160
[25283.748785]  [<ffffffff812e0f7e>] ? rb_erase+0xce/0x150
[25283.748788]  [<ffffffff814baa00>] ? sock_aio_write+0x0/0x160
[25283.748792]  [<ffffffff81165383>] do_sync_readv_writev+0xd3/0x110
[25283.748795]  [<ffffffff8109b2e4>] ? futex_wait+0x204/0x300
[25283.748800]  [<ffffffff812ae7e8>] ? apparmor_file_permission+0x18/0x20
[25283.748804]  [<ffffffff81279bac>] ? security_file_permission+0x2c/0xb0
[25283.748807]  [<ffffffff81164ab1>] ? rw_verify_area+0x61/0xf0
[25283.748810]  [<ffffffff81165654>] do_readv_writev+0xd4/0x1e0
[25283.748814]  [<ffffffff8109ce7b>] ? do_futex+0xbb/0x210
[25283.748817]  [<ffffffff811657a8>] vfs_writev+0x48/0x60
[25283.748820]  [<ffffffff81165931>] sys_writev+0x51/0xc0
[25283.748824]  [<ffffffff8100c002>] system_call_fastpath+0x16/0x1b

Comment 1 Raghavendra Bhat 2012-01-23 12:13:13 UTC
Created attachment 556947 [details]
statedump of nfs server process

Comment 2 Raghavendra Bhat 2012-01-23 12:13:57 UTC
Created attachment 556951 [details]
program of fs-perf-test tool which caused this issue

Comment 3 Rajesh 2012-03-15 10:45:50 UTC
Hi johnny, is it consistently reproducible?

Comment 4 Rajesh 2012-04-04 11:15:42 UTC
this looks like a bug in afr. reassigning to pranith.

Comment 5 Pranith Kumar K 2012-05-25 10:31:17 UTC
This bug is never recreated. Removing it from the blockers list.

Comment 6 Pranith Kumar K 2012-06-13 09:54:28 UTC
I dont see any problem with the afr [f]inodelk code at the moment and it never happened after the very first time. Closing with works for me.