Hide Forgot
Description of problem: On a replicate volume with replica count 2, had 1 fuse and 1 nfs client. On nfs client running fs-perf-test with 1024 fds as argument. After some time the tests were hung. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: nfs server log messages: [2012-01-23 16:41:54.936212] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936225] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936309] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/ usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_ 1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed [2012-01-23 16:41:54.936328] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936341] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936446] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/ usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_ 1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed [2012-01-23 16:41:54.936466] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936479] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936560] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/ usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_ 1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed [2012-01-23 16:41:54.936580] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936592] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936675] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/ usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_ 1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed [2012-01-23 16:41:54.936704] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936718] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936771] W [client-lk.c:384:delete_granted_locks_owner] 0-mirror-client-0: fdctx not valid [2012-01-23 16:41:54.936834] W [rpc-common.c:64:xdr_to_generic] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x29f) [0x7f57af319d5c] (-->/ usr/local/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x20e) [0x7f57af3199fa] (-->/usr/local/lib/glusterfs/3git/xlator/protocol/client.so(client3_ 1_finodelk_cbk+0x92) [0x7f57ac5d5491]))) 0-xdr: XDR decoding failed [2012-01-23 16:41:54.936853] E [client3_1-fops.c:1263:client3_1_finodelk_cbk] 0-mirror-client-1: XDR decoding failed [2012-01-23 16:41:54.936866] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.936941] I [client-handshake.c:917:client_setvolume_cbk] 0-mirror-client-1: Connected to 127.0.0.1:24010, attached to remo te volume '/mnt/sda8/export2'. [2012-01-23 16:41:54.936961] I [afr-common.c:3473:afr_notify] 0-mirror-replicate-0: subvol 1 came up, start crawl [2012-01-23 16:41:54.937017] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.937034] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid argument [2012-01-23 16:41:54.937104] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.937120] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid argument [2012-01-23 16:41:54.937170] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.937187] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid argument [2012-01-23 16:41:54.937264] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument [2012-01-23 16:41:54.937282] E [afr-lk-common.c:567:afr_unlock_inodelk_cbk] 0-mirror-replicate-0: (null): unlock failed on 1, reason: Invalid argument [2012-01-23 16:41:54.937337] W [client3_1-fops.c:1273:client3_1_finodelk_cbk] 0-mirror-client-1: remote operation failed: Invalid argument : Expected results: Additional info: dmesg says this: [25283.748598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [25283.748601] glusterfsd D 0000000000000001 0 5971 1 0x00000000 [25283.748606] ffff8800af3d5468 0000000000000086 ffff8800af3d5fd8 ffff8800af3d4000 [25283.748610] 0000000000013d00 ffff880094aa5f38 ffff8800af3d5fd8 0000000000013d00 [25283.748614] ffff880137b38000 ffff880094aa5b80 ffff8800bac876e0 ffff8800baa53d00 [25283.748618] Call Trace: [25283.748634] [<ffffffffa05fe6b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs] [25283.748639] [<ffffffff815c0980>] io_schedule+0x70/0xc0 [25283.748647] [<ffffffffa05fe6be>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs] [25283.748651] [<ffffffff815c12ff>] __wait_on_bit+0x5f/0x90 [25283.748658] [<ffffffffa05fe6b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs] [25283.748662] [<ffffffff815c13ac>] out_of_line_wait_on_bit+0x7c/0x90 [25283.748666] [<ffffffff81087f70>] ? wake_bit_function+0x0/0x50 [25283.748670] [<ffffffff81124507>] ? vma_prio_tree_next+0x47/0x70 [25283.748678] [<ffffffffa05fea66>] nfs_wait_on_request+0x36/0x40 [nfs] [25283.748686] [<ffffffffa0603038>] nfs_find_and_lock_request+0x88/0xb0 [nfs] [25283.748694] [<ffffffffa0604e51>] nfs_migrate_page+0x41/0x100 [nfs] [25283.748699] [<ffffffff8115957a>] move_to_new_page+0x8a/0x120 [25283.748703] [<ffffffff81159aa6>] unmap_and_move+0x2f6/0x350 [25283.748707] [<ffffffff8114e7b0>] ? compaction_alloc+0x0/0x70 [25283.748710] [<ffffffff81159cf2>] migrate_pages+0xa2/0x150 [25283.748714] [<ffffffff8114e984>] compact_zone+0xa4/0x220 [25283.748717] [<ffffffff8114eccc>] compact_zone_order+0xac/0xf0 [25283.748721] [<ffffffff8111344e>] ? prep_new_page+0x18e/0x1b0 [25283.748725] [<ffffffff8114eddd>] try_to_compact_pages+0xcd/0x100 [25283.748728] [<ffffffff811140ff>] __alloc_pages_direct_compact+0xcf/0x190 [25283.748732] [<ffffffff811145fd>] __alloc_pages_nodemask+0x43d/0x840 [25283.748736] [<ffffffff81149f45>] alloc_pages_current+0xa5/0x110 [25283.748739] [<ffffffff81153695>] new_slab+0x1f5/0x290 [25283.748742] [<ffffffff81155132>] __slab_alloc+0x1b2/0x390 [25283.748747] [<ffffffff81513321>] ? sk_stream_alloc_skb+0x41/0x130 [25283.748750] [<ffffffff811584eb>] __kmalloc_node_track_caller+0x9b/0x1a0 [25283.748754] [<ffffffff81513321>] ? sk_stream_alloc_skb+0x41/0x130 [25283.748758] [<ffffffff814c5473>] __alloc_skb+0x83/0x170 [25283.748762] [<ffffffff81513321>] sk_stream_alloc_skb+0x41/0x130 [25283.748766] [<ffffffff815137ec>] tcp_sendmsg+0x3dc/0xc80 [25283.748770] [<ffffffff81057310>] ? reweight_entity+0x150/0x160 [25283.748774] [<ffffffff81538d04>] inet_sendmsg+0x64/0xb0 [25283.748779] [<ffffffff812ad947>] ? apparmor_socket_sendmsg+0x17/0x20 [25283.748782] [<ffffffff814bab4e>] sock_aio_write+0x14e/0x160 [25283.748785] [<ffffffff812e0f7e>] ? rb_erase+0xce/0x150 [25283.748788] [<ffffffff814baa00>] ? sock_aio_write+0x0/0x160 [25283.748792] [<ffffffff81165383>] do_sync_readv_writev+0xd3/0x110 [25283.748795] [<ffffffff8109b2e4>] ? futex_wait+0x204/0x300 [25283.748800] [<ffffffff812ae7e8>] ? apparmor_file_permission+0x18/0x20 [25283.748804] [<ffffffff81279bac>] ? security_file_permission+0x2c/0xb0 [25283.748807] [<ffffffff81164ab1>] ? rw_verify_area+0x61/0xf0 [25283.748810] [<ffffffff81165654>] do_readv_writev+0xd4/0x1e0 [25283.748814] [<ffffffff8109ce7b>] ? do_futex+0xbb/0x210 [25283.748817] [<ffffffff811657a8>] vfs_writev+0x48/0x60 [25283.748820] [<ffffffff81165931>] sys_writev+0x51/0xc0 [25283.748824] [<ffffffff8100c002>] system_call_fastpath+0x16/0x1b
Created attachment 556947 [details] statedump of nfs server process
Created attachment 556951 [details] program of fs-perf-test tool which caused this issue
Hi johnny, is it consistently reproducible?
this looks like a bug in afr. reassigning to pranith.
This bug is never recreated. Removing it from the blockers list.
I dont see any problem with the afr [f]inodelk code at the moment and it never happened after the very first time. Closing with works for me.