Description of problem:
bonnie and iozone tests hang intermittently for a stripe volume exported over NFS.

Version-Release number of selected component (if applicable):
Mainline

How reproducible:
Intermittent.

Steps to Reproduce:
1. Create a stripe volume and mount it over NFS (see the sketch after the log excerpt).
2. Run the sanity tests.

Actual results:
Sometimes bonnie and iozone hang without completing the test; the behavior is not consistent. Once the NFS client hangs, the process cannot be killed. dmesg reports that the NFS server is not responding, and the machine has to be rebooted in order to reclaim the mount directory.

Expected results:

Additional info:
[2012-03-29 03:27:47.532074] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535522] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535575] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537473] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537526] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562032] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562085] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565403] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565454] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592455] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592507] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620388] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620458] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:31:52.334110] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.334657] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 68eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.336999] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.337023] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 69eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338183] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338207] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6aeacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338752] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338784] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6beacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
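For reference, a minimal reproduction sketch. The host names, brick paths, and mount point below are placeholders, and it assumes a GlusterFS release that still supports stripe volumes and serves NFSv3 via gNFS:

    # on a gluster peer: create and start a 2-way stripe volume
    gluster volume create stripe-vol stripe 2 server1:/export/brick1 server2:/export/brick1
    gluster volume start stripe-vol

    # on the client: mount over NFSv3 (gNFS serves v3 only) and run the tests
    mount -t nfs -o vers=3,tcp server1:/stripe-vol /mnt/stripe
    cd /mnt/stripe && bonnie++ -u root && iozone -a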
I ran into this issue again. I could get the kernel stack trace of the gNFS process (/proc/<pid>/stack):

[root@QA-49 ~]# cat /proc/16733/stack
[<ffffffffa03f9f24>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa0408f49>] nfs_commit_inode+0xa9/0x250 [nfs]
[<ffffffffa03f60c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b37f>] shrink_zone+0x38f/0x520
[<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
[<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
[<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
[<ffffffff81042b39>] __do_page_fault+0x139/0x480
[<ffffffff814f248e>] do_page_fault+0x3e/0xa0
[<ffffffff814ef845>] page_fault+0x25/0x30
[<ffffffff814260c9>] skb_copy_datagram_iovec+0x159/0x2c0
[<ffffffff81472235>] tcp_recvmsg+0xca5/0xe90
[<ffffffff8141c1b9>] sock_common_recvmsg+0x39/0x50
[<ffffffff8141bf51>] sock_aio_read+0x181/0x190
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff81177563>] vfs_readv+0x43/0x60
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

-----------------------------------------------------------

From the above backtrace, it looks like a readv() call reading network data took a page fault; the resulting allocation went into direct reclaim, which blocked in nfs_commit_inode() waiting on the NFS server, leaving the process in "D" state.

[root@QA-49 ~]# ps uaxww | grep nfs
root 16733 0.9 10.0 537752 206564 ? Dsl Mar28 17:46 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /etc/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/77970b6a4863ead1be3b3b44b26f2dc5.socket

Since the process is in "D" state, it cannot be attached with gdb, strace, and the like. I'll try to run the test again and see if the issue reproduces with the same backtrace.
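Since gdb and strace cannot attach in this state, /proc/<pid>/stack is the only usable probe. A small sketch (assuming root and a kernel that exposes /proc/<pid>/stack, as used above) to capture the kernel stacks of every "D"-state process the next time the hang occurs:

    # dump kernel stacks of all processes in uninterruptible sleep
    ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/ {print $1, $3}' |
    while read pid comm; do
        echo "=== PID $pid ($comm) ==="
        cat /proc/"$pid"/stack 2>/dev/null
    done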
When the client and the server are on the same host, this deadlock is known to occur. Can you reproduce this with the two of them on separate machines? If the same problem is seen, I request the following (a collection sketch follows the list):
1. relevant logs (bricks, gNFS)
2. stack traces of the bonnie++, gNFS, and brick processes
3. free -m and top output
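A sketch for collecting the requested data; the nfs.log path matches the /usr/local prefix visible in the ps output above, while the brick log directory and output paths are assumptions to adjust for your install:

    # 1. relevant logs (adjust the brick log path to your install)
    mkdir -p /tmp/diag
    cp /usr/local/var/log/glusterfs/nfs.log /tmp/diag/
    cp /usr/local/var/log/glusterfs/bricks/*.log /tmp/diag/ 2>/dev/null

    # 2. kernel stack traces of bonnie++, gNFS, and brick processes
    for pid in $(pgrep -f 'bonnie|glusterfs'); do
        cat /proc/"$pid"/stack > /tmp/diag/stack."$pid" 2>/dev/null
    done

    # 3. memory and process snapshots
    free -m > /tmp/diag/free.txt
    top -b -n 1 > /tmp/diag/top.txt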
With the client and server on different hosts, this issue is not seen.