Bug 808079

Summary: Intermittent failures of bonnie and iozone on nfs
Product: [Community] GlusterFS Reporter: shylesh <shmohan>
Component: nfs Assignee: Rajesh <rajesh>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: pre-release CC: gluster-bugs, vagarwal, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-25 05:39:40 UTC Type: ---

Description shylesh 2012-03-29 14:04:44 UTC
Description of problem:
bonnie and iozone tests hang intermittently on a stripe volume mounted over NFS.

Version-Release number of selected component (if applicable):
Mainline

How reproducible:


Steps to Reproduce:
1. Create a stripe volume and mount it over NFS.
2. Run the sanity tests.

  
Actual results:
Sometimes bonnie and iozone hang without completing the test. The behavior is not consistent. Once the NFS client hangs, the process cannot be killed, and dmesg reports that the NFS server is not responding. The machine has to be rebooted in order to reclaim the mount directory.

Expected results:


Additional info:


[2012-03-29 03:27:47.532074] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535522] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535575] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537473] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537526] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562032] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562085] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565403] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565454] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592455] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592507] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620388] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620458] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:31:52.334110] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.334657] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 68eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.336999] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.337023] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 69eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338183] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338207] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6aeacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338752] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338784] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6beacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
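The excerpt above is repetitive, so a tally by severity and source function can make the pattern easier to see. The following is a minimal sketch, not part of the original report: the function name `tally_log_lines` and the regular expression are assumptions based only on the line layout visible in this excerpt (timestamp, level letter, `file:line:function`, translator name, message).

```python
import re
from collections import Counter

# Layout assumed from the excerpt above:
# [timestamp] LEVEL [file:line:function] translator: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): (?P<msg>.*)$"
)

def tally_log_lines(lines):
    """Count log lines per (level, function); lines that do not match are skipped."""
    counts = Counter()
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts
```

Feeding it the lines of nfs.log would show, for this excerpt, the unlink warnings from the stripe clients followed by the "Unable to resolve FH" errors from nfs3_access_resume.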

Comment 1 shylesh 2012-03-30 13:35:02 UTC
I ran into this issue again and was able to capture the kernel stack trace of the gNFS process (/proc/<pid>/stack):

[root@QA-49 ~]# cat /proc/16733/stack 
[<ffffffffa03f9f24>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa0408f49>] nfs_commit_inode+0xa9/0x250 [nfs]
[<ffffffffa03f60c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b37f>] shrink_zone+0x38f/0x520
[<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
[<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
[<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
[<ffffffff81042b39>] __do_page_fault+0x139/0x480
[<ffffffff814f248e>] do_page_fault+0x3e/0xa0
[<ffffffff814ef845>] page_fault+0x25/0x30
[<ffffffff814260c9>] skb_copy_datagram_iovec+0x159/0x2c0
[<ffffffff81472235>] tcp_recvmsg+0xca5/0xe90
[<ffffffff8141c1b9>] sock_common_recvmsg+0x39/0x50
[<ffffffff8141bf51>] sock_aio_read+0x181/0x190
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff81177563>] vfs_readv+0x43/0x60
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

-----------------------------------------------------------

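From the above backtrace, it looks like a readv call reading network data caused the process to end up in "D" (uninterruptible sleep) state: the page fault taken while copying the received data triggered memory reclaim, which is blocked in nfs_commit_inode.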

[root@QA-49 ~]# ps uaxww | grep nfs
root     16733  0.9 10.0 537752 206564 ?       Dsl  Mar28  17:46 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /etc/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/77970b6a4863ead1be3b3b44b26f2dc5.socket


Since the process is in "D" state, gdb, strace, and similar tools cannot attach to it.
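
The "D" state shown by ps can also be read directly from /proc/<pid>/stat, which is useful when scripting a check for a hung gNFS process. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names `parse_proc_state` and `is_uninterruptible` are hypothetical and not part of GlusterFS.

```python
from pathlib import Path

def parse_proc_state(stat_text: str) -> str:
    """Return the one-letter state field from the contents of /proc/<pid>/stat."""
    # The second field (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before reading the state field.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]

def is_uninterruptible(pid: int) -> bool:
    """True if the process is currently in uninterruptible sleep ("D")."""
    stat_text = Path(f"/proc/{pid}/stat").read_text()
    return parse_proc_state(stat_text) == "D"
```

For the process above, `is_uninterruptible(16733)` would be expected to return True while the hang lasts.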

I'll run the test again and see whether the issue reproduces with the same backtrace.

Comment 2 Rajesh 2012-04-05 09:17:52 UTC
This deadlock is known to occur when the NFS client and server are on the same host.
Can you reproduce it with the two of them on separate machines?

If the same problem is seen, please provide the following:
1. Relevant logs (bricks, gNFS).
2. Stack traces of the bonnie++, gNFS, and brick processes.
3. Output of free -m and top.
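
Items 2 and 3 above could be gathered in one pass with something like the following. This is a rough sketch under stated assumptions: the function names and output file names are hypothetical, the PIDs must be supplied by the caller, reading /proc/<pid>/stack normally requires root, and log collection is left out because the log locations depend on the installation.

```python
import subprocess
from pathlib import Path

def read_kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack of a process, or a note if it cannot be read."""
    try:
        return Path(proc_root, str(pid), "stack").read_text()
    except OSError as exc:  # insufficient privileges, or the process has exited
        return f"<could not read stack for pid {pid}: {exc}>\n"

def run_command(argv):
    """Run a command and return its stdout, or a note if it cannot be started."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        return result.stdout
    except OSError as exc:
        return f"<could not run {argv[0]}: {exc}>\n"

def collect_diagnostics(pids, out_dir, proc_root="/proc"):
    """Write one stack file per PID plus free and top snapshots into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pid in pids:
        (out / f"stack-{pid}.txt").write_text(read_kernel_stack(pid, proc_root))
    (out / "free.txt").write_text(run_command(["free", "-m"]))
    # -b selects batch mode and -n 1 limits top to a single iteration.
    (out / "top.txt").write_text(run_command(["top", "-b", "-n", "1"]))
    return sorted(p.name for p in out.iterdir())
```

Calling `collect_diagnostics([bonnie_pid, gnfs_pid, brick_pid], "diag")` as root, with the PIDs taken from ps, would leave the requested files in a `diag` directory ready to attach to the bug.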

Comment 3 Rajesh 2012-04-25 05:39:40 UTC
With the client and server on different hosts, this issue is not seen.