Bug 808079 - Intermittent failures of bonnie and iozone on nfs
Status: CLOSED WORKSFORME
Product: GlusterFS
Classification: Community
Component: nfs
Version: pre-release
Hardware: x86_64 Linux
Priority: high
Severity: high
Assigned To: Rajesh
Depends On:
Blocks:
Reported: 2012-03-29 10:04 EDT by shylesh
Modified: 2015-12-01 11:45 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-25 01:39:40 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
Attachments: None
Description shylesh 2012-03-29 10:04:44 EDT
Description of problem:
bonnie and iozone tests hang intermittently on a stripe volume mounted over NFS.

Version-Release number of selected component (if applicable):
Mainline

How reproducible:


Steps to Reproduce:
1. Create a stripe volume and mount it over NFS.
2. Run the sanity tests.

  
Actual results:
Sometimes bonnie and iozone hang without completing the test. This behavior is not consistent. Once the NFS client hangs, the process cannot be killed, and dmesg reports that the NFS server is not responding. The machine has to be rebooted in order to reclaim the mount directory.

Expected results:


Additional info:


[2012-03-29 03:27:47.532074] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535522] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.535575] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537473] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.537526] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562032] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.562085] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565403] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.565454] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592455] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.592507] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620388] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-1: remote operation failed: No such file or directory
[2012-03-29 03:27:47.620458] W [client3_1-fops.c:593:client3_1_unlink_cbk] 0-stripe-client-2: remote operation failed: No such file or directory
[2012-03-29 03:31:52.334110] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.334657] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 68eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.336999] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.337023] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 69eacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338183] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338207] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6aeacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
[2012-03-29 03:31:52.338752] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: Unable to resolve FH: (172.17.251.89:833) stripe : a2014ca1-ed4f-4431-8229-2dfd9ebca2a7
[2012-03-29 03:31:52.338784] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 6beacaaa, ACCESS: NFS: 2(No such file or directory), POSIX: 14(Bad address)
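
For triage, the repeated warnings above can be condensed into counts per log level, function, and translator. The following is a minimal Python sketch, not part of the original report. The regular expression is an assumption based only on the line layout visible in this excerpt, and the names LOG_LINE, tally, and SAMPLE are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted in this report:
# [timestamp] LEVEL [file:line:function] domain: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<domain>[^:]+): (?P<msg>.*)$"
)


def tally(lines):
    """Count log lines per (level, function, domain); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            counts[(m["level"], m["func"], m["domain"])] += 1
    return counts


# Three lines copied from the log excerpt above.
SAMPLE = [
    "[2012-03-29 03:27:47.535522] W [client3_1-fops.c:593:client3_1_unlink_cbk] "
    "0-stripe-client-1: remote operation failed: No such file or directory",
    "[2012-03-29 03:27:47.535575] W [client3_1-fops.c:593:client3_1_unlink_cbk] "
    "0-stripe-client-2: remote operation failed: No such file or directory",
    "[2012-03-29 03:31:52.334110] E [nfs3.c:1516:nfs3_access_resume] 0-nfs-nfsv3: "
    "Unable to resolve FH: (172.17.251.89:833) stripe : "
    "a2014ca1-ed4f-4431-8229-2dfd9ebca2a7",
]

if __name__ == "__main__":
    for key, n in tally(SAMPLE).most_common():
        print(n, *key)
```

Feeding the full nfs.log through tally would show whether the unlink warnings and the file-handle resolution errors cluster around the time of a hang.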
Comment 1 shylesh 2012-03-30 09:35:02 EDT
I ran into this issue again. This time I captured the kernel stack trace of the gNFS process (/proc/<pid>/stack):

[root@QA-49 ~]# cat /proc/16733/stack 
[<ffffffffa03f9f24>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa0408f49>] nfs_commit_inode+0xa9/0x250 [nfs]
[<ffffffffa03f60c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a66b>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b37f>] shrink_zone+0x38f/0x520
[<ffffffff8112b60e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81123b9d>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff81158c7a>] alloc_pages_vma+0x9a/0x150
[<ffffffff81171bb5>] do_huge_pmd_anonymous_page+0x145/0x370
[<ffffffff8113c52a>] handle_mm_fault+0x25a/0x2b0
[<ffffffff81042b39>] __do_page_fault+0x139/0x480
[<ffffffff814f248e>] do_page_fault+0x3e/0xa0
[<ffffffff814ef845>] page_fault+0x25/0x30
[<ffffffff814260c9>] skb_copy_datagram_iovec+0x159/0x2c0
[<ffffffff81472235>] tcp_recvmsg+0xca5/0xe90
[<ffffffff8141c1b9>] sock_common_recvmsg+0x39/0x50
[<ffffffff8141bf51>] sock_aio_read+0x181/0x190
[<ffffffff8117619b>] do_sync_readv_writev+0xfb/0x140
[<ffffffff8117722f>] do_readv_writev+0xcf/0x1f0
[<ffffffff81177563>] vfs_readv+0x43/0x60
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

-----------------------------------------------------------

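From the above backtrace, it looks like a readv call reading network data caused the process to end up in "D" (uninterruptible sleep) state: the page fault taken during the copy triggered memory reclaim, which blocked in nfs_commit_inode while releasing an NFS page.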

[root@QA-49 ~]# ps uaxww | grep nfs
root     16733  0.9 10.0 537752 206564 ?       Dsl  Mar28  17:46 /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /etc/glusterd/nfs/run/nfs.pid -l /usr/local/var/log/glusterfs/nfs.log -S /tmp/77970b6a4863ead1be3b3b44b26f2dc5.socket


Since the process is in "D" state, gdb, strace, and the like cannot attach to it.

I'll try to run the test again and see if the issue reproduces with the same backtrace.
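
The backtrace reasoning in this comment can be checked mechanically on a later reproduction. The following is a minimal Python sketch, not part of the original report. It reads the state letter from /proc/<pid>/status text and the frame names from /proc/<pid>/stack text, then flags a trace in which an NFS client frame sits above try_to_free_pages, as in the trace posted here. The helper names (process_state, frame_functions, nfs_wait_during_reclaim) are hypothetical, and the heuristic is an assumption about what is worth flagging, not a definitive diagnosis.

```python
import re
from pathlib import Path

# Matches lines such as "[<ffffffffa0408f49>] nfs_commit_inode+0xa9/0x250 [nfs]".
FRAME = re.compile(r"^\[<[0-9a-f]+>\]\s+(?P<func>[^\s+]+)\+0x")


def process_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def frame_functions(stack_text):
    """Return function names from /proc/<pid>/stack text, innermost first."""
    funcs = []
    for line in stack_text.splitlines():
        m = FRAME.match(line.strip())
        if m:
            funcs.append(m["func"])
    return funcs


def nfs_wait_during_reclaim(funcs):
    """True if an nfs_* frame appears above (inside) try_to_free_pages."""
    if "try_to_free_pages" not in funcs:
        return False
    reclaim_at = funcs.index("try_to_free_pages")
    return any(f.startswith("nfs_") for f in funcs[:reclaim_at])


def inspect(pid):
    """Read a live process; /proc/<pid>/stack normally requires root."""
    state = process_state(Path(f"/proc/{pid}/status").read_text())
    funcs = frame_functions(Path(f"/proc/{pid}/stack").read_text())
    return state, nfs_wait_during_reclaim(funcs)


# A subset of the frames from the trace in this comment.
SAMPLE_STACK = """\
[<ffffffffa03f9f24>] nfs_wait_bit_killable+0x24/0x40 [nfs]
[<ffffffffa0408f49>] nfs_commit_inode+0xa9/0x250 [nfs]
[<ffffffffa03f60c6>] nfs_release_page+0x86/0xa0 [nfs]
[<ffffffff8110fe60>] try_to_release_page+0x30/0x60
[<ffffffff8112a2a1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112bc1d>] try_to_free_pages+0x9d/0x130
[<ffffffff81177691>] sys_readv+0x51/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    print(nfs_wait_during_reclaim(frame_functions(SAMPLE_STACK)))
```

On the affected host, inspect(16733) would be the equivalent call for the gNFS process shown in the ps output, assuming the PID is unchanged.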
Comment 2 Rajesh 2012-04-05 05:17:52 EDT
When the client and server are on the same host, this deadlock is known to occur.
Can you reproduce this with the two of them on separate machines?

If the same problem is seen, please provide the following:
1. relevant logs (bricks, gNFS)
2. stack trace of bonnie++, gNFS, and brick(s) processes.
3. free -m and top output.
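
The system-level items requested above can be gathered in one pass. The following is a minimal Python sketch, not part of the original comment. It saves /proc/<pid>/stack for each given PID plus the output of free -m and a single batch-mode top snapshot into a directory. The function names, the output file names, and the choice of top -b -n 1 are assumptions made for illustration. Log files are not collected here because their paths depend on the installation.

```python
import subprocess
from pathlib import Path


def diagnostic_commands():
    """Commands for memory and process snapshots (top in batch mode, one pass)."""
    return [["free", "-m"], ["top", "-b", "-n", "1"]]


def stack_paths(pids):
    """Kernel stack files for the given PIDs (bonnie++, gNFS, bricks)."""
    return [Path("/proc") / str(pid) / "stack" for pid in pids]


def collect(pids, out_dir):
    """Write stack traces and command output into out_dir; never raise on a
    missing process or command, record the error text instead."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in stack_paths(pids):
        try:
            text = path.read_text()  # normally requires root
        except OSError as exc:
            text = f"unreadable: {exc}\n"
        (out / f"stack-{path.parent.name}.txt").write_text(text)
    for argv in diagnostic_commands():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, check=False, timeout=30
            )
            text = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            text = f"failed: {exc}\n"
        (out / f"{argv[0]}.txt").write_text(text)
    return out
```

A call such as collect([16733], "/tmp/bug808079") would cover the gNFS process from the earlier ps output. The PIDs of bonnie++ and the brick processes would need to be added by hand.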
Comment 3 Rajesh 2012-04-25 01:39:40 EDT
With the client and server on different hosts, this issue is not seen.
