Bug 1203901 - NFS: IOZone tests hang, disconnects and hung tasks seen in logs.
Summary: NFS: IOZone tests hang, disconnects and hung tasks seen in logs.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Shyamsundar
QA Contact: Ben Turner
URL:
Whiteboard:
Duplicates: 1203850 (view as bug list)
Depends On:
Blocks: 1191838 1202842 1204580 1224016 1227167
 
Reported: 2015-03-19 22:17 UTC by Ben Turner
Modified: 2016-09-17 14:44 UTC
CC: 17 users

Fixed In Version: glusterfs-3.7.1-2
Doc Type: Bug Fix
Doc Text:
Previously, the Gluster NFS server could fail to process RPC requests because of deadlocks in the code. This occurred when NFS clients disconnected frequently after I/O operations, and it left the NFS clients or the mount point unresponsive. With this release, the deadlocks are resolved and NFS clients remain responsive.
Clone Of:
Clones: 1224016 (view as bug list)
Environment:
Last Closed: 2015-07-29 04:39:39 UTC
Target Upstream Version:


Attachments (Terms of Use)
SOS from node with NFS process at 100% CPU (7.30 MB, application/x-xz), 2015-04-08 18:09 UTC, Ben Turner
gstack from server (54.18 KB, text/plain), 2015-04-10 19:14 UTC, Ben Turner
sos from server with brick process at 100% (7.32 MB, application/x-xz), 2015-04-10 19:41 UTC, Ben Turner
statedump (468.90 KB, text/plain), 2015-04-10 19:44 UTC, Ben Turner
statedump from same server (468.90 KB, text/plain), 2015-04-10 19:44 UTC, Ben Turner
statedump 3 (31.28 KB, text/plain), 2015-04-10 20:18 UTC, Ben Turner
SD 4 (29.96 KB, text/plain), 2015-04-10 20:18 UTC, Ben Turner


Links
Red Hat Product Errata RHSA-2015:1495 (SHIPPED_LIVE): Important: Red Hat Gluster Storage 3.1 update, last updated 2015-07-29 08:26:26 UTC

Description Ben Turner 2015-03-19 22:17:03 UTC
Description of problem:

Running multi-node sequential writes from iozone over gNFS leads to disconnects and hung tasks:

Mar 19 18:03:43 gqac022 kernel: INFO: task iozone:16077 blocked for more than 120 seconds.
Mar 19 18:03:43 gqac022 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1
Mar 19 18:03:43 gqac022 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 18:03:43 gqac022 kernel: iozone        D 0000000000000008     0 16077      1 0x00000080
Mar 19 18:03:43 gqac022 kernel: ffff8806231bdcc8 0000000000000086 0000000000000000 ffffffffa0498940
Mar 19 18:03:43 gqac022 kernel: ffff8806231bdc38 ffffffff810149b9 ffff8806231bdc78 ffffffff810a70a1
Mar 19 18:03:43 gqac022 kernel: ffff880624317ab8 ffff8806231bdfd8 000000000000fbc8 ffff880624317ab8
Mar 19 18:03:43 gqac022 kernel: Call Trace:
Mar 19 18:03:43 gqac022 kernel: [<ffffffffa0498940>] ? rpc_execute+0x50/0xa0 [sunrpc]
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111f930>] ? sync_page+0x0/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff815280a3>] io_schedule+0x73/0xc0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111f96d>] sync_page+0x3d/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81528b6f>] __wait_on_bit+0x5f/0x90
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111fba3>] wait_on_page_bit+0x73/0x80
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81135bf5>] ? pagevec_lookup_tag+0x25/0x40
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111ffcb>] wait_on_page_writeback_range+0xfb/0x190
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81120198>] filemap_write_and_wait_range+0x78/0x90
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811baa3e>] vfs_fsync_range+0x7e/0x100
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811bab6e>] do_fsync+0x3e/0x60
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811babc0>] sys_fsync+0x10/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):

glusterfs-3.6.0.53

How reproducible:

Every run on this build.

Steps to Reproduce:
1. Run a multi-host iozone sequential write test from the clients against the gNFS mount (a representative invocation is sketched below).

Actual results:

Hung tasks and disconnects.

Expected results:

Normal operation.

Additional info:

Comment 1 Ben Turner 2015-03-19 22:20:14 UTC
Client log:

Mar 19 18:02:40 gqac022 kernel: nfs: server gqas014.sbu.lab.eng.bos.redhat.com not responding, still trying
Mar 19 18:02:40 gqac022 kernel: nfs: server gqas014.sbu.lab.eng.bos.redhat.com not responding, still trying
Mar 19 18:02:40 gqac022 kernel: nfs: server gqas014.sbu.lab.eng.bos.redhat.com not responding, still trying
Mar 19 18:03:43 gqac022 kernel: INFO: task iozone:16077 blocked for more than 120 seconds.
Mar 19 18:03:43 gqac022 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1
Mar 19 18:03:43 gqac022 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 18:03:43 gqac022 kernel: iozone        D 0000000000000008     0 16077      1 0x00000080
Mar 19 18:03:43 gqac022 kernel: ffff8806231bdcc8 0000000000000086 0000000000000000 ffffffffa0498940
Mar 19 18:03:43 gqac022 kernel: ffff8806231bdc38 ffffffff810149b9 ffff8806231bdc78 ffffffff810a70a1
Mar 19 18:03:43 gqac022 kernel: ffff880624317ab8 ffff8806231bdfd8 000000000000fbc8 ffff880624317ab8
Mar 19 18:03:43 gqac022 kernel: Call Trace:
Mar 19 18:03:43 gqac022 kernel: [<ffffffffa0498940>] ? rpc_execute+0x50/0xa0 [sunrpc]
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff810a70a1>] ? ktime_get_ts+0xb1/0xf0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111f930>] ? sync_page+0x0/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff815280a3>] io_schedule+0x73/0xc0
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111f96d>] sync_page+0x3d/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81528b6f>] __wait_on_bit+0x5f/0x90
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111fba3>] wait_on_page_bit+0x73/0x80
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81135bf5>] ? pagevec_lookup_tag+0x25/0x40
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8111ffcb>] wait_on_page_writeback_range+0xfb/0x190
Mar 19 18:03:43 gqac022 kernel: [<ffffffff81120198>] filemap_write_and_wait_range+0x78/0x90
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811baa3e>] vfs_fsync_range+0x7e/0x100
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811bab2d>] vfs_fsync+0x1d/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811bab6e>] do_fsync+0x3e/0x60
Mar 19 18:03:43 gqac022 kernel: [<ffffffff811babc0>] sys_fsync+0x10/0x20
Mar 19 18:03:43 gqac022 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Server nfs.log:
[2015-03-19 21:25:15.649714] I [dict.c:371:dict_get] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.so(io_stats_getxattr_cbk+0xf6) [0x7fe406e89a86] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(nfs_fop_getxattr_cbk+0x2f) [0x7fe406c3391f] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(acl3_default_getacl_cbk+0x67) [0x7fe406c5dc67]))) 0-dict: !this || key=system.posix_acl_default
[2015-03-19 21:25:15.650372] I [dict.c:371:dict_get] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.so(io_stats_getxattr_cbk+0xf6) [0x7fe406e89a86] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(nfs_fop_getxattr_cbk+0x2f) [0x7fe406c3391f] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(acl3_getacl_cbk+0x8d) [0x7fe406c5de5d]))) 0-dict: !this || key=system.posix_acl_access
[2015-03-19 21:28:31.349700] I [dict.c:371:dict_get] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.so(io_stats_getxattr_cbk+0xf6) [0x7fe406e89a86] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(nfs_fop_getxattr_cbk+0x2f) [0x7fe406c3391f] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(acl3_default_getacl_cbk+0x67) [0x7fe406c5dc67]))) 0-dict: !this || key=system.posix_acl_default
[2015-03-19 21:28:31.350218] I [dict.c:371:dict_get] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.so(io_stats_getxattr_cbk+0xf6) [0x7fe406e89a86] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(nfs_fop_getxattr_cbk+0x2f) [0x7fe406c3391f] (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so(acl3_getacl_cbk+0x8d) [0x7fe406c5de5d]))) 0-dict: !this || key=system.posix_acl_access
[2015-03-19 21:31:35.121338] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server 192.168.61.101:49152 has not responded in the last 42 seconds, disconnecting.
[2015-03-19 21:31:35.135105] E [rpc-clnt.c:362:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f) [0x7fe412599d3f] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91) [0x7fe412599771] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fe4125996be]))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2015-03-19 21:25:16.725164 (xid=0x63)
[2015-03-19 21:31:35.135141] W [client-rpc-fops.c:970:client3_3_fsync_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected
[2015-03-19 21:31:35.141353] W [afr-transaction.c:1710:afr_changelog_fsync_cbk] 0-testvol-replicate-0: fsync(a5db21e6-49e8-4e6d-996b-e9d2038701d9) failed on subvolume testvol-client-1. Transaction was WRITE
[2015-03-19 21:31:35.142058] W [client-rpc-fops.c:1804:client3_3_fxattrop_cbk] 0-testvol-client-0: remote operation failed: File descriptor in bad state

Comment 3 Ben Turner 2015-03-20 00:46:57 UTC
It looks like the NFS process on the back end is just spinning a CPU:

Tasks: 722 total,   4 running, 718 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.0%us,  6.2%sy,  0.0%ni, 83.2%id,  2.5%wa,  0.0%hi,  2.2%si,  0.0%st
Mem:  49409280k total, 49167472k used,   241808k free,     3180k buffers
Swap: 24772604k total,        0k used, 24772604k free, 46809564k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21458 root      20   0  907m  95m 3652 R 100.0  0.2  36:53.99 glusterfs                                                                     
22374 root      20   0 2129m  76m 3148 S 26.5  0.2  42:02.27 glusterfsd                                                                     
15030 root      20   0     0    0    0 S 26.2  0.0  11:44.83 xfsdatad/0                                                                     
22375 root      20   0 2129m  76m 3148 S 24.8  0.2  42:10.24 glusterfsd                                                                     
21434 root      20   0 2129m  76m 3148 R 23.2  0.2  44:31.31 glusterfsd                                                                     
21435 root      20   0 2129m  76m 3148 S 22.5  0.2  44:29.64 glusterfsd                                                                     
15778 root      20   0     0    0    0 S 22.2  0.0   0:32.77 flush-253:6                                                                    
15888 root      20   0 2129m  76m 3148 S 19.2  0.2   0:17.20 glusterfsd                                                                     
15886 root      20   0 2129m  76m 3148 S 18.5  0.2   0:17.15 glusterfsd                                                                     
15887 root      20   0 2129m  76m 3148 S 18.2  0.2   0:16.98 glusterfsd                                                                     
15780 root      20   0 2129m  76m 3148 S 17.9  0.2   0:52.42 glusterfsd                                                                     
15893 root      20   0 2129m  76m 3148 R 17.9  0.2   0:16.86 glusterfsd 

[root@gqas014 ~]# pstack 21458
Thread 1 (process 21458):
#0  0x0000003fb9e0c395 in ?? () from /lib64/libpthread.so.0
#1  0x00007f370530c0bf in event_slot_get () from /usr/lib64/libglusterfs.so.0
#2  0x00007f370530c624 in event_select_on_epoll () from /usr/lib64/libglusterfs.so.0
#3  0x00007f36fb0c3ef7 in socket_throttle () from /usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so
#4  0x00007f3705082d41 in rpcsvc_request_outstanding () from /usr/lib64/libgfrpc.so.0
#5  0x00007f3705082dcd in rpcsvc_request_destroy () from /usr/lib64/libgfrpc.so.0
#6  0x00007f3705083036 in rpcsvc_submit_generic () from /usr/lib64/libgfrpc.so.0
#7  0x00007f36f972f090 in nfs3svc_submit_reply () from /usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so
#8  0x00007f36f972fcd6 in nfs3_write_reply () from /usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so
#9  0x00007f36f973210d in nfs3svc_write_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so
#10 0x00007f36f972301b in nfs_fop_writev_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.so
#11 0x00007f36f9977d1e in io_stats_writev_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.so
#12 0x00007f36f9b8931c in wb_do_unwinds () from /usr/lib64/glusterfs/3.6.0.53/xlator/performance/write-behind.so
#13 0x00007f36f9b8ba7c in wb_process_queue () from /usr/lib64/glusterfs/3.6.0.53/xlator/performance/write-behind.so
#14 0x00007f36f9b8bc2f in wb_fulfill_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/performance/write-behind.so
#15 0x00007f36f9dd07f8 in dht_writev_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/distribute.so
#16 0x00007f36fa00ef01 in afr_writev_unwind () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#17 0x00007f36fa00ef8c in afr_transaction_writev_unwind () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#18 0x00007f36fa0130eb in afr_writev_done () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#19 0x00007f36fa0343b9 in afr_unlock_common_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#20 0x00007f36fa035dbe in afr_unlock_inodelk_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#21 0x00007f36fa0368ea in afr_unlock_inodelk () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#22 0x00007f36fa036c9b in afr_unlock () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#23 0x00007f36fa01aa8f in afr_changelog_post_op_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/cluster/replicate.so
#24 0x00007f36fa285331 in client3_3_fxattrop_cbk () from /usr/lib64/glusterfs/3.6.0.53/xlator/protocol/client.so
#25 0x00007f3705088895 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#26 0x00007f3705089d22 in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#27 0x00007f37050854f8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#28 0x00007f36fb0ca2fd in socket_event_poll_in () from /usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so
#29 0x00007f36fb0cbded in socket_event_handler () from /usr/lib64/glusterfs/3.6.0.53/rpc-transport/socket.so
#30 0x00007f370530c470 in event_dispatch_epoll_worker () from /usr/lib64/libglusterfs.so.0
#31 0x0000003fb9e079d1 in start_thread () from /lib64/libpthread.so.0
#32 0x0000003fb9ae88fd in clone () from /lib64/libc.so.6

What other info can I provide to help debug?

Comment 4 Niels de Vos 2015-03-23 08:36:31 UTC
(In reply to Ben Turner from comment #1)
...
> Server nfs.log:
> [2015-03-19 21:25:15.649714] I [dict.c:371:dict_get]
> (-->/usr/lib64/glusterfs/3.6.0.53/xlator/debug/io-stats.
> so(io_stats_getxattr_cbk+0xf6) [0x7fe406e89a86]
> (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.
> so(nfs_fop_getxattr_cbk+0x2f) [0x7fe406c3391f]
> (-->/usr/lib64/glusterfs/3.6.0.53/xlator/nfs/server.
> so(acl3_default_getacl_cbk+0x67) [0x7fe406c5dc67]))) 0-dict: !this ||
> key=system.posix_acl_default

These messages about the system.posix_acl xattrs have been reported as bug 1196057 (for 3.1) and are not related to this issue.

I wanted to have a look at the complete nfs.log on the nfs-server (gqas014.sbu.lab.eng.bos.redhat.com), but it seems the system has already been re-used for something else?

Have the logs been copied to another system so that I can inspect them? And what was the timezone of the client /var/log/messages from comment #1? I need to roughly match those times with the nfs.log, because the snippets show 18:02:40 (TZ?) vs 21:25:15 (UTC).

> 1.  Run multi host iozone sequential write test from a client.
Also, the steps to reproduce do not contain many details about the test itself. A pointer to the test case, and docs on how to run it, would be welcome.

Comment 7 Pavithra 2015-03-26 06:04:39 UTC
Hi Shyam,

I read in an email chain that this bug is a known issue for 3.0.4. Could you please fill out the doc text?

Comment 8 Pavithra 2015-04-01 09:40:47 UTC
Hi Ben,

Could you please review the text in the doc text field?
This bug will be documented as a known issue in the 3.0.4 Release Notes (republish).

Comment 9 Shyamsundar 2015-04-06 13:23:50 UTC
We could possibly state: "Multi _client_ sequential ..."; otherwise it looks fine.

Comment 10 Pavithra 2015-04-08 04:42:10 UTC
Incorporated Shyam's comment in the doc text field.

Comment 11 Ben Turner 2015-04-08 18:09:45 UTC
Created attachment 1012350 [details]
SOS from node with NFS process at 100% CPU

Comment 12 Ben Turner 2015-04-08 18:10:01 UTC
Snip from NFS.log:

[2015-04-07 05:51:39.407690] I [dht-layout.c:684:dht_layout_normalize] 0-testvol-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2015-04-07 05:51:43.267282] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-testvol-client-1: changing port to 49152 (from 0)
[2015-04-07 05:51:43.271365] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-testvol-client-2: changing port to 49152 (from 0)
[2015-04-07 05:51:43.275668] I [client-handshake.c:1412:select_server_supported_programs] 0-testvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-04-07 05:51:43.276132] I [client-handshake.c:1200:client_setvolume_cbk] 0-testvol-client-1: Connected to testvol-client-1, attached to remote volume '/bricks/testvol_brick1'.
[2015-04-07 05:51:43.276160] I [client-handshake.c:1210:client_setvolume_cbk] 0-testvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2015-04-07 05:51:43.276393] I [client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-1: Server lk version = 1
[2015-04-07 05:51:43.280068] I [client-handshake.c:1412:select_server_supported_programs] 0-testvol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-04-07 05:51:43.280434] I [client-handshake.c:1200:client_setvolume_cbk] 0-testvol-client-2: Connected to testvol-client-2, attached to remote volume '/bricks/testvol_brick2'.
[2015-04-07 05:51:43.280456] I [client-handshake.c:1210:client_setvolume_cbk] 0-testvol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-04-07 05:51:43.280718] I [client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-2: Server lk version = 1
[2015-04-07 05:51:44.804876] W [socket.c:546:__socket_rwv] 0-testvol-client-0: readv on 192.168.79.100:49152 failed (No data available)
[2015-04-07 05:51:44.804936] I [client.c:2282:client_rpc_notify] 0-testvol-client-0: disconnected from testvol-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2015-04-07 05:51:46.930456] W [socket.c:546:__socket_rwv] 0-testvol-client-3: readv on 192.168.79.103:49152 failed (No data available)
[2015-04-07 05:51:46.930506] I [client.c:2282:client_rpc_notify] 0-testvol-client-3: disconnected from testvol-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2015-04-07 05:51:46.930608] W [socket.c:546:__socket_rwv] 0-testvol-client-2: readv on 192.168.79.102:49152 failed (No data available)
[2015-04-07 05:51:46.930641] I [client.c:2282:client_rpc_notify] 0-testvol-client-2: disconnected from testvol-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2015-04-07 05:51:46.930664] E [MSGID: 108006] [afr-common.c:4527:afr_notify] 0-testvol-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2015-04-07 05:51:46.930709] W [socket.c:546:__socket_rwv] 0-testvol-client-1: readv on 192.168.79.101:49152 failed (No data available)
[2015-04-07 05:51:46.930792] I [client.c:2282:client_rpc_notify] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2015-04-07 05:51:46.930819] E [MSGID: 108006] [afr-common.c:4527:afr_notify] 0-testvol-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2015-04-07 05:51:58.235793] I [MSGID: 100030] [glusterfsd.c:2019:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.0.53 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/9ab9f5d0fa80dbb72e3d178d2fbef502.socket)
[2015-04-07 05:51:58.248977] I [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-04-07 05:51:58.253742] I [rpcsvc.c:2142:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 16
[2015-04-07 05:51:58.280618] I [rpc-drc.c:731:rpcsvc_drc_init] 0-rpc-service: DRC is turned OFF
[2015-04-07 05:51:58.280664] I [nfs.c:1354:init] 0-nfs: NFS service started
[2015-04-07 05:51:58.280931] I [dht-shared.c:337:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2015-04-07 05:51:58.284498] I [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2

Comment 13 Ben Turner 2015-04-10 19:14:08 UTC
netstat -na on client:

tcp        0 1829604 192.168.79.105:722          192.168.79.101:2049         ESTABLISHED 

On server:

tcp   2501184      0 192.168.79.103:49152        192.168.79.101:1016         ESTABLISHED 

On other server:
tcp   3731676      0 192.168.79.101:2049         192.168.79.105:722          ESTABLISHED 
tcp        0 1632237 192.168.79.101:1016         192.168.79.103:49152        FIN_WAIT1   

In logs:
[2015-04-10 06:40:17.983396] E [client-rpc-fops.c:1571:client3_3_finodelk_cbk] 0-testvol-client-3: remote operation failed: Invalid argument
[2015-04-10 06:40:17.983489] I [afr-lk-common.c:676:afr_unlock_inodelk_cbk] 0-testvol-replicate-1: (null): unlock failed on subvolume testvol-client-3 with lock owner 3c170240307f0000

I also attached a gstack from the server process that is pegging a CPU.

Comment 14 Ben Turner 2015-04-10 19:14:56 UTC
Created attachment 1013255 [details]
gstack from server

Comment 15 Ben Turner 2015-04-10 19:41:25 UTC
Created attachment 1013262 [details]
sos from server with brick process at 100%

Comment 16 Ben Turner 2015-04-10 19:44:24 UTC
Created attachment 1013263 [details]
statedump

Comment 17 Ben Turner 2015-04-10 19:44:58 UTC
Created attachment 1013264 [details]
statedump from same server

Comment 18 Ben Turner 2015-04-10 20:18:28 UTC
Created attachment 1013268 [details]
statedump 3

Comment 19 Ben Turner 2015-04-10 20:18:51 UTC
Created attachment 1013269 [details]
SD 4

Comment 30 Ben Turner 2015-07-01 16:30:21 UTC
Verified on glusterfs-3.7.1-6.el6rhs.x86_64

Comment 31 Ben Turner 2015-07-07 17:07:22 UTC
*** Bug 1203850 has been marked as a duplicate of this bug. ***

Comment 32 Bhavana 2015-07-15 06:44:59 UTC
Hi Shyam,

Can you please update the doc text in the following format:
_________________________________________________________

Cause:

Consequence:

Fix:

Result:

_______________________________________________________

Thanks,
Bhavana

Comment 33 Shyamsundar 2015-07-27 13:10:30 UTC
Updated the doc. text with the required information.

Comment 34 Bhavana 2015-07-28 05:49:14 UTC
The doc text is updated.

Comment 36 errata-xmlrpc 2015-07-29 04:39:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

