Description of problem:
volume type: 6x2
executed iozone over nfs mount

[root@bigbend1 ~]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 8861323a-fe95-4772-812e-254838ec18ad
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.155:/rhs/brick1/d1r1
Brick2: 10.70.37.100:/rhs/brick1/d1r2
Brick3: 10.70.37.121:/rhs/brick1/d2r1
Brick4: 10.70.37.211:/rhs/brick1/d2r2
Brick5: 10.70.37.155:/rhs/brick1/d3r1
Brick6: 10.70.37.100:/rhs/brick1/d3r2
Brick7: 10.70.37.121:/rhs/brick1/d4r1
Brick8: 10.70.37.211:/rhs/brick1/d4r2
Brick9: 10.70.37.155:/rhs/brick1/d5r1
Brick10: 10.70.37.100:/rhs/brick1/d5r2
Brick11: 10.70.37.121:/rhs/brick1/d6r1
Brick12: 10.70.37.211:/rhs/brick1/d6r2

[root@bigbend1 ~]# gluster volume status dist-rep
Status of volume: dist-rep
Gluster process                                          Port   Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.155:/rhs/brick1/d1r1                      49152  Y       2545
Brick 10.70.37.100:/rhs/brick1/d1r2                      49152  Y       27896
Brick 10.70.37.121:/rhs/brick1/d2r1                      49152  Y       27953
Brick 10.70.37.211:/rhs/brick1/d2r2                      49152  Y       27680
Brick 10.70.37.155:/rhs/brick1/d3r1                      49153  Y       2556
Brick 10.70.37.100:/rhs/brick1/d3r2                      49153  Y       27905
Brick 10.70.37.121:/rhs/brick1/d4r1                      49153  Y       27962
Brick 10.70.37.211:/rhs/brick1/d4r2                      49153  Y       27689
Brick 10.70.37.155:/rhs/brick1/d5r1                      49154  Y       2568
Brick 10.70.37.100:/rhs/brick1/d5r2                      49154  Y       27914
Brick 10.70.37.121:/rhs/brick1/d6r1                      49154  Y       27971
Brick 10.70.37.211:/rhs/brick1/d6r2                      49154  Y       27698
NFS Server on localhost                                  2049   Y       2585
Self-heal Daemon on localhost                            N/A    Y       2589
NFS Server on 60f1a154-825b-445a-95f8-79aecd8279c4       2049   Y       27983
Self-heal Daemon on 60f1a154-825b-445a-95f8-79aecd8279c4 N/A    Y       27987
NFS Server on 544d7afc-6979-4372-9f0c-c260727fd578       2049   Y       27708
Self-heal Daemon on 544d7afc-6979-4372-9f0c-c260727fd578 N/A    Y       27715
NFS Server on 527feaa6-4a22-4a17-847b-19d4005789cc       2049   Y       27924
Self-heal Daemon on 527feaa6-4a22-4a17-847b-19d4005789cc N/A    Y       27931

There are no active volume tasks

On the client side:
10.70.37.155:/dist-rep on /mnt/nfs-test type nfs (rw,vers=3,addr=10.70.37.155)

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.4rhs-1.el6rhs.x86_64

How reproducible:
Happened this time

Steps to Reproduce:
1. create volume, start volume
2. nfs mount
3. start "iozone -a" on mount point

Actual results:
nfs mount hung, iozone hung

NFS server log:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-05-08 09:35:47
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.4rhs

[2013-05-08 09:36:29.447550] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-dist-rep-client-0: server 10.70.37.155:49152 has not responded in the last 42 seconds, disconnecting.
[2013-05-08 09:36:54.454846] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-dist-rep-client-1: server 10.70.37.100:49152 has not responded in the last 42 seconds, disconnecting.
[2013-05-08 10:05:50.751829] E [rpc-clnt.c:207:call_bail] 0-dist-rep-client-0: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x500436x sent = 2013-05-08 09:35:47.440386. timeout = 1800
[2013-05-08 10:05:50.752009] W [client-handshake.c:276:client_ping_cbk] 0-dist-rep-client-0: timer must have expired
[2013-05-08 10:05:50.752072] E [rpc-clnt.c:207:call_bail] 0-dist-rep-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x500435x sent = 2013-05-08 09:35:46.858048. timeout = 1800
[2013-05-08 10:05:50.752110] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-dist-rep-client-0: remote operation failed: Transport endpoint is not connected
(END)

/var/log/core does not have any dumps.
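For reference, the reproduction steps can be sketched as a shell session. The volume name, replica count, brick paths, and mount point are taken from the "gluster volume info" and mount output above, but the exact "volume create" invocation is an assumption (the report does not show it), and the brick pairing is inferred from the d*r1/d*r2 naming:

```shell
# Assumed volume create command: 6x2 distributed-replicate, bricks
# listed in replica pairs (d1r1/d1r2, d2r1/d2r2, ...) as in the
# "gluster volume info" output above.
gluster volume create dist-rep replica 2 \
    10.70.37.155:/rhs/brick1/d1r1 10.70.37.100:/rhs/brick1/d1r2 \
    10.70.37.121:/rhs/brick1/d2r1 10.70.37.211:/rhs/brick1/d2r2 \
    10.70.37.155:/rhs/brick1/d3r1 10.70.37.100:/rhs/brick1/d3r2 \
    10.70.37.121:/rhs/brick1/d4r1 10.70.37.211:/rhs/brick1/d4r2 \
    10.70.37.155:/rhs/brick1/d5r1 10.70.37.100:/rhs/brick1/d5r2 \
    10.70.37.121:/rhs/brick1/d6r1 10.70.37.211:/rhs/brick1/d6r2
gluster volume start dist-rep

# On the client: NFSv3 mount matching the mount entry shown above.
mkdir -p /mnt/nfs-test
mount -t nfs -o vers=3 10.70.37.155:/dist-rep /mnt/nfs-test

# Run the full automatic iozone test suite on the mount point.
cd /mnt/nfs-test && iozone -a
```

These commands require a live gluster cluster and root on both server and client; they are an illustration of the steps, not a script taken from the report.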
On client:

[root@rhsauto020 ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
11827176   18881      11827175

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0  0%      800833 6%  896414 7%  1613257 13% 2331115 19% 0 0%
read       write      create     mkdir      symlink    mknod
290921 2%  387695 3%  896357 7%  904952 7%  0  0%      0  0%
remove     rmdir      rename     link       readdir    readdirplus
776507 6%  788259 6%  26  0%     0  0%      1035 0%    2138928 18%
fsstat     fsinfo     pathconf   commit
212  0%    18  0%     9  0%      629  0%

[root@rhsauto020 ~]# dmesg
(...last part glimpse)
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
(message repeated several more times)
INFO: task iozone:15595 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000000     0 15595  15576 0x00000080
 ffff880119337cc8 0000000000000082 0000000000000000 0003954793c03bea
 ffff880119337c38 ffff880119ad86e0 00000000082b3ee6 ffffffffae7c81fb
 ffff880119913058 ffff880119337fd8 000000000000fb88 ffff880119913058
Call Trace:
 [<ffffffff81119cd0>] ? sync_page+0x0/0x50
 [<ffffffff8150de73>] io_schedule+0x73/0xc0
 [<ffffffff81119d0d>] sync_page+0x3d/0x50
 [<ffffffff8150e82f>] __wait_on_bit+0x5f/0x90
 [<ffffffff81119f43>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096cc0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112ef45>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a36b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a538>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811b1a0e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811b1add>] vfs_fsync+0x1d/0x20
 [<ffffffff811b1b1e>] do_fsync+0x3e/0x60
 [<ffffffff811b1b70>] sys_fsync+0x10/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task iozone:15595 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(the same hung-task message and call trace repeat for iozone:15595)
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
(message repeated several more times)

[root@rhsauto020 ~]# ps -auxww | grep iozone
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root     15595  0.1  1.3  63500 52244 pts/1  D+  May08  1:09 /opt/qa/tools/iozone -a
root     16039  0.0  0.0 103236   836 pts/2  S+  11:11  0:00 grep iozone

Expected results:
iozone should finish execution, and if NFS crashes in this manner, it should dump core.

Additional info:
Found the same issue with the latest bigbend rpms, i.e. glusterfs-3.4.0.5rhs-1.el6rhs.x86_64:

nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
INFO: task iozone:5588 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000001     0  5588   1850 0x00000080
 ffff88007ccf3cc8 0000000000000086 0000000000000000 0000d3d00744655c
 ffff88007ccf3c38 ffff880037c57600 00000000002e898c ffffffffae748e09
 ffff88007a5f5098 ffff88007ccf3fd8 000000000000fb88 ffff88007a5f5098
Call Trace:
 [<ffffffff81119db0>] ? sync_page+0x0/0x50
 [<ffffffff8150e0f3>] io_schedule+0x73/0xc0
 [<ffffffff81119ded>] sync_page+0x3d/0x50
 [<ffffffff8150eaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffff8111a023>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112f025>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a44b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a618>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811b1aee>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811b1bbd>] vfs_fsync+0x1d/0x20
 [<ffffffff811b1bfe>] do_fsync+0x3e/0x60
 [<ffffffff811b1c50>] sys_fsync+0x10/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
(message repeated several more times)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html