Bug 961198 - nfs: iozone execution hits "D" state and nfs hangs with crash
Summary: nfs: iozone execution hits "D" state and nfs hangs with crash
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Vivek Agarwal
QA Contact: Saurabh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-09 05:43 UTC by Saurabh
Modified: 2016-02-18 00:03 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.4.0.8rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 22:39:41 UTC
Embargoed:


Attachments: None

Description Saurabh 2013-05-09 05:43:32 UTC
Description of problem:
Volume type: 6x2 (distributed-replicate)

Executed iozone over an NFS mount.

[root@bigbend1 ~]# gluster volume info
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 8861323a-fe95-4772-812e-254838ec18ad
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.155:/rhs/brick1/d1r1
Brick2: 10.70.37.100:/rhs/brick1/d1r2
Brick3: 10.70.37.121:/rhs/brick1/d2r1
Brick4: 10.70.37.211:/rhs/brick1/d2r2
Brick5: 10.70.37.155:/rhs/brick1/d3r1
Brick6: 10.70.37.100:/rhs/brick1/d3r2
Brick7: 10.70.37.121:/rhs/brick1/d4r1
Brick8: 10.70.37.211:/rhs/brick1/d4r2
Brick9: 10.70.37.155:/rhs/brick1/d5r1
Brick10: 10.70.37.100:/rhs/brick1/d5r2
Brick11: 10.70.37.121:/rhs/brick1/d6r1
Brick12: 10.70.37.211:/rhs/brick1/d6r2


[root@bigbend1 ~]# gluster volume status dist-rep
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.155:/rhs/brick1/d1r1			49152	Y	2545
Brick 10.70.37.100:/rhs/brick1/d1r2			49152	Y	27896
Brick 10.70.37.121:/rhs/brick1/d2r1			49152	Y	27953
Brick 10.70.37.211:/rhs/brick1/d2r2			49152	Y	27680
Brick 10.70.37.155:/rhs/brick1/d3r1			49153	Y	2556
Brick 10.70.37.100:/rhs/brick1/d3r2			49153	Y	27905
Brick 10.70.37.121:/rhs/brick1/d4r1			49153	Y	27962
Brick 10.70.37.211:/rhs/brick1/d4r2			49153	Y	27689
Brick 10.70.37.155:/rhs/brick1/d5r1			49154	Y	2568
Brick 10.70.37.100:/rhs/brick1/d5r2			49154	Y	27914
Brick 10.70.37.121:/rhs/brick1/d6r1			49154	Y	27971
Brick 10.70.37.211:/rhs/brick1/d6r2			49154	Y	27698
NFS Server on localhost					2049	Y	2585
Self-heal Daemon on localhost				N/A	Y	2589
NFS Server on 60f1a154-825b-445a-95f8-79aecd8279c4	2049	Y	27983
Self-heal Daemon on 60f1a154-825b-445a-95f8-79aecd8279c4	N/A	Y	27987
NFS Server on 544d7afc-6979-4372-9f0c-c260727fd578	2049	Y	27708
Self-heal Daemon on 544d7afc-6979-4372-9f0c-c260727fd578	N/A	Y	27715
NFS Server on 527feaa6-4a22-4a17-847b-19d4005789cc	2049	Y	27924
Self-heal Daemon on 527feaa6-4a22-4a17-847b-19d4005789cc	N/A	Y	27931
 
There are no active volume tasks


On the client side:

10.70.37.155:/dist-rep on /mnt/nfs-test type nfs (rw,vers=3,addr=10.70.37.155)


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.4rhs-1.el6rhs.x86_64

How reproducible:
Happened once so far.

Steps to Reproduce:
1. create volume, start volume
2. nfs mount
3. start "iozone -a" on the mount point (see the command sketch below)
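
A minimal reproduction sketch of the steps above, assuming the standard gluster CLI and reusing the hostnames, brick paths, and mount point quoted elsewhere in this report (adjust for the local setup):

# On one of the RHS servers: create and start the 6x2 distributed-replicate volume
gluster volume create dist-rep replica 2 \
    10.70.37.155:/rhs/brick1/d1r1 10.70.37.100:/rhs/brick1/d1r2 \
    10.70.37.121:/rhs/brick1/d2r1 10.70.37.211:/rhs/brick1/d2r2 \
    10.70.37.155:/rhs/brick1/d3r1 10.70.37.100:/rhs/brick1/d3r2 \
    10.70.37.121:/rhs/brick1/d4r1 10.70.37.211:/rhs/brick1/d4r2 \
    10.70.37.155:/rhs/brick1/d5r1 10.70.37.100:/rhs/brick1/d5r2 \
    10.70.37.121:/rhs/brick1/d6r1 10.70.37.211:/rhs/brick1/d6r2
gluster volume start dist-rep

# On the client: NFSv3 mount, then iozone in automatic mode
mount -t nfs -o vers=3 10.70.37.155:/dist-rep /mnt/nfs-test
cd /mnt/nfs-test && /opt/qa/tools/iozone -a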
  
Actual results:
The NFS mount hung and iozone hung. The glusterfs log contains the following crash header and messages:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-05-08 09:35:47
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.4rhs
[2013-05-08 09:36:29.447550] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-dist-rep-client-0: server 10.70.37.155:49152 has not responded in the last 42 seconds, disconnecting.
[2013-05-08 09:36:54.454846] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-dist-rep-client-1: server 10.70.37.100:49152 has not responded in the last 42 seconds, disconnecting.
[2013-05-08 10:05:50.751829] E [rpc-clnt.c:207:call_bail] 0-dist-rep-client-0: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x500436x sent = 2013-05-08 09:35:47.440386. timeout = 1800
[2013-05-08 10:05:50.752009] W [client-handshake.c:276:client_ping_cbk] 0-dist-rep-client-0: timer must have expired
[2013-05-08 10:05:50.752072] E [rpc-clnt.c:207:call_bail] 0-dist-rep-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x500435x sent = 2013-05-08 09:35:46.858048. timeout = 1800
[2013-05-08 10:05:50.752110] W [client-rpc-fops.c:866:client3_3_writev_cbk] 0-dist-rep-client-0: remote operation failed: Transport endpoint is not connected
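For context, the 42-second interval in the disconnect messages above matches the default GlusterFS client ping timeout (volume option network.ping-timeout). A minimal sketch for inspecting it and, purely while debugging, raising it; the value 60 is only an example:

# Non-default options currently set on the volume appear under "Options Reconfigured"
gluster volume info dist-rep

# Raise the ping timeout (default is 42 seconds)
gluster volume set dist-rep network.ping-timeout 60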

/var/log/core does not have any dumps.


On client,

[root@rhsauto020 ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
11827176   18881      11827175

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 800833    6% 896414    7% 1613257  13% 2331115  19% 0         0% 
read         write        create       mkdir        symlink      mknod        
290921    2% 387695    3% 896357    7% 904952    7% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
776507    6% 788259    6% 26        0% 0         0% 1035      0% 2138928  18% 
fsstat       fsinfo       pathconf     commit       
212       0% 18        0% 9         0% 629       0% 

[root@rhsauto020 ~]# dmesg    (last part of the output shown below)

nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
INFO: task iozone:15595 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000000     0 15595  15576 0x00000080
 ffff880119337cc8 0000000000000082 0000000000000000 0003954793c03bea
 ffff880119337c38 ffff880119ad86e0 00000000082b3ee6 ffffffffae7c81fb
 ffff880119913058 ffff880119337fd8 000000000000fb88 ffff880119913058
Call Trace:
 [<ffffffff81119cd0>] ? sync_page+0x0/0x50
 [<ffffffff8150de73>] io_schedule+0x73/0xc0
 [<ffffffff81119d0d>] sync_page+0x3d/0x50
 [<ffffffff8150e82f>] __wait_on_bit+0x5f/0x90
 [<ffffffff81119f43>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096cc0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112ef45>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a36b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a538>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811b1a0e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811b1add>] vfs_fsync+0x1d/0x20
 [<ffffffff811b1b1e>] do_fsync+0x3e/0x60
 [<ffffffff811b1b70>] sys_fsync+0x10/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task iozone:15595 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000000     0 15595  15576 0x00000080
 ffff880119337cc8 0000000000000082 0000000000000000 0003954793c03bea
 ffff880119337c38 ffff880119ad86e0 00000000082b3ee6 ffffffffae7c81fb
 ffff880119913058 ffff880119337fd8 000000000000fb88 ffff880119913058
Call Trace:
 [<ffffffff81119cd0>] ? sync_page+0x0/0x50
 [<ffffffff8150de73>] io_schedule+0x73/0xc0
 [<ffffffff81119d0d>] sync_page+0x3d/0x50
 [<ffffffff8150e82f>] __wait_on_bit+0x5f/0x90
 [<ffffffff81119f43>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096cc0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112ef45>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a36b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a538>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811b1a0e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811b1add>] vfs_fsync+0x1d/0x20
 [<ffffffff811b1b1e>] do_fsync+0x3e/0x60
 [<ffffffff811b1b70>] sys_fsync+0x10/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying


[root@rhsauto020 ~]# ps -auxww | grep iozone
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root     15595  0.1  1.3  63500 52244 pts/1    D+   May08   1:09 /opt/qa/tools/iozone -a
root     16039  0.0  0.0 103236   836 pts/2    S+   11:11   0:00 grep iozone
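
The "bad syntax" warning above only means BSD- and UNIX-style ps options were mixed; the output is still valid. A variant that avoids the warning and also keeps the grep process itself out of the listing:

ps auxww | grep '[i]ozone'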


Expected results:

iozone should finish execution, and if the NFS server crashes in this manner it should dump a core (see the core-dump sketch below).
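
Since /var/log/core stayed empty, a minimal sketch for verifying that the glusterfs processes on the RHS nodes are allowed to dump cores at all; DAEMON_COREFILE_LIMIT is the RHEL 6 initscript convention and is an assumption about how these daemons are started:

# Where the kernel writes core files, and whether a size limit suppresses them
cat /proc/sys/kernel/core_pattern
ulimit -c

# Allow unlimited-size cores for daemons started via the init scripts (takes effect on service restart)
echo 'DAEMON_COREFILE_LIMIT=unlimited' >> /etc/sysconfig/init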

Additional info:

Comment 3 Saurabh 2013-05-10 07:02:09 UTC
Found the same issue with the latest bigbend rpms, i.e. glusterfs-3.4.0.5rhs-1.el6rhs.x86_64:

nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
INFO: task iozone:5588 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
iozone        D 0000000000000001     0  5588   1850 0x00000080
 ffff88007ccf3cc8 0000000000000086 0000000000000000 0000d3d00744655c
 ffff88007ccf3c38 ffff880037c57600 00000000002e898c ffffffffae748e09
 ffff88007a5f5098 ffff88007ccf3fd8 000000000000fb88 ffff88007a5f5098
Call Trace:
 [<ffffffff81119db0>] ? sync_page+0x0/0x50
 [<ffffffff8150e0f3>] io_schedule+0x73/0xc0
 [<ffffffff81119ded>] sync_page+0x3d/0x50
 [<ffffffff8150eaaf>] __wait_on_bit+0x5f/0x90
 [<ffffffff8111a023>] wait_on_page_bit+0x73/0x80
 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
 [<ffffffff8112f025>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111a44b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8111a618>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811b1aee>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811b1bbd>] vfs_fsync+0x1d/0x20
 [<ffffffff811b1bfe>] do_fsync+0x3e/0x60
 [<ffffffff811b1c50>] sys_fsync+0x10/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying
nfs: server 10.70.37.155 not responding, still trying

Comment 7 Scott Haines 2013-09-23 22:39:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html


