Bug 1116761 - core: all brick processes crash when quota is enabled
Summary: core: all brick processes crash when quota is enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quota
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Nagaprasad Sathyanarayana
QA Contact: Saurabh
URL:
Whiteboard:
Depends On:
Blocks: 1118591
 
Reported: 2014-07-07 09:09 UTC by Saurabh
Modified: 2016-09-17 12:40 UTC
CC: 13 users

Fixed In Version: glusterfs-3.6.0.28-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1118591 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:43:53 UTC
Embargoed:


Attachments
coredump (2.60 MB, application/x-compressed-tar), attached 2014-07-07 09:27 UTC by Saurabh


Links
Red Hat Product Errata RHEA-2014:1278 (normal, SHIPPED_LIVE): Red Hat Storage Server 3.0 bug fix and enhancement update, last updated 2014-09-22 23:26:55 UTC

Description Saurabh 2014-07-07 09:09:02 UTC
Description of problem:
I just upgraded the RHS nodes to glusterfs-3.6.0.24-1.el6rhs.x86_64. Post upgrade, I mounted the volume on an NFS client and ran iozone on the mount point; iozone finished properly. Some time later, however, I found that the brick processes had crashed (quota was enabled after the iozone run), with this backtrace:

pending frames:
frame : type(0) op(0)
frame : type(0) op(1)
frame : type(0) op(27)
frame : type(0) op(27)
frame : type(0) op(40)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2014-07-06 22:05:46
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.24
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f16eb4d1e56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f16eb4ec28f]
/lib64/libc.so.6[0x3f4fa329a0]
/lib64/libc.so.6[0x3f4fa81461]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_loc_fill_from_name+0xa1)[0x7f16dbdf2651]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_readdir_cbk+0x2bf)[0x7f16dbdf628f]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x7f16eb4de0b2]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x7f16eb4de0b2]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir_cbk+0xc2)[0x7f16e0a17432]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f16e0e4f3c8]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_readdir+0x13)[0x7f16e0e4f603]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x7f16eb4d7013]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir+0x22d)[0x7f16e0a1991d]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x7f16eb4d7013]
/usr/lib64/libglusterfs.so.0(default_readdir_resume+0x142)[0x7f16eb4d9a02]
/usr/lib64/libglusterfs.so.0(call_resume+0x1b1)[0x7f16eb4f3631]
/usr/lib64/glusterfs/3.6.0.24/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f16e05f6348]
/lib64/libpthread.so.0[0x3f502079d1]
/lib64/libc.so.6(clone+0x6d)[0x3f4fae8b5d]
---------


gluster volume info

[root@nfs1 ~]# gluster volume info dist-rep
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 07f5f58d-83e3-4591-ba7f-e2473153e220
Status: Started
Snap Volume: no
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.62:/bricks/d1r1
Brick2: 10.70.37.215:/bricks/d1r2
Brick3: 10.70.37.44:/bricks/d2r1
Brick4: 10.70.37.201:/bricks/dr2r2
Brick5: 10.70.37.62:/bricks/d3r1
Brick6: 10.70.37.215:/bricks/d3r2
Brick7: 10.70.37.44:/bricks/d4r1
Brick8: 10.70.37.201:/bricks/dr4r2
Brick9: 10.70.37.62:/bricks/d5r1
Brick10: 10.70.37.215:/bricks/d5r2
Brick11: 10.70.37.44:/bricks/d6r1
Brick12: 10.70.37.201:/bricks/dr6r2
Brick13: 10.70.37.62:/bricks/d1r1-add
Brick14: 10.70.37.215:/bricks/d1r2-add
Options Reconfigured:
nfs-ganesha.enable: off
nfs-ganesha.host: 10.70.37.44
nfs.disable: off
performance.readdir-ahead: on
features.quota: on
features.quota-deem-statfs: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable


Version-Release number of selected component (if applicable):
glusterfs-3.6.0.24-1.el6rhs.x86_64

How reproducible:
Crash seen only once so far, but it affected all bricks.


Expected results:
crash is unexpected

Additional info:

Comment 1 Saurabh 2014-07-07 09:22:27 UTC
(gdb) bt
#0  0x0000003f4fa81461 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f16dbdf2651 in mq_loc_fill_from_name (this=0xb8be10, newloc=0x7f16bf9f89a0, oldloc=0xbad66c, ino=<value optimized out>, name=0x7f169804d938 "appletalk")
    at marker-quota.c:176
#2  0x00007f16dbdf628f in mq_readdir_cbk (frame=0x7f16ea14bba8, cookie=<value optimized out>, this=0xb8be10, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    entries=0x7f16bf9f8bb0, xdata=0x0) at marker-quota.c:609
#3  0x00007f16eb4de0b2 in default_readdir_cbk (frame=0x7f16ea3274e4, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, entries=<value optimized out>, 
    xdata=0x0) at defaults.c:1225
#4  0x00007f16eb4de0b2 in default_readdir_cbk (frame=0x7f16ea323c74, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, entries=<value optimized out>, 
    xdata=0x0) at defaults.c:1225
#5  0x00007f16e0a17432 in posix_acl_readdir_cbk (frame=0x7f16ea31d700, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, 
    entries=<value optimized out>, xdata=0x0) at posix-acl.c:1486
#6  0x00007f16e0e4f3c8 in posix_do_readdir (frame=0x7f16ea3276e8, this=<value optimized out>, fd=<value optimized out>, size=<value optimized out>, off=23, whichop=28, dict=0x0)
    at posix.c:4946
#7  0x00007f16e0e4f603 in posix_readdir (frame=<value optimized out>, this=<value optimized out>, fd=<value optimized out>, size=<value optimized out>, off=<value optimized out>, 
    xdata=<value optimized out>) at posix.c:4958
#8  0x00007f16eb4d7013 in default_readdir (frame=0x7f16ea3276e8, this=0xb83070, fd=0xbcecb0, size=4096, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2067
#9  0x00007f16e0a1991d in posix_acl_readdir (frame=0x7f16ea31d700, this=0xb85ea0, fd=0xbcecb0, size=4096, offset=0, xdata=0x0) at posix-acl.c:1500
#10 0x00007f16eb4d7013 in default_readdir (frame=0x7f16ea31d700, this=0xb87130, fd=0xbcecb0, size=4096, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2067
#11 0x00007f16eb4d9a02 in default_readdir_resume (frame=0x7f16ea323c74, this=0xb88350, fd=0xbcecb0, size=4096, off=0, xdata=0x0) at defaults.c:1635
#12 0x00007f16eb4f3631 in call_resume_wind (stub=0x7f16e9dc1f38) at call-stub.c:2492
#13 call_resume (stub=0x7f16e9dc1f38) at call-stub.c:2841
#14 0x00007f16e05f6348 in iot_worker (data=0xbba080) at io-threads.c:214
#15 0x0000003f502079d1 in start_thread () from /lib64/libpthread.so.0
#16 0x0000003f4fae8b5d in clone () from /lib64/libc.so.6

Comment 3 Saurabh 2014-07-07 09:27:43 UTC
Created attachment 916003 [details]
coredump

Comment 4 Saurabh 2014-07-07 09:29:41 UTC
further trace of bt,
(gdb) f 1
#1  0x00007f16dbdf2651 in mq_loc_fill_from_name (this=0xb8be10, newloc=0x7f16bf9f89a0, oldloc=0xbad66c, ino=<value optimized out>, name=0x7f169804d938 "appletalk")
    at marker-quota.c:176
176	        len = strlen (oldloc->path);
(gdb) list
171	        }
172	
173	        newloc->parent = inode_ref (oldloc->inode);
174	        uuid_copy (newloc->pargfid, oldloc->inode->gfid);
175	
176	        len = strlen (oldloc->path);
177	
178	        if (oldloc->path [len - 1] == '/')
179	                ret = gf_asprintf ((char **) &path, "%s%s",
180	                                   oldloc->path, name);
(gdb) p oldloc
$1 = (loc_t *) 0xbad66c
(gdb) p *$
$2 = {path = 0x0, name = 0x0, inode = 0x7f16d91760b4, parent = 0x7f16d90f4be0, gfid = "0\367H\216\361QF3\237\314\335\026\327\t\"p", 
  pargfid = "\037\062b<X\031Ej\232\035\000\346y\303\037\017"}
(gdb)
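
The loc_t dump above shows path = 0x0, so the strlen (oldloc->path) call at marker-quota.c:176 dereferences a NULL pointer and the brick process receives SIGSEGV. Purely as an illustration of the failure mode, and not the actual fix that shipped in glusterfs-3.6.0.28-1, a defensive guard around the lines quoted in the gdb listing might look like the sketch below. The calls inode_ref, uuid_copy and gf_asprintf and the variables ret, path and len come from the gdb listing; the "out" label, the error value and the else branch are assumptions.

        /* Sketch only, NOT the actual upstream patch: it shows how the NULL
         * dereference seen in frame #1 could be avoided inside
         * mq_loc_fill_from_name(). */

        newloc->parent = inode_ref (oldloc->inode);
        uuid_copy (newloc->pargfid, oldloc->inode->gfid);

        /* In the captured core oldloc->path is NULL, so strlen (oldloc->path)
         * faults. Bail out instead of dereferencing it. */
        if (!oldloc->path || !name) {
                ret = -1;                    /* assumed error convention */
                goto out;                    /* "out" label assumed to exist */
        }

        len = strlen (oldloc->path);

        if (oldloc->path[len - 1] == '/')
                ret = gf_asprintf ((char **) &path, "%s%s",
                                   oldloc->path, name);
        else                                 /* else branch assumed; not shown by gdb */
                ret = gf_asprintf ((char **) &path, "%s/%s",
                                   oldloc->path, name);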

Comment 5 Saurabh 2014-07-07 10:14:55 UTC
As per the discussion with other QE team members, just FYI: the issue is seen even with a fresh install of the latest build's RPMs (i.e., removing the earlier RPMs and then installing the new ones). So, altogether, what I want to point out is that the issue should not be considered a mere outcome of an RPM upgrade.

Comment 6 Saurabh 2014-07-09 10:58:53 UTC
Sosreports can be collected from here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1116761/

Comment 7 Saurabh 2014-07-14 11:32:51 UTC
As per the latest trials by the QE team, we have seen the issue every time when updating from the .22 RPM to the .24 RPM of RHS 3.0, on different setups of RHS nodes, and only for volumes that existed before the upgrade.
We then tried enabling quota on a volume created fresh after the upgrade, and the issue is not seen for that particular volume.

Also, we tried upgrading from an RHS 2.1 system to an RHS 3.0 system. The issue was not seen in this test either; that is, the brick processes did not crash.

Comment 9 Shruti Sampat 2014-07-21 11:40:00 UTC
Hi,

I saw this issue after updating from build .22 to build .24, even for a volume that was created after the upgrade. It was a distributed-replicate volume (2x2). I performed the following steps on the volume -

1. Started untar of Linux kernel at the mount of the volume (it was a fuse mount)
2. While the untar was going on, one of the brick processes was killed.
3. After a while, the tar jobs at the mount point were killed, and the killed bricks were brought back up. This caused self-heal to begin.
4. While self-heal was going on, quota was enabled on the volume.

After a while, 3 out of 4 bricks were found to have crashed. The following is from the brick logs -

---------------------------------------------------------------------------------

pending frames:
frame : type(0) op(0)
frame : type(0) op(33)
frame : type(0) op(33)
frame : type(0) op(29)
frame : type(0) op(40)
frame : type(0) op(11)
frame : type(0) op(29)
frame : type(0) op(25)
frame : type(0) op(40)
frame : type(0) op(33)
frame : type(0) op(27)
frame : type(0) op(29)
frame : type(0) op(33)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2014-07-21 03:52:24
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.24
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x381461fe56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x381463a28f]
/lib64/libc.so.6[0x307b2329a0]
/lib64/libc.so.6[0x307b281461]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_loc_fill_from_name+0xa1)[0x7f96f87a3651]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_readdir_cbk+0x2bf)[0x7f96f87a728f]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x381462c0b2]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x381462c0b2]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir_cbk+0xc2)[0x7f96f91ea432]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f96f96223c8]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_readdir+0x13)[0x7f96f9622603]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x3814625013]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir+0x22d)[0x7f96f91ec91d]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x3814625013]
/usr/lib64/libglusterfs.so.0(default_readdir_resume+0x142)[0x3814627a02]
/usr/lib64/libglusterfs.so.0(call_resume+0x1b1)[0x3814641631]
/usr/lib64/glusterfs/3.6.0.24/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f96f8dc9348]
/lib64/libpthread.so.0[0x307ba079d1]
/lib64/libc.so.6(clone+0x6d)[0x307b2e8b5d]
---------

# gluster volume info vol1
 
Volume Name: vol1
Type: Distributed-Replicate
Volume ID: c01d9e5c-3346-4ba8-9d7d-7964812a76ab
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.182:/rhs/brick1/b1
Brick2: 10.70.37.73:/rhs/brick1/b1
Brick3: 10.70.37.112:/rhs/brick1/b1
Brick4: 10.70.37.79:/rhs/brick1/b1
Options Reconfigured:
features.quota: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
cluster.server-quorum-ratio: 80%

Comment 14 Saurabh 2014-09-10 11:20:48 UTC
For verification purposes I tried two cases:

1. Upgraded from glusterfs-3.6.0.27-1.el6rhs.x86_64 to glusterfs-3.6.0.28-1.el6rhs.x86_64, enabled quota, set limits on dirs, and then executed some I/O on the mount point (NFS).

   Result: the crash mentioned in the description part of this BZ was not seen.

2. Installed a fresh ISO to create new VMs with glusterfs-3.6.0.28-1.el6rhs.x86_64 and executed the test case mentioned in comment 9.

   Result: the crash mentioned in this BZ was not seen.

Comment 16 errata-xmlrpc 2014-09-22 19:43:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

