Description of problem:
I upgraded the RHS nodes to glusterfs-3.6.0.24-1.el6rhs.x86_64. After the upgrade I mounted the volume on an NFS client and ran iozone on the mount point; iozone finished cleanly. I then enabled quota, and some time later found that the brick processes had crashed with the following backtrace:

pending frames:
frame : type(0) op(0)
frame : type(0) op(1)
frame : type(0) op(27)
frame : type(0) op(27)
frame : type(0) op(40)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-07-06 22:05:46
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.24
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7f16eb4d1e56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7f16eb4ec28f]
/lib64/libc.so.6[0x3f4fa329a0]
/lib64/libc.so.6[0x3f4fa81461]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_loc_fill_from_name+0xa1)[0x7f16dbdf2651]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_readdir_cbk+0x2bf)[0x7f16dbdf628f]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x7f16eb4de0b2]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x7f16eb4de0b2]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir_cbk+0xc2)[0x7f16e0a17432]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f16e0e4f3c8]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_readdir+0x13)[0x7f16e0e4f603]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x7f16eb4d7013]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir+0x22d)[0x7f16e0a1991d]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x7f16eb4d7013]
/usr/lib64/libglusterfs.so.0(default_readdir_resume+0x142)[0x7f16eb4d9a02]
/usr/lib64/libglusterfs.so.0(call_resume+0x1b1)[0x7f16eb4f3631]
/usr/lib64/glusterfs/3.6.0.24/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f16e05f6348]
/lib64/libpthread.so.0[0x3f502079d1]
/lib64/libc.so.6(clone+0x6d)[0x3f4fae8b5d]
---------

gluster volume info:

[root@nfs1 ~]# gluster volume info dist-rep

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 07f5f58d-83e3-4591-ba7f-e2473153e220
Status: Started
Snap Volume: no
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.62:/bricks/d1r1
Brick2: 10.70.37.215:/bricks/d1r2
Brick3: 10.70.37.44:/bricks/d2r1
Brick4: 10.70.37.201:/bricks/dr2r2
Brick5: 10.70.37.62:/bricks/d3r1
Brick6: 10.70.37.215:/bricks/d3r2
Brick7: 10.70.37.44:/bricks/d4r1
Brick8: 10.70.37.201:/bricks/dr4r2
Brick9: 10.70.37.62:/bricks/d5r1
Brick10: 10.70.37.215:/bricks/d5r2
Brick11: 10.70.37.44:/bricks/d6r1
Brick12: 10.70.37.201:/bricks/dr6r2
Brick13: 10.70.37.62:/bricks/d1r1-add
Brick14: 10.70.37.215:/bricks/d1r2-add
Options Reconfigured:
nfs-ganesha.enable: off
nfs-ganesha.host: 10.70.37.44
nfs.disable: off
performance.readdir-ahead: on
features.quota: on
features.quota-deem-statfs: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.24-1.el6rhs.x86_64

How reproducible:
The crash has been seen once so far, but it occurred on all bricks.

Expected results:
The brick processes should not crash.

Additional info:
(gdb) bt
#0  0x0000003f4fa81461 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f16dbdf2651 in mq_loc_fill_from_name (this=0xb8be10, newloc=0x7f16bf9f89a0, oldloc=0xbad66c, ino=<value optimized out>, name=0x7f169804d938 "appletalk") at marker-quota.c:176
#2  0x00007f16dbdf628f in mq_readdir_cbk (frame=0x7f16ea14bba8, cookie=<value optimized out>, this=0xb8be10, op_ret=<value optimized out>, op_errno=<value optimized out>, entries=0x7f16bf9f8bb0, xdata=0x0) at marker-quota.c:609
#3  0x00007f16eb4de0b2 in default_readdir_cbk (frame=0x7f16ea3274e4, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, entries=<value optimized out>, xdata=0x0) at defaults.c:1225
#4  0x00007f16eb4de0b2 in default_readdir_cbk (frame=0x7f16ea323c74, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, entries=<value optimized out>, xdata=0x0) at defaults.c:1225
#5  0x00007f16e0a17432 in posix_acl_readdir_cbk (frame=0x7f16ea31d700, cookie=<value optimized out>, this=<value optimized out>, op_ret=23, op_errno=0, entries=<value optimized out>, xdata=0x0) at posix-acl.c:1486
#6  0x00007f16e0e4f3c8 in posix_do_readdir (frame=0x7f16ea3276e8, this=<value optimized out>, fd=<value optimized out>, size=<value optimized out>, off=23, whichop=28, dict=0x0) at posix.c:4946
#7  0x00007f16e0e4f603 in posix_readdir (frame=<value optimized out>, this=<value optimized out>, fd=<value optimized out>, size=<value optimized out>, off=<value optimized out>, xdata=<value optimized out>) at posix.c:4958
#8  0x00007f16eb4d7013 in default_readdir (frame=0x7f16ea3276e8, this=0xb83070, fd=0xbcecb0, size=4096, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2067
#9  0x00007f16e0a1991d in posix_acl_readdir (frame=0x7f16ea31d700, this=0xb85ea0, fd=0xbcecb0, size=4096, offset=0, xdata=0x0) at posix-acl.c:1500
#10 0x00007f16eb4d7013 in default_readdir (frame=0x7f16ea31d700, this=0xb87130, fd=0xbcecb0, size=4096, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2067
#11 0x00007f16eb4d9a02 in default_readdir_resume (frame=0x7f16ea323c74, this=0xb88350, fd=0xbcecb0, size=4096, off=0, xdata=0x0) at defaults.c:1635
#12 0x00007f16eb4f3631 in call_resume_wind (stub=0x7f16e9dc1f38) at call-stub.c:2492
#13 call_resume (stub=0x7f16e9dc1f38) at call-stub.c:2841
#14 0x00007f16e05f6348 in iot_worker (data=0xbba080) at io-threads.c:214
#15 0x0000003f502079d1 in start_thread () from /lib64/libpthread.so.0
#16 0x0000003f4fae8b5d in clone () from /lib64/libc.so.6
Created attachment 916003: coredump
Going further into the backtrace:

(gdb) f 1
#1  0x00007f16dbdf2651 in mq_loc_fill_from_name (this=0xb8be10, newloc=0x7f16bf9f89a0, oldloc=0xbad66c, ino=<value optimized out>, name=0x7f169804d938 "appletalk") at marker-quota.c:176
176             len = strlen (oldloc->path);
(gdb) list
171     }
172
173     newloc->parent = inode_ref (oldloc->inode);
174     uuid_copy (newloc->pargfid, oldloc->inode->gfid);
175
176     len = strlen (oldloc->path);
177
178     if (oldloc->path [len - 1] == '/')
179             ret = gf_asprintf ((char **) &path, "%s%s",
180                                oldloc->path, name);
(gdb) p oldloc
$1 = (loc_t *) 0xbad66c
(gdb) p *$
$2 = {path = 0x0, name = 0x0, inode = 0x7f16d91760b4, parent = 0x7f16d90f4be0, gfid = "0\367H\216\361QF3\237\314\335\026\327\t\"p", pargfid = "\037\062b<X\031Ej\232\035\000\346y\303\037\017"}
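The gdb output above shows the root cause: oldloc->path is NULL (path = 0x0), so the strlen() call at marker-quota.c:176 dereferences a NULL pointer, which is the __strlen_sse2 frame at the top of the backtrace. Below is a minimal, self-contained C sketch of this failure mode and of one plausible defensive guard. It is not the actual marker-quota.c code: the loc_t layout is simplified, the _sketch function name is made up for illustration, and the guard itself is only an assumption about how such a crash could be avoided, not the fix that was actually shipped.

#define _GNU_SOURCE   /* for asprintf() on glibc */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Simplified stand-in for gluster's loc_t; field names follow the gdb dump. */
typedef struct {
        char *path;
        char *name;
} loc_t;

/* Hypothetical guarded version of the path-building logic around
 * marker-quota.c:176. The real code calls strlen (oldloc->path)
 * unconditionally; a loc whose path was never filled in has
 * path == NULL, which is exactly the state the gdb dump shows. */
static int
mq_loc_fill_from_name_sketch (loc_t *newloc, loc_t *oldloc, const char *name)
{
        char   *path = NULL;
        size_t  len  = 0;
        int     ret  = -1;

        if (!oldloc->path || !oldloc->path[0])
                return -1;      /* the missing check: nothing to append to */

        len = strlen (oldloc->path);

        if (oldloc->path[len - 1] == '/')
                ret = asprintf (&path, "%s%s", oldloc->path, name);
        else
                ret = asprintf (&path, "%s/%s", oldloc->path, name);

        if (ret < 0)
                return -1;

        newloc->path = path;
        return 0;
}

int
main (void)
{
        loc_t oldloc = { .path = NULL, .name = NULL };  /* mimics $2: path = 0x0 */
        loc_t newloc = { 0 };

        /* With the guard this returns -1; without it, strlen(NULL) would
         * segfault just as the bricks do in the reported backtrace. */
        if (mq_loc_fill_from_name_sketch (&newloc, &oldloc, "appletalk") < 0)
                fprintf (stderr, "oldloc has no path; skipping entry\n");

        free (newloc.path);
        return 0;
}

The guard only shows where the NULL dereference happens; whether marker should instead build the path from the gfid, or skip such entries during the quota crawl, is a design decision for the actual fix.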
As discussed with other QE team members, and just for information: the issue is seen even with a fresh install of the latest build's RPMs. By "fresh install" I mean removing the earlier RPMs and then installing the new ones. In other words, the issue should not be considered a mere outcome of an RPM upgrade.
Sosreports are available here:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1116761/
In the QE team's latest trials, the issue is reproduced every time on an update from the .22 RPM to the .24 RPM of RHS 3.0, across different RHS node setups, and only for volumes that existed before the upgrade. We then enabled quota on a volume created fresh after the upgrade, and the issue was not seen for that volume. We also tried an upgrade from an RHS 2.1 system to an RHS 3.0 system; the issue was not seen in that test either, i.e., the brick processes did not crash.
Hi,

I saw this issue after updating from build .22 to build .24, even for a volume that was created after the upgrade. It was a 2 x 2 distributed-replicate volume. I performed the following steps on the volume:

1. Started an untar of the Linux kernel at the mount point of the volume (a fuse mount).
2. While the untar was going on, killed one of the brick processes.
3. After a while, killed the tar jobs at the mount point and brought the killed brick back up, which caused self-heal to begin.
4. While self-heal was going on, enabled quota on the volume.

After a while, 3 out of 4 bricks were found to have crashed. The following is from the brick logs:

---------------------------------------------------------------------------------
pending frames:
frame : type(0) op(0)
frame : type(0) op(33)
frame : type(0) op(33)
frame : type(0) op(29)
frame : type(0) op(40)
frame : type(0) op(11)
frame : type(0) op(29)
frame : type(0) op(25)
frame : type(0) op(40)
frame : type(0) op(33)
frame : type(0) op(27)
frame : type(0) op(29)
frame : type(0) op(33)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
frame : type(0) op(29)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2014-07-21 03:52:24
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.24
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x381461fe56]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x381463a28f]
/lib64/libc.so.6[0x307b2329a0]
/lib64/libc.so.6[0x307b281461]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_loc_fill_from_name+0xa1)[0x7f96f87a3651]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/marker.so(mq_readdir_cbk+0x2bf)[0x7f96f87a728f]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x381462c0b2]
/usr/lib64/libglusterfs.so.0(default_readdir_cbk+0xc2)[0x381462c0b2]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir_cbk+0xc2)[0x7f96f91ea432]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f96f96223c8]
/usr/lib64/glusterfs/3.6.0.24/xlator/storage/posix.so(posix_readdir+0x13)[0x7f96f9622603]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x3814625013]
/usr/lib64/glusterfs/3.6.0.24/xlator/features/access-control.so(posix_acl_readdir+0x22d)[0x7f96f91ec91d]
/usr/lib64/libglusterfs.so.0(default_readdir+0x83)[0x3814625013]
/usr/lib64/libglusterfs.so.0(default_readdir_resume+0x142)[0x3814627a02]
/usr/lib64/libglusterfs.so.0(call_resume+0x1b1)[0x3814641631]
/usr/lib64/glusterfs/3.6.0.24/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f96f8dc9348]
/lib64/libpthread.so.0[0x307ba079d1]
/lib64/libc.so.6(clone+0x6d)[0x307b2e8b5d]
---------

# gluster volume info vol1

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: c01d9e5c-3346-4ba8-9d7d-7964812a76ab
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.182:/rhs/brick1/b1
Brick2: 10.70.37.73:/rhs/brick1/b1
Brick3: 10.70.37.112:/rhs/brick1/b1
Brick4: 10.70.37.79:/rhs/brick1/b1
Options Reconfigured:
features.quota: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
cluster.server-quorum-ratio: 80%
For verification I tried two cases:

1. Upgraded from glusterfs-3.6.0.27-1.el6rhs.x86_64 to glusterfs-3.6.0.28-1.el6rhs.x86_64, enabled quota, set limits on directories, and then ran some I/O on the mount point (NFS).
Result: the crash mentioned in the description of this BZ was not seen.

2. Installed a fresh ISO to create new VMs with glusterfs-3.6.0.28-1.el6rhs.x86_64 and executed the test case mentioned in comment 9.
Result: the crash mentioned in this BZ was not seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html