+++ This bug was initially created as a clone of Bug #1200677 +++

Description of problem:
Crash and core dump observed during disk replacement.

Version-Release number of selected component (if applicable):
[root@rhsauto024 tmp]# rpm -qa | grep glusterfs
glusterfs-libs-3.6.0.50-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-devel-3.6.0.50-1.el6rhs.x86_64
glusterfs-api-3.6.0.50-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.50-1.el6rhs.x86_64
glusterfs-server-3.6.0.50-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.50-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.50-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.50-1.el6rhs.x86_64
glusterfs-3.6.0.50-1.el6rhs.x86_64
glusterfs-cli-3.6.0.50-1.el6rhs.x86_64

How reproducible:
1/1

Test steps:
====================
1. Create a 1 x 2 replicate volume. Start the volume.
2. Create a FUSE mount. Create files and directories from the mount.
3. Bring down brick2 and simulate a disk replacement: kill the brick2 process and remove the contents of brick2, including the ".glusterfs" directory.
4. Bring brick2 back up.
5. Add an iptables rule on brick1 to block incoming traffic to the brick1 port, so that the mount process disconnects from brick1 (simulating a network disconnection).
6. Create files and directories from the mount.
7. Remove the iptables rule to unblock the brick1 port so that the mount reconnects.

Actual results:
Crash and core dump observed.

Expected results:

Additional info:

--- Additional comment from RHEL Product and Program Management on 2015-03-11 03:32:37 EDT ---

Since this issue was entered in bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.
--- Additional comment from Anil Shah on 2015-03-11 04:41:04 EDT ---

Logs uploaded at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1200677/

--- Additional comment from Ravishankar N on 2015-03-11 05:20:16 EDT ---

Note to self: Logged into rhsauto024 and looked at the core:

gdb glusterfsd core.dump.PID\=18404UID\=0

(gdb) bt
#0  0x00007fa15055743d in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7fa15f927cc4, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=<value optimized out>, op_errno=<value optimized out>, preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:918
#1  0x00007fa1507a2d64 in client3_3_setattr (frame=0x7fa15faefd30, this=<value optimized out>, data=<value optimized out>) at client-rpc-fops.c:5906
#2  0x00007fa1507995d9 in client_setattr (frame=0x7fa15faefd30, this=0x7fa14c00b8f0, loc=<value optimized out>, stbuf=<value optimized out>, valid=<value optimized out>, xdata=<value optimized out>) at client.c:1999
#3  0x00007fa150556090 in afr_sh_entry_impunge_setattr (impunge_frame=0x7fa15f922de0, this=<value optimized out>) at afr-self-heal-entry.c:970
#4  0x00007fa150556603 in afr_sh_entry_impunge_xattrop_cbk (impunge_frame=0x7fa15f922de0, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=0, op_errno=0, xattr=<value optimized out>, xdata=0x0) at afr-self-heal-entry.c:1030
#5  0x00007fa1507ad1b9 in client3_3_xattrop_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7fa15faef7d0) at client-rpc-fops.c:1740
#6  0x00007fa1618778c5 in rpc_clnt_handle_reply (clnt=0x7fa14c0713c0, pollin=0x7fa14c001430) at rpc-clnt.c:763
#7  0x00007fa161878d52 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7fa14c0713f0, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:891
#8  0x00007fa161874528 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:539
#9  0x00007fa1529f333d in socket_event_poll_in (this=0x7fa14c080fc0) at socket.c:2171
#10 0x00007fa1529f4e2d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7fa14c080fc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2284
#11 0x00007fa161afc4a0 in event_dispatch_epoll_handler (data=0x7fa162c08310) at event-epoll.c:572
#12 event_dispatch_epoll_worker (data=0x7fa162c08310) at event-epoll.c:674
#13 0x00007fa1612339d1 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fa160b9d8fd in clone () from /lib64/libc.so.6

(gdb) f 0
#0  0x00007fa15055743d in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7fa15f927cc4, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=<value optimized out>, op_errno=<value optimized out>, preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:918
918             gf_log (this->name, GF_LOG_INFO,
(gdb) l
913             int call_count = 0;
914             afr_local_t *setattr_local = NULL;
915
916             setattr_local = setattr_frame->local;
917             if (op_ret != 0) {
918                     gf_log (this->name, GF_LOG_INFO,
919                             "setattr on parent directory (%s) failed: %s",
920                             setattr_local->loc.path, strerror (op_errno));
921             }
922
(gdb) p setattr_local->loc
Cannot access memory at address 0x440
(gdb) p setattr_local
$1 = (afr_local_t *) 0x0
(gdb) f 3
#3  0x00007fa150556090 in afr_sh_entry_impunge_setattr (impunge_frame=0x7fa15f922de0, this=<value optimized out>) at afr-self-heal-entry.c:970
970             STACK_WIND_COOKIE (setattr_frame,
(gdb) l
965             setattr_local->call_count = call_count;
966             for (i = 0; i < priv->child_count; i++) {
967                     if (impunge_sh->child_errno[i])
968                             continue;
969                     valid = GF_SET_ATTR_ATIME | GF_SET_ATTR_MTIME;
970                     STACK_WIND_COOKIE (setattr_frame,
971                                        afr_sh_entry_impunge_parent_setattr_cbk,
972                                        (void *) (long) i, priv->children[i],
973                                        priv->children[i]->fops->setattr,
974                                        &setattr_local->loc,
(gdb)
975                                        &impunge_sh->parentbuf, valid, NULL);
976
977                     valid = GF_SET_ATTR_UID | GF_SET_ATTR_GID |
978                             GF_SET_ATTR_ATIME | GF_SET_ATTR_MTIME;
979                     STACK_WIND_COOKIE (impunge_frame,
980                                        afr_sh_entry_impunge_setattr_cbk,
981                                        (void *) (long) i, priv->children[i],
982                                        priv->children[i]->fops->setattr,
983                                        &impunge_local->loc,
984                                        &impunge_sh->entrybuf, valid, NULL);
(gdb)
985                     call_count--;
986             }
987             GF_ASSERT (!call_count);
988             return 0;
989     out:
990             if (setattr_frame)
991                     AFR_STACK_DESTROY (setattr_frame);
992             afr_sh_entry_call_impunge_done (impunge_frame, this, 0, op_errno);
993             return 0;
994     }
(gdb) p impunge_sh->child_errno[0]
$2 = 2
(gdb) p impunge_sh->child_errno[1]
$3 = 0
(gdb) impunge_sh->child_errno[2]
Undefined command: "impunge_sh->child_errno".  Try "help".
(gdb) p impunge_sh->child_errno[2]
$4 = 0
(gdb) p call_count
$5 = -1
(gdb)
REVIEW: http://review.gluster.org/9856 (afr: exit out of stack winds in for loops if call_count is zero) posted (#1) for review on release-3.5 by Ravishankar N (ravishankar)
Ravi, this bug is marked private (it has some non-public groups set). If this bug can be made public, uncheck all the groups on the right of the attachment table.
Done! Thanks Niels, I was wondering why http://build.gluster.org/job/compare-bug-version-and-git-branch/3898/console said the bug doesn't belong to glusterfs. Perhaps this was the reason.
COMMIT: http://review.gluster.org/9856 committed in release-3.5 by Niels de Vos (ndevos)

------

commit 147b3871180a699a642767d0cc0ea00fa69a33c8
Author: Ravishankar N <ravishankar>
Date:   Wed Mar 11 16:41:06 2015 +0530

    afr: exit out of stack winds in for loops if call_count is zero

    ....in order to avoid a race where the fop cbk frees the frame's
    local variables and the fop tries to access it at a later point
    in time.

    Change-Id: I91d2696e5e183c61ea1368b3a538f9ed7f3851de
    BUG: 1200764
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/9856
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: pranith karampuri <pranith.k>
    Reviewed-by: Niels de Vos <ndevos>
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.5.4, please reopen this bug report.

glusterfs-3.5.4 has been announced on the Gluster Packaging mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/2
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user