Bug 1200764 - [AFR] Core dump and crash observed during disk replacement case
Summary: [AFR] Core dump and crash observed during disk replacement case
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.5.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1200677
 
Reported: 2015-03-11 11:06 UTC by Ravishankar N
Modified: 2015-06-03 21:09 UTC
4 users

Fixed In Version: glusterfs-3.5.4
Doc Type: Bug Fix
Doc Text:
Clone Of: 1200677
Environment:
Last Closed: 2015-06-03 21:09:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Ravishankar N 2015-03-11 11:06:49 UTC
+++ This bug was initially created as a clone of Bug #1200677 +++

Description of problem:

Crash and core dump observed during disk replacement

Version-Release number of selected component (if applicable):

[root@rhsauto024 tmp]# rpm -qa | grep glusterfs
glusterfs-libs-3.6.0.50-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-devel-3.6.0.50-1.el6rhs.x86_64
glusterfs-api-3.6.0.50-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.50-1.el6rhs.x86_64
glusterfs-server-3.6.0.50-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.50-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.50-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.50-1.el6rhs.x86_64
glusterfs-3.6.0.50-1.el6rhs.x86_64
glusterfs-cli-3.6.0.50-1.el6rhs.x86_64


How reproducible:

1/1

Test steps:
====================
1. Create a 1 x 2 replicate volume and start it.

2. Create a FUSE mount. Create files and directories from the mount.

3. Bring down brick2 and simulate a disk replacement (kill the brick2 process and remove the contents of brick2, including the ".glusterfs" directory).

4. Bring brick2 back up.

5. Add an iptables rule on brick1 to block incoming traffic to brick1's port so that the mount process disconnects from brick1 (simulating a network disconnection).

6. Create files and directories from the mount.

7. Remove the iptables rule to unblock brick1's port so that the mount reconnects.
 

Actual results:

Crash and core dump observed.

Expected results:

No crash or core dump.



Additional info:

--- Additional comment from RHEL Product and Program Management on 2015-03-11 03:32:37 EDT ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Anil Shah on 2015-03-11 04:41:04 EDT ---

logs uploaded at location
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1200677/

--- Additional comment from Ravishankar N on 2015-03-11 05:20:16 EDT ---

Note to self:
Logged into rhsauto024 and looked at the core:
gdb  glusterfsd core.dump.PID\=18404UID\=0
(gdb) bt
#0  0x00007fa15055743d in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7fa15f927cc4, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:918
#1  0x00007fa1507a2d64 in client3_3_setattr (frame=0x7fa15faefd30, this=<value optimized out>, data=<value optimized out>) at client-rpc-fops.c:5906
#2  0x00007fa1507995d9 in client_setattr (frame=0x7fa15faefd30, this=0x7fa14c00b8f0, loc=<value optimized out>, stbuf=<value optimized out>, valid=<value optimized out>, xdata=<value optimized out>)
    at client.c:1999
#3  0x00007fa150556090 in afr_sh_entry_impunge_setattr (impunge_frame=0x7fa15f922de0, this=<value optimized out>) at afr-self-heal-entry.c:970
#4  0x00007fa150556603 in afr_sh_entry_impunge_xattrop_cbk (impunge_frame=0x7fa15f922de0, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=0, op_errno=0, xattr=<value optimized out>, xdata=0x0)
    at afr-self-heal-entry.c:1030
#5  0x00007fa1507ad1b9 in client3_3_xattrop_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7fa15faef7d0) at client-rpc-fops.c:1740
#6  0x00007fa1618778c5 in rpc_clnt_handle_reply (clnt=0x7fa14c0713c0, pollin=0x7fa14c001430) at rpc-clnt.c:763
#7  0x00007fa161878d52 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7fa14c0713f0, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:891
#8  0x00007fa161874528 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:539
#9  0x00007fa1529f333d in socket_event_poll_in (this=0x7fa14c080fc0) at socket.c:2171
#10 0x00007fa1529f4e2d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7fa14c080fc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2284
#11 0x00007fa161afc4a0 in event_dispatch_epoll_handler (data=0x7fa162c08310) at event-epoll.c:572
#12 event_dispatch_epoll_worker (data=0x7fa162c08310) at event-epoll.c:674
#13 0x00007fa1612339d1 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fa160b9d8fd in clone () from /lib64/libc.so.6

(gdb) f 0
#0  0x00007fa15055743d in afr_sh_entry_impunge_parent_setattr_cbk (setattr_frame=0x7fa15f927cc4, cookie=<value optimized out>, this=0x7fa14c00de40, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    preop=<value optimized out>, postop=0x0, xdata=0x0) at afr-self-heal-entry.c:918
918                     gf_log (this->name, GF_LOG_INFO,
(gdb) l
913             int             call_count = 0;
914             afr_local_t     *setattr_local = NULL;
915
916             setattr_local = setattr_frame->local;
917             if (op_ret != 0) {
918                     gf_log (this->name, GF_LOG_INFO,
919                             "setattr on parent directory (%s) failed: %s",
920                             setattr_local->loc.path, strerror (op_errno));
921             }
922
(gdb) p setattr_local->loc
Cannot access memory at address 0x440
(gdb) p setattr_local
$1 = (afr_local_t *) 0x0
(gdb) f 3
#3  0x00007fa150556090 in afr_sh_entry_impunge_setattr (impunge_frame=0x7fa15f922de0, this=<value optimized out>) at afr-self-heal-entry.c:970
970                     STACK_WIND_COOKIE (setattr_frame,
(gdb) l
965             setattr_local->call_count = call_count;
966             for (i = 0; i < priv->child_count; i++) {
967                     if (impunge_sh->child_errno[i])
968                             continue;
969                     valid         = GF_SET_ATTR_ATIME | GF_SET_ATTR_MTIME;
970                     STACK_WIND_COOKIE (setattr_frame,
971                                        afr_sh_entry_impunge_parent_setattr_cbk,
972                                        (void *) (long) i, priv->children[i],
973                                        priv->children[i]->fops->setattr,
974                                        &setattr_local->loc,
(gdb) 
975                                        &impunge_sh->parentbuf, valid, NULL);
976
977                     valid = GF_SET_ATTR_UID   | GF_SET_ATTR_GID |
978                             GF_SET_ATTR_ATIME | GF_SET_ATTR_MTIME;
979                     STACK_WIND_COOKIE (impunge_frame,
980                                        afr_sh_entry_impunge_setattr_cbk,
981                                        (void *) (long) i, priv->children[i],
982                                        priv->children[i]->fops->setattr,
983                                        &impunge_local->loc,
984                                        &impunge_sh->entrybuf, valid, NULL);
(gdb) 
985                     call_count--;
986             }
987             GF_ASSERT (!call_count);
988             return 0;
989     out:
990             if (setattr_frame)
991                     AFR_STACK_DESTROY (setattr_frame);
992             afr_sh_entry_call_impunge_done (impunge_frame, this, 0, op_errno);
993             return 0;
994     }
(gdb) p impunge_sh->child_errno[0]
$2 = 2
(gdb) p impunge_sh->child_errno[1]
$3 = 0
(gdb) impunge_sh->child_errno[2]
Undefined command: "impunge_sh->child_errno".  Try "help".
(gdb) p impunge_sh->child_errno[2]
$4 = 0
(gdb) p call_count
$5 = -1
(gdb)

Comment 1 Anand Avati 2015-03-11 11:17:11 UTC
REVIEW: http://review.gluster.org/9856 (afr: exit out of stack winds in for loops if call_count is zero) posted (#1) for review on release-3.5 by Ravishankar N (ravishankar@redhat.com)

Comment 2 Niels de Vos 2015-03-11 13:18:48 UTC
Ravi, this bug is marked private (it has some non-public groups set). If this bug can be made public, uncheck all the groups on the right of the attachment table.

Comment 3 Ravishankar N 2015-03-11 13:59:40 UTC
Done! Thanks Niels, I was wondering why http://build.gluster.org/job/compare-bug-version-and-git-branch/3898/console said the bug doesn't belong to glusterfs. Perhaps this was the reason.

Comment 4 Anand Avati 2015-03-12 04:14:47 UTC
COMMIT: http://review.gluster.org/9856 committed in release-3.5 by Niels de Vos (ndevos@redhat.com) 
------
commit 147b3871180a699a642767d0cc0ea00fa69a33c8
Author: Ravishankar N <ravishankar@redhat.com>
Date:   Wed Mar 11 16:41:06 2015 +0530

    afr: exit out of stack winds in for loops if call_count is zero
    
    ....in order to avoid a race where the fop cbk frees the frame's local
    variables and the fop tries to access it at a later point in time.
    
    Change-Id: I91d2696e5e183c61ea1368b3a538f9ed7f3851de
    BUG: 1200764
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    Reviewed-on: http://review.gluster.org/9856
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: pranith karampuri <pranith.k@gmail.com>
    Reviewed-by: Niels de Vos <ndevos@redhat.com>

Comment 5 Niels de Vos 2015-06-03 21:09:17 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.4, please reopen this bug report.

glusterfs-3.5.4 has been announced on the Gluster Packaging mailing list [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/2
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

