Bug 800352 - [glusterfs-3.3.0qa25]: glustershd crashed in afr_start_crawl
Summary: [glusterfs-3.3.0qa25]: glustershd crashed in afr_start_crawl
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 817967
Reported: 2012-03-06 11:39 UTC by Raghavendra Bhat
Modified: 2015-12-01 16:45 UTC (History)
CC List: 1 user

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:19:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Raghavendra Bhat 2012-03-06 11:39:36 UTC
Description of problem:
2x2 distributed-replicate volume, with 1 FUSE and 1 NFS client. On the FUSE client, rdd, ping_pong, fs-perf-test, and threaded-io were running in a loop. A brick was brought down and the self-heal-daemon was turned off. After some time the brick was brought back up, the self-heal-daemon was turned on, and a full volume heal was started. glustershd crashed in afr_start_crawl (ping_pong was running on the client at the time of the crash).

This is the backtrace of the core.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL, process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, 
    exclusive=_gf_true, crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL, process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, 
    exclusive=_gf_true, crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
#1  0x00007f2a5174b2ec in _do_self_heal_on_subvol (this=0x23ceab0, child=-1, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:358
#2  0x00007f2a5174b40b in _do_self_heal_on_local_subvol (this=0x23ceab0, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:387
#3  0x00007f2a5174b689 in afr_xl_op (this=0x23ceab0, input=0x7f2a440009c0, output=0x7f2a440011b0)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:448
#4  0x00007f2a5175e888 in afr_notify (this=0x23ceab0, event=14, data=0x7f2a440009c0, data2=0x7f2a440011b0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:3507
#5  0x00007f2a5175f9bd in notify (this=0x23ceab0, event=14, data=0x7f2a440009c0) at ../../../../../xlators/cluster/afr/src/afr.c:51
#6  0x000000000040a215 in glusterfs_handle_translator_op (data=0x23add9c) at ../../../glusterfsd/src/glusterfsd-mgmt.c:726
#7  0x00007f2a55d34753 in synctask_wrap (old_task=0x24b4430) at ../../../libglusterfs/src/syncop.c:144
#8  0x000000390f443690 in ?? () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()
(gdb) f 0
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL, process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, 
    exclusive=_gf_true, crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
(gdb) l
1042            crawl_data->child = idx;
1043            crawl_data->pid = frame->root->pid;
1044            crawl_data->crawl = crawl;
1045            crawl_data->op_data = op_data;
1046            crawl_data->crawl_flags = crawl_flags;
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
1048                    crawl_data->crawl, priv->children[idx]->name);
1049
1050            if (exclusive)
1051                    crawler = afr_dir_exclusive_crawl;
(gdb) p this->name
$1 = 0x23cdd60 "mirror-replicate-1"
(gdb) p crawl_data->crawl
$2 = FULL
(gdb) p priv->children[idx]->name
Cannot access memory at address 0x0
(gdb) p idx
$3 = -1
(gdb) f 1
#1  0x00007f2a5174b2ec in _do_self_heal_on_subvol (this=0x23ceab0, child=-1, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:358
358             afr_start_crawl (this, child, crawl, _self_heal_entry,
(gdb) f 2
#2  0x00007f2a5174b40b in _do_self_heal_on_local_subvol (this=0x23ceab0, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:387
387             _do_self_heal_on_subvol (this, local_child, FULL);
(gdb) p local_child
$4 = -1
(gdb) l _do_self_heal_on_local_subvol
371                     _do_self_heal_on_subvol (this, i, INDEX);
372     }
373
374     void
375     _do_self_heal_on_local_subvol (xlator_t *this, afr_crawl_type_t crawl)
376     {
377             int             local_child = -1;
378             afr_private_t   *priv = NULL;
379
380             priv = this->private;
(gdb) 
381             local_child = afr_get_local_child (&priv->shd,
382                                                priv->child_count);
383             if (local_child < -1) {
384                    gf_log (this->name, GF_LOG_INFO,
385                            "No local bricks found");
386             }
387             _do_self_heal_on_subvol (this, local_child, FULL);
388     }
389
(gdb) l afr_get_local_child
72              return;
73      }
74
75      int
76      afr_get_local_child (afr_self_heald_t *shd, unsigned int child_count)
77      {
78              int i = 0;
79              int ret = -1;
80              for (i = 0; i < child_count; i++) {
81                      if (shd->pos[i] == AFR_POS_LOCAL) {
(gdb) 
82                              ret = i;
83                              break;
84                      }
85              }
86              return ret;
87      }
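
From the frames above, afr_get_local_child() returned -1 because none of mirror-replicate-1's children are at AFR_POS_LOCAL, yet the guard in _do_self_heal_on_local_subvol() tests "local_child < -1", which can never be true, so -1 flows into afr_start_crawl() and priv->children[-1]->name is dereferenced. Below is a minimal sketch of a defensive variant of the quoted function; this is an illustration only, not the change that was actually merged (see http://review.gluster.com/2962).

void
_do_self_heal_on_local_subvol (xlator_t *this, afr_crawl_type_t crawl)
{
        int             local_child = -1;
        afr_private_t   *priv = NULL;

        priv = this->private;
        local_child = afr_get_local_child (&priv->shd,
                                           priv->child_count);
        /* afr_get_local_child() returns a valid index (>= 0) or -1,
         * so test "< 0" and bail out instead of crawling with -1. */
        if (local_child < 0) {
                gf_log (this->name, GF_LOG_INFO,
                        "No local bricks found");
                return;
        }
        _do_self_heal_on_subvol (this, local_child, FULL);
}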




Version-Release number of selected component (if applicable): glusterfs-3.3.0qa25


How reproducible:


Steps to Reproduce:
1. Bring a brick down and turn off the self-heal-daemon.
2. After some time, bring the brick back up and turn the self-heal-daemon back on.
3. Issue a full volume heal (see the CLI sketch below).
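
A rough CLI sketch of the steps above (the volume name is a placeholder; this assumes the glusterfs 3.3-era cluster.self-heal-daemon volume option, killing the brick's glusterfsd process to bring the brick down, and "volume start ... force" to bring it back):

# bring one brick of a replica pair down, e.g. by killing its glusterfsd process
kill <pid-of-brick-glusterfsd>
gluster volume set <volname> cluster.self-heal-daemon off

# ... after some time ...
gluster volume start <volname> force        # restarts the downed brick
gluster volume set <volname> cluster.self-heal-daemon on
gluster volume heal <volname> full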
  
Actual results:
glustershd crashed

Expected results:
glustershd should not crash

Additional info:

56:     type debug/io-stats
 57:     subvolumes mirror-replicate-0 mirror-replicate-1
 58: end-volume

+------------------------------------------------------------------------------+
[2012-03-06 05:02:31.655833] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-2: changing port to 24009 (from 0)
[2012-03-06 05:02:31.655931] W [client.c:2011:client_rpc_notify] 0-mirror-client-2: Registering a grace timer
[2012-03-06 05:02:31.656232] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-3: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656273] W [client.c:2011:client_rpc_notify] 0-mirror-client-3: Registering a grace timer
[2012-03-06 05:02:31.656543] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-1: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656579] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-0: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656609] W [client.c:2011:client_rpc_notify] 0-mirror-client-1: Registering a grace timer
[2012-03-06 05:02:31.656626] W [client.c:2011:client_rpc_notify] 0-mirror-client-0: Registering a grace timer
[2012-03-06 05:02:32.709541] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-03-06 05:02:33.731200] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-03-06 05:02:33.746169] I [client.c:2254:client_init_grace_timer] 0-mirror-client-0: lk-heal = on
[2012-03-06 05:02:33.746232] I [client.c:2254:client_init_grace_timer] 0-mirror-client-1: lk-heal = on
[2012-03-06 05:02:33.746290] I [client.c:2254:client_init_grace_timer] 0-mirror-client-2: lk-heal = on
[2012-03-06 05:02:33.746323] I [client.c:2254:client_init_grace_timer] 0-mirror-client-3: lk-heal = on
[2012-03-06 05:02:33.746437] I [glusterfsd-mgmt.c:1297:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-06 05:02:33
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa25
/lib64/libc.so.6[0x390f432980]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(afr_start_crawl+0x168)[0x7f2a5174d374]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(_do_self_heal_on_subvol+0x97)[0x7f2a5174b2ec]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(_do_self_heal_on_local_subvol+0xb8)[0x7f2a5174b40b]

Comment 1 Amar Tumballi 2012-03-12 09:46:43 UTC
Please update these bugs with respect to 3.3.0qa27; this needs to be worked on as per the target milestone set.

Comment 2 Anand Avati 2012-03-18 07:33:34 UTC
CHANGE: http://review.gluster.com/2962 (Self-heald: Handle errors gracefully and show errors to users) merged in master by Anand Avati (avati)

Comment 3 Raghavendra Bhat 2012-04-05 10:46:16 UTC
Checked with glusterfs-3.3.0qa33. Self-heal-daemon did not crash.

