Description of problem:
2x2 distributed-replicate volume, with 1 fuse client and 1 nfs client. On the fuse client, rdd, ping_pong, fs-perf-test and threaded-io were running in a loop. A brick was brought down and self-heal-daemon was turned off. After some time the brick was brought back up, self-heal-daemon was turned on, and "volume heal ... full" was issued. glustershd crashed in afr_start_crawl (ping_pong was running on the client at the time of the crash).

Backtrace of the core:

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL,
    process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, exclusive=_gf_true,
    crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64

(gdb) bt
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL,
    process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, exclusive=_gf_true,
    crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
#1  0x00007f2a5174b2ec in _do_self_heal_on_subvol (this=0x23ceab0, child=-1, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:358
#2  0x00007f2a5174b40b in _do_self_heal_on_local_subvol (this=0x23ceab0, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:387
#3  0x00007f2a5174b689 in afr_xl_op (this=0x23ceab0, input=0x7f2a440009c0, output=0x7f2a440011b0)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:448
#4  0x00007f2a5175e888 in afr_notify (this=0x23ceab0, event=14, data=0x7f2a440009c0, data2=0x7f2a440011b0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:3507
#5  0x00007f2a5175f9bd in notify (this=0x23ceab0, event=14, data=0x7f2a440009c0)
    at ../../../../../xlators/cluster/afr/src/afr.c:51
#6  0x000000000040a215 in glusterfs_handle_translator_op (data=0x23add9c)
    at ../../../glusterfsd/src/glusterfsd-mgmt.c:726
#7  0x00007f2a55d34753 in synctask_wrap (old_task=0x24b4430)
    at ../../../libglusterfs/src/syncop.c:144
#8  0x000000390f443690 in ?? () from /lib64/libc.so.6
#9  0x0000000000000000 in ?? ()

(gdb) f 0
#0  0x00007f2a5174d374 in afr_start_crawl (this=0x23ceab0, idx=-1, crawl=FULL,
    process_entry=0x7f2a5174b062 <_self_heal_entry>, op_data=0x0, exclusive=_gf_true,
    crawl_flags=1, crawl_done=0x7f2a5174b220 <afr_crawl_done>)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:1047
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
(gdb) l
1042            crawl_data->child = idx;
1043            crawl_data->pid = frame->root->pid;
1044            crawl_data->crawl = crawl;
1045            crawl_data->op_data = op_data;
1046            crawl_data->crawl_flags = crawl_flags;
1047            gf_log (this->name, GF_LOG_INFO, "starting crawl %d for %s",
1048                    crawl_data->crawl, priv->children[idx]->name);
1049
1050            if (exclusive)
1051                    crawler = afr_dir_exclusive_crawl;
(gdb) p this->name
$1 = 0x23cdd60 "mirror-replicate-1"
(gdb) p crawl_data->crawl
$2 = FULL
(gdb) p priv->children[idx]->name
Cannot access memory at address 0x0
(gdb) p idx
$3 = -1
(gdb) f 1
#1  0x00007f2a5174b2ec in _do_self_heal_on_subvol (this=0x23ceab0, child=-1, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:358
358             afr_start_crawl (this, child, crawl, _self_heal_entry,
(gdb) f 2
#2  0x00007f2a5174b40b in _do_self_heal_on_local_subvol (this=0x23ceab0, crawl=FULL)
    at ../../../../../xlators/cluster/afr/src/afr-self-heald.c:387
387             _do_self_heal_on_subvol (this, local_child, FULL);
(gdb) p local_child
$4 = -1
(gdb) l _do_self_heal_on_local_subvol
371                     _do_self_heal_on_subvol (this, i, INDEX);
372             }
373
374     void
375     _do_self_heal_on_local_subvol (xlator_t *this, afr_crawl_type_t crawl)
376     {
377             int             local_child = -1;
378             afr_private_t   *priv = NULL;
379
380             priv = this->private;
(gdb)
381             local_child = afr_get_local_child (&priv->shd,
382                                                priv->child_count);
383             if (local_child < -1) {
384                     gf_log (this->name, GF_LOG_INFO,
385                             "No local bricks found");
386             }
387             _do_self_heal_on_subvol (this, local_child, FULL);
388     }
389
(gdb) l afr_get_local_child
72              return;
73      }
74
75      int
76      afr_get_local_child (afr_self_heald_t *shd, unsigned int child_count)
77      {
78              int     i = 0;
79              int     ret = -1;
80              for (i = 0; i < child_count; i++) {
81                      if (shd->pos[i] == AFR_POS_LOCAL) {
(gdb)
82                              ret = i;
83                              break;
84                      }
85              }
86              return ret;
87      }

Version-Release number of selected component (if applicable):
glusterfs 3.3.0qa25 (from the package-string in the crash dump below)

How reproducible:

Steps to Reproduce:
1. Bring a brick down and turn off self-heal-daemon.
2. After some time, bring the brick back up and turn on self-heal-daemon.
3. Issue "volume heal <volname> full".

Actual results:
glustershd crashed.

Expected results:
glustershd should not crash.

Additional info:

Tail of the glustershd volfile:

56: type debug/io-stats
57: subvolumes mirror-replicate-0 mirror-replicate-1
58: end-volume
+------------------------------------------------------------------------------+

glustershd log around the crash:

[2012-03-06 05:02:31.655833] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-2: changing port to 24009 (from 0)
[2012-03-06 05:02:31.655931] W [client.c:2011:client_rpc_notify] 0-mirror-client-2: Registering a grace timer
[2012-03-06 05:02:31.656232] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-3: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656273] W [client.c:2011:client_rpc_notify] 0-mirror-client-3: Registering a grace timer
[2012-03-06 05:02:31.656543] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-1: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656579] I [rpc-clnt.c:1665:rpc_clnt_reconfig] 0-mirror-client-0: changing port to 24009 (from 0)
[2012-03-06 05:02:31.656609] W [client.c:2011:client_rpc_notify] 0-mirror-client-1: Registering a grace timer
[2012-03-06 05:02:31.656626] W [client.c:2011:client_rpc_notify] 0-mirror-client-0: Registering a grace timer
[2012-03-06 05:02:32.709541] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-03-06 05:02:33.731200] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-03-06 05:02:33.746169] I [client.c:2254:client_init_grace_timer] 0-mirror-client-0: lk-heal = on
[2012-03-06 05:02:33.746232] I [client.c:2254:client_init_grace_timer] 0-mirror-client-1: lk-heal = on
[2012-03-06 05:02:33.746290] I [client.c:2254:client_init_grace_timer] 0-mirror-client-2: lk-heal = on
[2012-03-06 05:02:33.746323] I [client.c:2254:client_init_grace_timer] 0-mirror-client-3: lk-heal = on
[2012-03-06 05:02:33.746437] I [glusterfsd-mgmt.c:1297:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-06 05:02:33
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa25
/lib64/libc.so.6[0x390f432980]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(afr_start_crawl+0x168)[0x7f2a5174d374]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(_do_self_heal_on_subvol+0x97)[0x7f2a5174b2ec]
/usr/local/lib/glusterfs/3.3.0qa25/xlator/cluster/replicate.so(_do_self_heal_on_local_subvol+0xb8)[0x7f2a5174b40b]
Please update these bugs with respect to 3.3.0qa27; they need to be worked on as per the target milestone set.
CHANGE: http://review.gluster.com/2962 (Self-heald: Handle errors gracefully and show errors to users) merged in master by Anand Avati (avati)
Checked with glusterfs-3.3.0qa33. Self-heal-daemon did not crash.