glustershd crashed because all entries of fresh_children were -1.

Setup: same setup as bugs 3637 and 3639 (2-replica volume with 1 fuse and 1 nfs client). Ran the following tests:
On the fuse client, kernel untar in a while loop.
On the nfs client, rm -rf of the untarred kernel.
Killed one of the bricks, slept, and brought the brick back up.
On another machine, volume set was running in a loop.

This is the backtrace of the core.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 6, Aborted.
#0  0x00000030b8e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000030b8e30265 in raise () from /lib64/libc.so.6
#1  0x00000030b8e31d10 in abort () from /lib64/libc.so.6
#2  0x00000030b8e296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaacf2ba12 in afr_inode_set_read_ctx (this=0x15659800, inode=0x2aaab29e3b38, read_child=1, fresh_children=0x1570a380) at ../../../../../xlators/cluster/afr/src/afr-common.c:419
#4  0x00002aaaacf06dbd in afr_sh_inode_set_read_ctx (sh=0x156fb7e8, this=0x15659800) at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:642
#5  0x00002aaaacf070e4 in afr_sh_data_fix (frame=0x2b771b93cad8, this=0x15659800) at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:705
#6  0x00002aaaacf079b0 in afr_sh_data_fstat_cbk (frame=0x2b771b93cad8, cookie=0x1, this=0x15659800, op_ret=-1, op_errno=107, buf=0x7fff6d689d70) at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:885
#7  0x00002aaaaccb5248 in client3_1_fstat_cbk (req=0x2aaaad36c534, iov=0x7fff6d689ee0, count=1, myframe=0x2b771b6b2790) at ../../../../../xlators/protocol/client/src/client3_1-fops.c:1198
#8  0x00002b771a9f6410 in saved_frames_unwind (saved_frames=0x15666290) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:385
#9  0x00002b771a9f652f in saved_frames_destroy (frames=0x15666290) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:403
#10 0x00002b771a9f6a03 in rpc_clnt_connection_cleanup (conn=0x1565fa60) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:559
#11 0x00002b771a9f7455 in rpc_clnt_notify (trans=0x1565fd60, mydata=0x1565fa60, event=RPC_TRANSPORT_DISCONNECT, data=0x1565fd60) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:863
#12 0x00002b771a9f39f3 in rpc_transport_notify (this=0x1565fd60, event=RPC_TRANSPORT_DISCONNECT, data=0x1565fd60) at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#13 0x00002aaaaab59006 in socket_event_poll_err (this=0x1565fd60) at ../../../../../rpc/rpc-transport/socket/src/socket.c:694
#14 0x00002aaaaab5d47c in socket_event_handler (fd=27, idx=20, data=0x1565fd60, poll_in=1, poll_out=0, poll_err=24) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1797
#15 0x00002b771a79f84c in event_dispatch_epoll_handler (event_pool=0x1564c960, events=0x156513f0, i=0) at ../../../libglusterfs/src/event.c:794
#16 0x00002b771a79fa51 in event_dispatch_epoll (event_pool=0x1564c960) at ../../../libglusterfs/src/event.c:856
#17 0x00002b771a79fdab in event_dispatch (event_pool=0x1564c960) at ../../../libglusterfs/src/event.c:956
#18 0x000000000040784d in main (argc=11, argv=0x7fff6d68a518) at ../../../glusterfsd/src/glusterfsd.c:1592
(gdb) f 3
#3  0x00002aaaacf2ba12 in afr_inode_set_read_ctx (this=0x15659800, inode=0x2aaab29e3b38, read_child=1, fresh_children=0x1570a380) at ../../../../../xlators/cluster/afr/src/afr-common.c:419
419             GF_ASSERT (afr_is_child_present (fresh_children, priv->child_count,
(gdb) l
414             afr_private_t *priv = NULL;
415
416             priv = this->private;
417             GF_ASSERT (read_child >= 0);
418             GF_ASSERT (fresh_children);
419             GF_ASSERT (afr_is_child_present (fresh_children, priv->child_count,
420                                              read_child));
421
422             params.op = AFR_INODE_SET_READ_CTX;
423             params.u.read_ctx.read_child = read_child;
(gdb) l afr_is_child_present
455     }
456
457     gf_boolean_t
458     afr_is_child_present (int32_t *success_children, int32_t child_count,
459                           int32_t child)
460     {
461             gf_boolean_t success_child = _gf_false;
462             int i = 0;
463
464             GF_ASSERT (child < child_count);
(gdb)
465
466             for (i = 0; i < child_count; i++) {
467                     if (success_children[i] == -1)
468                             break;
469                     if (child == success_children[i]) {
470                             success_child = _gf_true;
471                             break;
472                     }
473             }
474             return success_child;
(gdb) p fresh_children
$1 = (int32_t *) 0x1570a380
(gdb) p fresh_children[0]
$2 = -1
(gdb) p fresh_children[1]
$3 = -1
(gdb) p read_child
$4 = 1
(gdb)
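
To make the failing assertion concrete, here is a minimal stand-alone sketch of the afr_is_child_present() logic from the listing, fed with the values printed from the core. It is an illustration only: bool and assert() stand in for gf_boolean_t and GF_ASSERT, the function names are hypothetical, and child_count = 2 is an assumption based on the 2-replica setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-alone copy of the afr_is_child_present() logic from the gdb listing.
 * bool and assert() replace gf_boolean_t and GF_ASSERT (assumption, so the
 * sketch builds without the GlusterFS headers). */
static bool
is_child_present (const int32_t *success_children, int32_t child_count,
                  int32_t child)
{
        bool success_child = false;
        int  i = 0;

        assert (child < child_count);

        for (i = 0; i < child_count; i++) {
                if (success_children[i] == -1)
                        break;  /* -1 ends the list of valid children */
                if (child == success_children[i]) {
                        success_child = true;
                        break;
                }
        }
        return success_child;
}

/* Values from the core: fresh_children = {-1, -1}, read_child = 1.
 * child_count = 2 is assumed from the 2-replica volume. */
static bool
core_values_pass_assert (void)
{
        const int32_t fresh_children[2] = {-1, -1};

        return is_child_present (fresh_children, 2, 1);
}
```

With fresh_children[0] == -1 the loop exits on its first iteration, so the function returns false for any read_child, and the GF_ASSERT at afr-common.c:419 aborts the process.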
This bug is observed because afr_build_sources does not take valid_children into account when computing the sources. afr_sh_data_fix should check for errors in fxattrop and fstat, and should proceed with the data fixing only if there is at least one source and one sink.
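
The check described above could be sketched roughly as follows. This is not the merged patch: sh_data_can_proceed, child_ok, and is_source are hypothetical names introduced for illustration, and the real self-heal structures in afr are laid out differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the actual fix: decide whether data self-heal
 * may proceed.
 *   child_ok[i]  - nonzero if both fxattrop and fstat succeeded on child i
 *   is_source[i] - nonzero if child i was computed to be a source
 * Children on which either call failed count as neither source nor sink. */
static bool
sh_data_can_proceed (const int32_t *child_ok, const int32_t *is_source,
                     int32_t child_count)
{
        int32_t i       = 0;
        int32_t sources = 0;
        int32_t sinks   = 0;

        assert (child_count > 0);

        for (i = 0; i < child_count; i++) {
                if (!child_ok[i])
                        continue;  /* skip children with fxattrop/fstat errors */
                if (is_source[i])
                        sources++;
                else
                        sinks++;
        }
        /* Healing needs something to copy from and something to copy to. */
        return (sources >= 1) && (sinks >= 1);
}
```

In the crash above, fstat on child 1 failed with op_errno=107 after the brick disconnected. Under this sketch that child would be skipped, leaving no sink, so the heal would stop before trying to set a read child.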
*** Bug 3083 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.com/2662 (cluster/afr: Handle afr data self-heal failures gracefully) merged in master by Vijay Bellur (vijay)
Tested with glusterfs-3.3.0qa40. Repeated the same test of untar and rm -rf of the linux kernel in parallel, while bringing a brick down and up. Ran the volume heal command and volume set operations. The self-heal daemon did not crash.