Description of problem:

2x2 distributed-replicate setup with 1 fuse client and 1 nfs client. While some tests were running on the fuse client, a replace-brick was issued and the source brick crashed. This is the backtrace of the core.

Core was generated by `/usr/local/sbin/glusterfsd -s localhost --volfile-id mirror.10.1.11.130.export-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:452
452             trav = xattr->members_list;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:452
#1  0x00007f30a6baeb4e in fop_fxattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:493
#2  0x00007f30a6baf46e in index_fxattrop_cbk (frame=0x7f30aab05350, cookie=0x7f30aab052a4, this=0x22b2800, op_ret=0, op_errno=2, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:648
#3  0x00007f30a6e21526 in afr_fxattrop_cbk (frame=0x7f30aab052a4, cookie=0x7f30aaaf06bc, this=0x22b14f0, op_ret=-1, op_errno=2, xattr=0x0) at ../../../../../xlators/cluster/afr/src/afr-common.c:2704
#4  0x00007f30a7079147 in client3_1_fxattrop_cbk (req=0x7f309c0436b8, iov=0x7f309c0436f8, count=1, myframe=0x7f30aaaf06bc) at ../../../../../xlators/protocol/client/src/client3_1-fops.c:1462
#5  0x00007f30aba25919 in rpc_clnt_handle_reply (clnt=0x7f309c001470, pollin=0x2368400) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:796
#6  0x00007f30aba25cb6 in rpc_clnt_notify (trans=0x7f309c128620, mydata=0x7f309c0014a0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2368400) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:915
#7  0x00007f30aba21da8 in rpc_transport_notify (this=0x7f309c128620, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2368400) at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#8  0x00007f30a8732270 in socket_event_poll_in (this=0x7f309c128620) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1686
#9  0x00007f30a87327f4 in socket_event_handler (fd=14, idx=4, data=0x7f309c128620, poll_in=1, poll_out=0, poll_err=0) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1801
#10 0x00007f30abc7c05c in event_dispatch_epoll_handler (event_pool=0x228bc20, events=0x22a55f0, i=0) at ../../../libglusterfs/src/event.c:794
#11 0x00007f30abc7c27f in event_dispatch_epoll (event_pool=0x228bc20) at ../../../libglusterfs/src/event.c:856
#12 0x00007f30abc7c60a in event_dispatch (event_pool=0x228bc20) at ../../../libglusterfs/src/event.c:956
#13 0x0000000000407dcc in main (argc=19, argv=0x7fffffb51738) at ../../../glusterfsd/src/glusterfsd.c:1612
(gdb) f 0
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:452
452             trav = xattr->members_list;
(gdb) p xattr
$1 = (dict_t *) 0x0
(gdb) up
#1  0x00007f30a6baeb4e in fop_fxattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:493
493             _xattrop_index_action (this, inode, xattr);
(gdb) p xattr
$2 = (dict_t *) 0x0
(gdb) up
#2  0x00007f30a6baf46e in index_fxattrop_cbk (frame=0x7f30aab05350, cookie=0x7f30aab052a4, this=0x22b2800, op_ret=0, op_errno=2, xattr=0x0) at ../../../../../xlators/features/index/src/index.c:648
648             fop_fxattrop_index_action (this, frame->local, xattr);
(gdb) p xattr
$3 = (dict_t *) 0x0
(gdb) up
#3  0x00007f30a6e21526 in afr_fxattrop_cbk (frame=0x7f30aab052a4, cookie=0x7f30aaaf06bc, this=0x22b14f0, op_ret=-1, op_errno=2, xattr=0x0) at ../../../../../xlators/cluster/afr/src/afr-common.c:2704
2704                    AFR_STACK_UNWIND (fxattrop, frame, local->op_ret, local->op_errno,
(gdb) p xattr
$4 = (dict_t *) 0x0
(gdb) p op_ret
$5 = -1
(gdb) p local->op_ret
$6 = 0
(gdb) l afr_fxattrop_cbk
2680
2681    int32_t
2682    afr_fxattrop_cbk (call_frame_t *frame, void *cookie,
2683                      xlator_t *this, int32_t op_ret, int32_t op_errno,
2684                      dict_t *xattr)
2685    {
2686            afr_local_t *local = NULL;
2687
2688            int call_count = -1;
2689
(gdb)
2690            local = frame->local;
2691
2692            LOCK (&frame->lock);
2693            {
2694                    if (op_ret == 0)
2695                            local->op_ret = 0;
2696
2697                    local->op_errno = op_errno;
2698            }
2699            UNLOCK (&frame->lock);
(gdb)
2700
2701            call_count = afr_frame_return (frame);
2702
2703            if (call_count == 0)
2704                    AFR_STACK_UNWIND (fxattrop, frame, local->op_ret, local->op_errno,
2705                                      xattr);
2706
2707            return 0;
2708    }
2709
(gdb)

In afr_{f}xattrop_cbk we do not save the xattr received from the subvolumes. Suppose the first subvolume returns success, with op_ret 0 and a non-NULL xattr. Because the xattr is not stored in local, only local->op_ret is set to 0. If the op on the next subvolume then fails, its op_ret is -1, which is not stored in local (since one of the subvolumes already returned success), but the NULL xattr from that failed reply is what gets passed up in the unwind. Xlators above afr may segfault when they see op_ret 0 and assume that xattr is present.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a volume and start it.
2. Mount it and put some data into the volume.
3. Issue replace-brick.

Actual results:
The source brick crashes.

Expected results:
The source brick should not crash.

Additional info:

gluster volume info mirror

Volume Name: mirror
Type: Distributed-Replicate
Volume ID: f7ab6a61-4629-43e0-92c0-890f425b6afe
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.1.11.130:/export-xfs/mirror
Brick2: 10.1.11.131:/export-xfs/mirror
Brick3: 10.1.11.144:/export-xfs/mirror
Brick4: 10.1.11.145:/export-xfs/mirror
Options Reconfigured:
cluster.self-heal-daemon: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
geo-replication.indexing: on
features.limit-usage: /playground:22GB
features.quota: on
performance.stat-prefetch: on
CHANGE: http://review.gluster.com/2813 (cluster/afr: save the xattr obtained in the {f}xattrop_cbk in local) merged in master by Vijay Bellur (vijay)
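For readers following along, the pattern named in the change title can be illustrated with a small standalone sketch. This is not the merged patch: the mock_* types and the cbk_buggy / cbk_fixed functions below are hypothetical stand-ins for dict_t, afr_local_t and afr_fxattrop_cbk, the frame lock is left out, and real code would presumably also need to take a reference on the saved dict.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for dict_t. */
typedef struct { int members; } mock_dict_t;

/* Hypothetical stand-in for the per-call state kept in frame->local. */
typedef struct {
    int          op_ret;      /* aggregated result, starts at -1 */
    int          op_errno;
    int          call_count;  /* replies still outstanding */
    mock_dict_t *xattr;       /* xattr saved from a successful reply */
} mock_local_t;

/* What the xlator above would receive on unwind. */
typedef struct {
    int          unwound;
    int          op_ret;
    int          op_errno;
    mock_dict_t *xattr;
} mock_unwind_t;

void mock_local_init(mock_local_t *local, int replies)
{
    assert(local != NULL);
    local->op_ret = -1;
    local->op_errno = 0;
    local->call_count = replies;
    local->xattr = NULL;
}

/* Same shape as the listing: unwinds with the LAST reply's xattr. */
void cbk_buggy(mock_local_t *local, mock_unwind_t *out,
               int op_ret, int op_errno, mock_dict_t *xattr)
{
    if (op_ret == 0)
        local->op_ret = 0;
    local->op_errno = op_errno;

    if (--local->call_count == 0) {
        out->unwound = 1;
        out->op_ret = local->op_ret;
        out->op_errno = local->op_errno;
        out->xattr = xattr;            /* may be NULL although op_ret is 0 */
    }
}

/* Sketch of the fix: remember the xattr of a successful reply in local. */
void cbk_fixed(mock_local_t *local, mock_unwind_t *out,
               int op_ret, int op_errno, mock_dict_t *xattr)
{
    if (op_ret == 0) {
        local->op_ret = 0;
        if (local->xattr == NULL)
            local->xattr = xattr;      /* assumption: real code would ref the dict */
    }
    local->op_errno = op_errno;

    if (--local->call_count == 0) {
        out->unwound = 1;
        out->op_ret = local->op_ret;
        out->op_errno = local->op_errno;
        out->xattr = local->xattr;     /* consistent with op_ret */
    }
}
```

Feeding both callbacks a successful reply followed by a failed one (op_ret -1, op_errno 2, NULL xattr) reproduces the combination seen in frame #2 of the backtrace with cbk_buggy (op_ret 0, xattr NULL), while cbk_fixed unwinds with the xattr from the successful reply.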
Checked with glusterfs-3.3.0qa40. The replace-brick command no longer triggers this crash, since the xattrs returned by the subvolumes are now stored properly.