Bug 797119

Summary: [glusterfs-3.3.0qa24]: glusterfs crashed in xattrop when replace-brick is given
Product: [Community] GlusterFS
Reporter: Raghavendra Bhat <rabhat>
Component: replicate
Assignee: Raghavendra Bhat <rabhat>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: medium
Version: mainline
CC: gluster-bugs, pkarampu
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 17:32:14 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: glusterfs-3.3.0qa40
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 817967

Description Raghavendra Bhat 2012-02-24 10:35:45 UTC
Description of problem:
2x2 distributed-replicate setup with 1 fuse and 1 nfs client. While some tests were running on the fuse client, replace-brick was issued and the source brick crashed.

This is the backtrace of the core.

Core was generated by `/usr/local/sbin/glusterfsd -s localhost --volfile-id mirror.10.1.11.130.export-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:452
452             trav = xattr->members_list;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:452
#1  0x00007f30a6baeb4e in fop_fxattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:493
#2  0x00007f30a6baf46e in index_fxattrop_cbk (frame=0x7f30aab05350, cookie=0x7f30aab052a4, this=0x22b2800, op_ret=0, op_errno=2, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:648
#3  0x00007f30a6e21526 in afr_fxattrop_cbk (frame=0x7f30aab052a4, cookie=0x7f30aaaf06bc, this=0x22b14f0, op_ret=-1, op_errno=2, xattr=0x0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:2704
#4  0x00007f30a7079147 in client3_1_fxattrop_cbk (req=0x7f309c0436b8, iov=0x7f309c0436f8, count=1, myframe=0x7f30aaaf06bc)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:1462
#5  0x00007f30aba25919 in rpc_clnt_handle_reply (clnt=0x7f309c001470, pollin=0x2368400) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:796
#6  0x00007f30aba25cb6 in rpc_clnt_notify (trans=0x7f309c128620, mydata=0x7f309c0014a0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2368400)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:915
#7  0x00007f30aba21da8 in rpc_transport_notify (this=0x7f309c128620, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2368400)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#8  0x00007f30a8732270 in socket_event_poll_in (this=0x7f309c128620) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1686
#9  0x00007f30a87327f4 in socket_event_handler (fd=14, idx=4, data=0x7f309c128620, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1801
#10 0x00007f30abc7c05c in event_dispatch_epoll_handler (event_pool=0x228bc20, events=0x22a55f0, i=0) at ../../../libglusterfs/src/event.c:794
#11 0x00007f30abc7c27f in event_dispatch_epoll (event_pool=0x228bc20) at ../../../libglusterfs/src/event.c:856
#12 0x00007f30abc7c60a in event_dispatch (event_pool=0x228bc20) at ../../../libglusterfs/src/event.c:956
#13 0x0000000000407dcc in main (argc=19, argv=0x7fffffb51738) at ../../../glusterfsd/src/glusterfsd.c:1612
(gdb)  f 0
#0  0x00007f30a6bae956 in _xattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:452
452             trav = xattr->members_list;
(gdb) p xattr
$1 = (dict_t *) 0x0
(gdb) up
#1  0x00007f30a6baeb4e in fop_fxattrop_index_action (this=0x22b2800, inode=0x7f30a4366130, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:493
493             _xattrop_index_action (this, inode, xattr);
(gdb) p xattr
$2 = (dict_t *) 0x0
(gdb) up
#2  0x00007f30a6baf46e in index_fxattrop_cbk (frame=0x7f30aab05350, cookie=0x7f30aab052a4, this=0x22b2800, op_ret=0, op_errno=2, xattr=0x0)
    at ../../../../../xlators/features/index/src/index.c:648
648             fop_fxattrop_index_action (this, frame->local, xattr);
(gdb) p xattr
$3 = (dict_t *) 0x0
(gdb) up
#3  0x00007f30a6e21526 in afr_fxattrop_cbk (frame=0x7f30aab052a4, cookie=0x7f30aaaf06bc, this=0x22b14f0, op_ret=-1, op_errno=2, xattr=0x0)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:2704
2704                    AFR_STACK_UNWIND (fxattrop, frame, local->op_ret, local->op_errno,
(gdb) p xattr
$4 = (dict_t *) 0x0
(gdb) p op_ret
$5 = -1
(gdb) p local->op_ret
$6 = 0
(gdb) l afr_fxattrop_cbk
2680
2681    int32_t
2682    afr_fxattrop_cbk (call_frame_t *frame, void *cookie,
2683                      xlator_t *this, int32_t op_ret, int32_t op_errno,
2684                      dict_t *xattr)
2685    {
2686            afr_local_t *local = NULL;
2687
2688            int call_count = -1;
2689
(gdb) 
2690            local = frame->local;
2691
2692            LOCK (&frame->lock);
2693            {
2694                    if (op_ret == 0)
2695                            local->op_ret = 0;
2696
2697                    local->op_errno = op_errno;
2698            }
2699            UNLOCK (&frame->lock);
(gdb) 
2700
2701            call_count = afr_frame_return (frame);
2702
2703            if (call_count == 0)
2704                    AFR_STACK_UNWIND (fxattrop, frame, local->op_ret, local->op_errno,
2705                                      xattr);
2706
2707            return 0;
2708    }
2709
(gdb) 

In afr_{f}xattrop_cbk we are not saving the xattr we have received from the subvolumes. Suppose the 1st subvolume returned success with op_ret 0 and a non-NULL xattr. Since we are not storing the xattr in local, only local->op_ret is set to 0. If the op on the next subvolume fails, op_ret is -1, which we do not store in local (since one of the subvolumes returned success), but the unwind will send the NULL xattr from that failed callback.

Xlators above afr might segfault when they see op_ret 0 and assume that xattr is present.




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. create a volume and start it
2. mount it and put some data into the volume
3. issue replace-brick while I/O is in progress
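The steps above might look like the following session (a sketch, not a verified transcript: the hostnames and brick paths are taken from the volume info below, and the destination brick path is an assumption for illustration). This is a CLI fragment that needs a running gluster cluster.

```shell
# Create and start a 2x2 distributed-replicate volume
gluster volume create mirror replica 2 \
    10.1.11.130:/export-xfs/mirror 10.1.11.131:/export-xfs/mirror \
    10.1.11.144:/export-xfs/mirror 10.1.11.145:/export-xfs/mirror
gluster volume start mirror

# Mount via fuse and keep some I/O running on the mount
mount -t glusterfs 10.1.11.130:/mirror /mnt/mirror
cp -r /some/test/data /mnt/mirror/ &

# While the load is active, replace one of the bricks
# (destination path is hypothetical)
gluster volume replace-brick mirror \
    10.1.11.130:/export-xfs/mirror \
    10.1.11.130:/export-xfs/mirror-new start
```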
  
Actual results:

The source brick crashes

Expected results:

The source brick should not crash.

Additional info:

gluster volume info mirror
 
Volume Name: mirror
Type: Distributed-Replicate
Volume ID: f7ab6a61-4629-43e0-92c0-890f425b6afe
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.1.11.130:/export-xfs/mirror
Brick2: 10.1.11.131:/export-xfs/mirror
Brick3: 10.1.11.144:/export-xfs/mirror
Brick4: 10.1.11.145:/export-xfs/mirror
Options Reconfigured:
cluster.self-heal-daemon: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
geo-replication.indexing: on
features.limit-usage: /playground:22GB
features.quota: on
performance.stat-prefetch: on

Comment 1 Anand Avati 2012-03-12 12:23:04 UTC
CHANGE: http://review.gluster.com/2813 (cluster/afr: save the xattr obtained in the {f}xattrop_cbk in local) merged in master by Vijay Bellur (vijay)

Comment 2 Raghavendra Bhat 2012-05-09 09:57:01 UTC
Checked with glusterfs-3.3.0qa40. The replace-brick command no longer causes this crash since the xattrs that have been returned are now stored properly.