Bug 765510 (GLUSTER-3778)

Summary: In pure-replicate volume, src-brick crashed during replace-brick operation.
Product: [Community] GlusterFS
Reporter: krishnan parthasarathi <kparthas>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 3.2.4
CC: gluster-bugs, nsathyan
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix

Description krishnan parthasarathi 2011-11-03 12:00:32 UTC
The source brick (with pump) crashes on an assert condition in afr_lookup_save_gfid.

Seen on master as of Nov 3, commit 3200a2be434c462b43bf3ffe0343ddc8900c5d88.

Steps to reproduce:
1) Create a pure-replicate volume (cluster.self-heal-daemon must be on).
2) Start a replace-brick operation on one of the replica bricks (see the example commands below).
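
For reference, one possible command sequence matching the volume info below. The destination brick trantor:/gfs/brick3 is only an illustrative placeholder, and the option name/replace-brick syntax assumed here is the 3.x CLI:

root@trantor:~# gluster volume create vol replica 2 trantor:/gfs/brick1 trantor:/gfs/brick2
root@trantor:~# gluster volume start vol
root@trantor:~# gluster volume set vol cluster.self-heal-daemon on
root@trantor:~# gluster volume replace-brick vol trantor:/gfs/brick2 trantor:/gfs/brick3 start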

Info:
root@trantor:~# gluster volume info
 
Volume Name: vol
Type: Replicate
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: trantor:/gfs/brick1
Brick2: trantor:/gfs/brick2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG

Last seen activity in self-heal-daemon,
<snip>

[2011-11-03 15:39:08.734125] I [afr-common.c:3479:afr_notify] 0-vol-replicate-0: subvol 1 came up, start crawl
[2011-11-03 15:39:08.734148] I [afr-self-heald.c:487:afr_proactive_self_heal] 0-vol-replicate-0: starting crawl for 1
[2011-11-03 15:39:09.043706] W [socket.c:1510:__socket_proto_state_machine] 0-vol-client-1: reading from socket failed. Error (Transport endpoint is not connected), peer (192.168.1.84:24011)
[2011-11-03 15:39:09.043989] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x13c) [0x7f35adfabe01] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x110) [0x7f35adfab326] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f35adfaadb1]))) 0-vol-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2011-11-03 15:39:08.849370
[2011-11-03 15:39:09.044016] W [client3_1-fops.c:2250:client3_1_lookup_cbk] 0-vol-client-1: remote operation failed: Transport endpoint is not connected. Path: /file12
[2011-11-03 15:39:09.044089] I [client.c:1885:client_rpc_notify] 0-vol-client-1: disconnected

</snip>

Last seen activity on server (brick2),
<snip>

[2011-11-03 15:39:08.845637] D [inodelk.c:297:__inode_unlock_lock] 0-vol-locks:  Matching lock found for unlock
[2011-11-03 15:39:08.849437] D [afr-common.c:128:afr_lookup_xattr_req_prepare] 0-vol-pump: /file12: failed to get the gfid from dict

</snip>

(gdb) bt
#0  0x00007f0136053ba5 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f01360576b0 in abort () at abort.c:92
#2  0x00007f013604ca71 in __assert_fail (assertion=0x7f0130c23bbd "new && !uuid_is_null (new)", file=<value optimized out>, line=145, 
    function=0x7f0130c27a00 "afr_lookup_save_gfid") at assert.c:81
#3  0x00007f0130bf48fa in afr_lookup_save_gfid (dst=0x7f011c00de38 "", new=0x0, inode=0x7f012cea2238) at afr-common.c:145
#4  0x00007f0130bf9a44 in afr_lookup (frame=0x7f01355228d8, this=0xd1ad40, loc=0x7f011c005fa0, xattr_req=0x7f011c00bc90) at afr-common.c:2017
#5  0x00007f0130c189d3 in pump_lookup (frame=0x7f01355228d8, this=0xd1ad40, loc=0x7f011c005fa0, xattr_req=0x7f011c00bc90) at pump.c:1754
#6  0x00007f013098794d in marker_lookup (frame=0x7f0135525d40, this=0xd1c000, loc=0x7f011c005fa0, xattr_req=0x7f011c00bc90) at marker.c:2193
#7  0x00007f013076adf9 in io_stats_lookup (frame=0x7f013553b744, this=0xd1d630, loc=0x7f011c005fa0, xattr_req=0x7f011c00bc90) at io-stats.c:1822
#8  0x00007f0130548828 in server_lookup_resume (frame=0x7f013529b394, bound_xl=0xd1d630) at server3_1-fops.c:2665
#9  0x00007f01305340fe in server_resolve_done (frame=0x7f013529b394) at server-resolve.c:597
#10 0x00007f01305341ff in server_resolve_all (frame=0x7f013529b394) at server-resolve.c:632
#11 0x00007f0130534092 in server_resolve (frame=0x7f013529b394) at server-resolve.c:579
#12 0x00007f01305341d6 in server_resolve_all (frame=0x7f013529b394) at server-resolve.c:628
#13 0x00007f0130533caf in server_resolve_entry (frame=0x7f013529b394) at server-resolve.c:453
#14 0x00007f0130533fa7 in server_resolve (frame=0x7f013529b394) at server-resolve.c:561
#15 0x00007f0130534181 in server_resolve_all (frame=0x7f013529b394) at server-resolve.c:621
#16 0x00007f0130534297 in resolve_and_resume (frame=0x7f013529b394, fn=0x7f01305485d7 <server_lookup_resume>) at server-resolve.c:651
#17 0x00007f013054eae6 in server_lookup (req=0x7f0136f3904c) at server3_1-fops.c:5119
#18 0x00007f01369e6170 in rpcsvc_handle_rpc_call (svc=0xd21500, trans=0xf7f960, msg=0x7f011c00a200) at rpcsvc.c:507
#19 0x00007f01369e6513 in rpcsvc_notify (trans=0xf7f960, mydata=0xd21500, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f011c00a200) at rpcsvc.c:603
#20 0x00007f01369ebfa9 in rpc_transport_notify (this=0xf7f960, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f011c00a200) at rpc-transport.c:498
#21 0x00007f0133f3a3cc in socket_event_poll_in (this=0xf7f960) at socket.c:1675
#22 0x00007f0133f3a950 in socket_event_handler (fd=21, idx=6, data=0xf7f960, poll_in=1, poll_out=0, poll_err=0) at socket.c:1790
#23 0x00007f0136c44d92 in event_dispatch_epoll_handler (event_pool=0xd0a150, events=0xd0ede0, i=0) at event.c:794
#24 0x00007f0136c44fb5 in event_dispatch_epoll (event_pool=0xd0a150) at event.c:856
#25 0x00007f0136c45340 in event_dispatch (event_pool=0xd0a150) at event.c:956
#26 0x0000000000407d2c in main (argc=17, argv=0x7ffffbf9a578) at glusterfsd.c:1592
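
For context, a minimal C sketch of the condition that trips the assert in frame #3 (this is not the actual GlusterFS source; the signatures, the main() driver, and the call chain are simplified illustrations). When the lookup xattr_req dict carries no gfid, as logged by afr_lookup_xattr_req_prepare above, a NULL gfid reaches afr_lookup_save_gfid and the assert at afr-common.c:145 aborts the brick:

#include <assert.h>
#include <stddef.h>
#include <uuid/uuid.h>

/* Simplified stand-in for afr-common.c:145: abort if no valid gfid was supplied. */
static void
afr_lookup_save_gfid (uuid_t dst, const unsigned char *new, void *inode)
{
        assert (new && !uuid_is_null (new));
        uuid_copy (dst, new);
        (void) inode;
}

static void
afr_lookup_sketch (unsigned char *gfid_from_dict, uuid_t saved_gfid, void *inode)
{
        /* When "failed to get the gfid from dict" is logged, no gfid reaches
         * this point and the pointer below is NULL, matching frame #3
         * (new=0x0) in the backtrace. */
        afr_lookup_save_gfid (saved_gfid, gfid_from_dict, inode);
}

int
main (void)
{
        uuid_t saved = {0};
        /* Simulate the failing lookup: the NULL gfid trips the assert and
         * aborts, just as the src-brick process did. */
        afr_lookup_sketch (NULL, saved, NULL);
        return 0;
}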

Comment 1 Pranith Kumar K 2011-11-09 06:19:27 UTC
This does not happen after the fix for bug 3783.