Bug 1320374

Summary: Glusterd crashed just after a peer probe command failed.
Product: [Community] GlusterFS Reporter: Atin Mukherjee <amukherj>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.7.9 CC: amukherj, bugs, rkavunga, sasundar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.7.10 Doc Type: Bug Fix
Clone Of: 1318546 Environment:
Last Closed: 2016-04-19 07:00:49 UTC Type: Bug
Bug Depends On: 1318546    
Bug Blocks:    

Description Atin Mukherjee 2016-03-23 04:38:33 UTC
+++ This bug was initially created as a clone of Bug #1318546 +++

Description of problem:

If a peer probe command fails because of an unresolvable IP and a gluster volume stop command is run immediately afterwards, glusterd crashes.



Version-Release number of selected component (if applicable):

mainline.

How reproducible:
50%

Steps to Reproduce:
1. Create a volume.
2. Run a peer probe against an invalid IP (e.g. a.b.c.d).
3. Stop the volume.
4. Alternatively, run the test ./tests/bugs/glusterfs/bug-879490.t from the gluster source tree.

Actual results:

Glusterd crashed

Expected results:

Glusterd should not crash

Additional info:

--- Additional comment from Mohammed Rafi KC on 2016-03-17 04:24:06 EDT ---

(gdb) bt
#0  0x00007f5c0e5c9fc6 in dict_lookup_common (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str") at dict.c:287
#1  0x00007f5c0e5cc6d1 in dict_get_with_ref (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", data=0x7f5c01c5f1c0) at dict.c:1397
#2  0x00007f5c0e5cdae2 in dict_get_str (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", str=0x7f5c01c5f238) at dict.c:2139
#3  0x00007f5c04c4f901 in glusterd_xfer_cli_probe_resp (req=0x7f5bf800093c, op_ret=-1, op_errno=107, op_errstr=0x0, hostname=0x7f5bec09ca70 "a.b.c.d", port=24007, dict=0x7f5bec004e3c)
    at glusterd-handler.c:3944
#4  0x00007f5c04c53107 in glusterd_friend_remove_notify (peerctx=0x7f5bec003de0, op_errno=107) at glusterd-handler.c:5080
#5  0x00007f5c04c537c5 in __glusterd_peer_rpc_notify (rpc=0x7f5bec004310, mydata=0x7f5bec003de0, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-handler.c:5210
#6  0x00007f5c04c44238 in cds_list_add_tail_rcu (newp=0x7f5bec004310, head=0x7f5bec003de0) at ../../../../contrib/userspace-rcu/rculist-extra.h:36
#7  0x00007f5c04c538aa in __glusterd_peer_rpc_notify (rpc=0x7f5c01c60700, mydata=0x7f5c01c60700, event=RPC_CLNT_CONNECT, data=0x1c5fbf0) at glusterd-handler.c:5234
#8  0x00007f5c0e39c5a4 in rpc_clnt_notify (trans=0x7f5bec09e520, mydata=0x7f5bec004340, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-clnt.c:867
#9  0x00007f5c0e398ac9 in rpc_transport_notify (this=0x7f5bec09e520, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-transport.c:541
#10 0x00007f5c03cb34f6 in socket_connect_error_cbk (opaque=0x7f5bf0000b90) at socket.c:2814
#11 0x0000003400607ee5 in start_thread (arg=0x7f5c01c60700) at pthread_create.c:309
#12 0x00000034002f4d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

--- Additional comment from Mohammed Rafi KC on 2016-03-17 04:29:44 EDT ---

glusterd_friend_remove_notify was called twice, which resulted in accessing an already freed dictionary stored in peerctx.args.

I guess glusterd_friend_remove_notify was called the second time as part of the automatic timer-based reconnect logic. Since the reconnect will always fail, this can cause a crash.
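
To illustrate the failure mode described above, here is a minimal sketch in C. It is not glusterd code: struct peer_ctx_sketch, struct probe_args_sketch and friend_remove_notify_sketch are hypothetical stand-ins for peerctx, the dictionary in peerctx.args and glusterd_friend_remove_notify. It only shows why a second notification is dangerous once the first one has freed the arguments, and one defensive way to make the second call harmless.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the dictionary kept in peerctx.args. */
struct probe_args_sketch {
    const char *cmd_str;
};

/* Hypothetical stand-in for the peer context. */
struct peer_ctx_sketch {
    struct probe_args_sketch *args; /* owned by the context */
    int replies_sent;
};

/* Allocate a context that owns one argument object; NULL on failure. */
static struct peer_ctx_sketch *peer_ctx_sketch_new(const char *cmd)
{
    struct peer_ctx_sketch *ctx = malloc(sizeof(*ctx));
    if (ctx == NULL)
        return NULL;
    ctx->args = malloc(sizeof(*ctx->args));
    if (ctx->args == NULL) {
        free(ctx);
        return NULL;
    }
    ctx->args->cmd_str = cmd;
    ctx->replies_sent = 0;
    return ctx;
}

/*
 * Send the failure reply and release the arguments.
 * Returns 1 if a reply was sent, 0 if the call was a duplicate.
 * Without the NULL check and the reset below, a second call would
 * read and free memory that has already been released.
 */
static int friend_remove_notify_sketch(struct peer_ctx_sketch *ctx)
{
    assert(ctx != NULL);
    if (ctx->args == NULL)
        return 0; /* arguments already consumed by an earlier call */
    printf("probe failed, command was: %s\n", ctx->args->cmd_str);
    free(ctx->args);
    ctx->args = NULL; /* make a repeated call detectable */
    ctx->replies_sent++;
    return 1;
}

/* Release the context; free(NULL) is a no-op, so this is safe either way. */
static void peer_ctx_sketch_free(struct peer_ctx_sketch *ctx)
{
    if (ctx == NULL)
        return;
    free(ctx->args);
    free(ctx);
}
```

Calling friend_remove_notify_sketch twice on the same context sends one reply and reports the second call as a duplicate instead of touching freed memory.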

--- Additional comment from Vijay Bellur on 2016-03-20 09:35:39 EDT ---

REVIEW: http://review.gluster.org/13790 (glusterd/rpc : Discard duplicate Disconnect events) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Mohammed Rafi KC on 2016-03-22 06:32:08 EDT ---

The newly created rpc for the friend is cleaned up through friend_sm. This crash can be hit only if friend_sm takes longer than the reconnect.

--- Additional comment from Vijay Bellur on 2016-03-22 15:25:04 EDT ---

COMMIT: http://review.gluster.org/13790 committed in master by Jeff Darcy (jdarcy) 
------
commit 1081584d4c2d26e56fea623ecfadd305c6e3d3bc
Author: Atin Mukherjee <amukherj>
Date:   Sun Mar 20 18:31:00 2016 +0530

    glusterd/rpc : Discard duplicate Disconnect events
    
    If a peer rpc disconnect event has been already processed, skip the furthers as
    processing them are overheads and sometimes may lead to a crash like due to a
    double free
    
    Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
    BUG: 1318546
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13790
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 1 Vijay Bellur 2016-03-23 04:39:22 UTC
REVIEW: http://review.gluster.org/13813 (glusterd/rpc : Discard duplicate Disconnect events) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-03-23 14:50:34 UTC
COMMIT: http://review.gluster.org/13813 committed in release-3.7 by Jeff Darcy (jdarcy) 
------
commit 8f5323882d90e3dd4ab855c79737e6d2302fc739
Author: Atin Mukherjee <amukherj>
Date:   Sun Mar 20 18:31:00 2016 +0530

    glusterd/rpc : Discard duplicate Disconnect events
    
    Backport of http://review.gluster.org/#/c/13790/
    
    If a peer rpc disconnect event has been already processed, skip the furthers as
    processing them are overheads and sometimes may lead to a crash like due to a
    double free
    
    Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
    BUG: 1320374
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13790
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
    Reviewed-on: http://review.gluster.org/13813

Comment 3 Kaushal 2016-04-19 07:00:49 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.10, please open a new bug report.

glusterfs-3.7.10 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-April/026164.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user