1318546 – Glusterd crashed just after a peer probe command failed.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	glusterd
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Atin Mukherjee
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1320374

Reported:	2016-03-17 08:23 UTC by Mohammed Rafi KC
Modified:	2016-06-16 14:01 UTC
CC List:	3 users
Fixed In Version:	glusterfs-3.8rc2
Clone Of:
Clones:	1320374
Environment:
Last Closed:	2016-06-16 14:01:05 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Mohammed Rafi KC 2016-03-17 08:23:10 UTC

Description of problem:

If a peer probe command fails because of an unresolvable IP and a gluster volume stop command is run immediately afterwards, the combination results in a glusterd crash.



Version-Release number of selected component (if applicable):

mainline.

How reproducible:
50%

Steps to Reproduce:
1. Create a volume.
2. Do a peer probe on an invalid IP (e.g. a.b.c.d).
3. Stop the volume.

Alternatively, run the test ./tests/bugs/glusterfs/bug-879490.t from the gluster source.

Actual results:

Glusterd crashed

Expected results:

Glusterd should not crash

Additional info:

Comment 1 Mohammed Rafi KC 2016-03-17 08:24:06 UTC

(gdb) bt
#0  0x00007f5c0e5c9fc6 in dict_lookup_common (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str") at dict.c:287
#1  0x00007f5c0e5cc6d1 in dict_get_with_ref (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", data=0x7f5c01c5f1c0) at dict.c:1397
#2  0x00007f5c0e5cdae2 in dict_get_str (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", str=0x7f5c01c5f238) at dict.c:2139
#3  0x00007f5c04c4f901 in glusterd_xfer_cli_probe_resp (req=0x7f5bf800093c, op_ret=-1, op_errno=107, op_errstr=0x0, hostname=0x7f5bec09ca70 "a.b.c.d", port=24007, dict=0x7f5bec004e3c)
    at glusterd-handler.c:3944
#4  0x00007f5c04c53107 in glusterd_friend_remove_notify (peerctx=0x7f5bec003de0, op_errno=107) at glusterd-handler.c:5080
#5  0x00007f5c04c537c5 in __glusterd_peer_rpc_notify (rpc=0x7f5bec004310, mydata=0x7f5bec003de0, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-handler.c:5210
#6  0x00007f5c04c44238 in cds_list_add_tail_rcu (newp=0x7f5bec004310, head=0x7f5bec003de0) at ../../../../contrib/userspace-rcu/rculist-extra.h:36
#7  0x00007f5c04c538aa in __glusterd_peer_rpc_notify (rpc=0x7f5c01c60700, mydata=0x7f5c01c60700, event=RPC_CLNT_CONNECT, data=0x1c5fbf0) at glusterd-handler.c:5234
#8  0x00007f5c0e39c5a4 in rpc_clnt_notify (trans=0x7f5bec09e520, mydata=0x7f5bec004340, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-clnt.c:867
#9  0x00007f5c0e398ac9 in rpc_transport_notify (this=0x7f5bec09e520, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-transport.c:541
#10 0x00007f5c03cb34f6 in socket_connect_error_cbk (opaque=0x7f5bf0000b90) at socket.c:2814
#11 0x0000003400607ee5 in start_thread (arg=0x7f5c01c60700) at pthread_create.c:309
#12 0x00000034002f4d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment 2 Mohammed Rafi KC 2016-03-17 08:29:44 UTC

glusterd_friend_remove_notify was called twice, which resulted in access to an already freed dictionary stored in peerctx.args.

I guess the second call to glusterd_friend_remove_notify came from the automatic timer-based reconnect logic. Since the reconnect will always fail, this can cause a crash.
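
To illustrate the failure mode described above, here is a minimal sketch of a notify handler that releases a dictionary held in a context and tolerates being called a second time. All names (demo_dict_t, demo_peerctx_t, demo_remove_notify) are hypothetical stand-ins and are not glusterd's real types or functions. Clearing the stored pointer after the unref is one common defensive pattern; it is an assumption here, not necessarily what the actual fix does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; NOT glusterd's real dict_t or peerctx types. */
typedef struct { int refcount; } demo_dict_t;
typedef struct { demo_dict_t *dict; } demo_peerctx_t;

/* Counts how many times a (pretend) CLI reply was sent. */
static int demo_replies_sent = 0;

/* Allocate a dictionary holding a single reference. */
demo_dict_t *demo_dict_new(void)
{
    demo_dict_t *d = calloc(1, sizeof(*d));
    if (d != NULL)
        d->refcount = 1;
    return d;
}

/* Drop one reference and free the dictionary when none remain. */
void demo_dict_unref(demo_dict_t *d)
{
    assert(d != NULL && d->refcount > 0);
    if (--d->refcount == 0)
        free(d);
}

/*
 * Handle a "friend removed" notification.
 * Returns 1 if the notification was processed, 0 if it was skipped
 * because the context no longer holds a dictionary.
 */
int demo_remove_notify(demo_peerctx_t *ctx)
{
    if (ctx == NULL || ctx->dict == NULL)
        return 0;               /* already handled: do not touch freed memory */

    demo_replies_sent++;        /* stands in for replying to the CLI */
    demo_dict_unref(ctx->dict);
    ctx->dict = NULL;           /* a second call now sees NULL, not a stale pointer */
    return 1;
}
```

Without the `ctx->dict = NULL` assignment, a second call would pass the freed pointer to the lookup path, which matches the shape of the backtrace in comment 1.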

Comment 3 Vijay Bellur 2016-03-20 13:35:39 UTC

REVIEW: http://review.gluster.org/13790 (glusterd/rpc : Discard duplicate Disconnect events) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 4 Mohammed Rafi KC 2016-03-22 10:32:08 UTC

The newly created rpc for the friend is cleared through friend_sm. This crash can only be hit if friend_sm takes longer than the reconnect.

Comment 5 Vijay Bellur 2016-03-22 19:25:04 UTC

COMMIT: http://review.gluster.org/13790 committed in master by Jeff Darcy (jdarcy) 
------
commit 1081584d4c2d26e56fea623ecfadd305c6e3d3bc
Author: Atin Mukherjee <amukherj>
Date:   Sun Mar 20 18:31:00 2016 +0530

    glusterd/rpc : Discard duplicate Disconnect events
    
    If a peer rpc disconnect event has been already processed, skip the furthers as
    processing them are overheads and sometimes may lead to a crash like due to a
    double free
    
    Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
    BUG: 1318546
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13790
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
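
The idea in the commit message, skipping a disconnect event once one has already been processed, can be sketched as follows. This is not the actual patch: the type demo_conn_t, the enum values, and the function demo_conn_notify are hypothetical, and the sketch assumes a per-connection flag that is set on the first disconnect and reset on the next successful connect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event and connection types; not the real rpc-clnt API. */
typedef enum { DEMO_EVENT_CONNECT, DEMO_EVENT_DISCONNECT } demo_event_t;

typedef struct {
    bool disconnect_processed;  /* true once a disconnect has been handled */
    int  disconnects_handled;   /* how many disconnects were acted upon */
} demo_conn_t;

/*
 * Deliver an event to the connection.
 * Returns true if the event was processed, false if it was discarded
 * as a duplicate disconnect.
 */
bool demo_conn_notify(demo_conn_t *conn, demo_event_t event)
{
    assert(conn != NULL);

    switch (event) {
    case DEMO_EVENT_CONNECT:
        /* A fresh connection makes the next disconnect meaningful again. */
        conn->disconnect_processed = false;
        return true;

    case DEMO_EVENT_DISCONNECT:
        if (conn->disconnect_processed)
            return false;       /* duplicate: skip cleanup entirely */
        conn->disconnect_processed = true;
        conn->disconnects_handled++;
        return true;
    }
    return false;
}
```

With this guard, a failed probe followed by repeated reconnect failures would run the cleanup path once, and later disconnects would be dropped until a connect succeeds.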

Comment 6 Niels de Vos 2016-06-16 14:01:05 UTC

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
