Description of problem:
If a peer probe command fails because of an unresolvable IP, and a gluster volume stop command is run immediately afterwards, the two together can crash glusterd.

Version-Release number of selected component (if applicable): mainline

How reproducible: 50%

Steps to Reproduce:
1. Create a volume.
2. Do a peer probe on an invalid IP (e.g. a.b.c.d).
3. Stop the volume.
4. Or run the test ./tests/bugs/glusterfs/bug-879490.t from the gluster source.

Actual results: glusterd crashed

Expected results: glusterd should not crash

Additional info:
(gdb) bt
#0  0x00007f5c0e5c9fc6 in dict_lookup_common (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str") at dict.c:287
#1  0x00007f5c0e5cc6d1 in dict_get_with_ref (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", data=0x7f5c01c5f1c0) at dict.c:1397
#2  0x00007f5c0e5cdae2 in dict_get_str (this=0x7f5bec004e3c, key=0x7f5c04d56970 "cmd-str", str=0x7f5c01c5f238) at dict.c:2139
#3  0x00007f5c04c4f901 in glusterd_xfer_cli_probe_resp (req=0x7f5bf800093c, op_ret=-1, op_errno=107, op_errstr=0x0, hostname=0x7f5bec09ca70 "a.b.c.d", port=24007, dict=0x7f5bec004e3c) at glusterd-handler.c:3944
#4  0x00007f5c04c53107 in glusterd_friend_remove_notify (peerctx=0x7f5bec003de0, op_errno=107) at glusterd-handler.c:5080
#5  0x00007f5c04c537c5 in __glusterd_peer_rpc_notify (rpc=0x7f5bec004310, mydata=0x7f5bec003de0, event=RPC_CLNT_DISCONNECT, data=0x0) at glusterd-handler.c:5210
#6  0x00007f5c04c44238 in cds_list_add_tail_rcu (newp=0x7f5bec004310, head=0x7f5bec003de0) at ../../../../contrib/userspace-rcu/rculist-extra.h:36
#7  0x00007f5c04c538aa in __glusterd_peer_rpc_notify (rpc=0x7f5c01c60700, mydata=0x7f5c01c60700, event=RPC_CLNT_CONNECT, data=0x1c5fbf0) at glusterd-handler.c:5234
#8  0x00007f5c0e39c5a4 in rpc_clnt_notify (trans=0x7f5bec09e520, mydata=0x7f5bec004340, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-clnt.c:867
#9  0x00007f5c0e398ac9 in rpc_transport_notify (this=0x7f5bec09e520, event=RPC_TRANSPORT_DISCONNECT, data=0x7f5bec09e520) at rpc-transport.c:541
#10 0x00007f5c03cb34f6 in socket_connect_error_cbk (opaque=0x7f5bf0000b90) at socket.c:2814
#11 0x0000003400607ee5 in start_thread (arg=0x7f5c01c60700) at pthread_create.c:309
#12 0x00000034002f4d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
glusterd_friend_remove_notify was called twice, which resulted in accessing an already freed dictionary stored in peerctx.args. I guess glusterd_friend_remove_notify was called a second time as part of the automatic timer-based reconnect logic. Since the reconnect will always fail, this can cause a crash.
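To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the real glusterd code: dict_t, peerctx_t and friend_remove_notify below are simplified stand-ins for the actual structures, and the program deliberately reproduces the use-after-free so the crash path is visible.

/* Simplified illustration (not glusterd source) of a disconnect handler
 * that frees the dict stored in the peer context and is then invoked a
 * second time by the reconnect path. */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for glusterd's dict and peer context types. */
typedef struct dict    { char   *cmd_str; } dict_t;
typedef struct peerctx { dict_t *args;    } peerctx_t;

static void friend_remove_notify(peerctx_t *peerctx)
{
        /* Reply to the CLI using peerctx->args, then free it. */
        if (peerctx->args) {
                printf("probe failed for cmd: %s\n", peerctx->args->cmd_str);
                free(peerctx->args);
                /* BUG being illustrated: the pointer is not reset, so a
                 * duplicate disconnect event dereferences freed memory,
                 * as seen in frames #0-#4 of the backtrace above. */
                /* peerctx->args = NULL;   <- the kind of guard that is missing */
        }
}

int main(void)
{
        peerctx_t peerctx = { .args = malloc(sizeof(dict_t)) };
        peerctx.args->cmd_str = "volume stop";

        friend_remove_notify(&peerctx);  /* disconnect from the failed probe    */
        friend_remove_notify(&peerctx);  /* duplicate event from the reconnect:
                                          * reads the freed dict and can crash  */
        return 0;
}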
REVIEW: http://review.gluster.org/13790 (glusterd/rpc : Discard duplicate Disconnect events) posted (#1) for review on master by Atin Mukherjee (amukherj)
The newly created rpc for the friend will be cleared through friend_sm. This crash can only hit if friend_sm takes longer than the reconnect.
COMMIT: http://review.gluster.org/13790 committed in master by Jeff Darcy (jdarcy)
------
commit 1081584d4c2d26e56fea623ecfadd305c6e3d3bc
Author: Atin Mukherjee <amukherj>
Date:   Sun Mar 20 18:31:00 2016 +0530

    glusterd/rpc : Discard duplicate Disconnect events

    If a peer rpc disconnect event has been already processed, skip the
    furthers as processing them are overheads and sometimes may lead to
    a crash like due to a double free

    Change-Id: Iec589ce85daf28fd5b267cb6fc82a4238e0e8adc
    BUG: 1318546
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13790
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
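The idea behind "discard duplicate Disconnect events" can be sketched as follows. This is an illustrative outline, not the actual patch: the names peer_rpc_event_t, peer_ctx_t and peer_rpc_event_should_process are hypothetical. The point is simply to remember the last rpc event handled for a peer and drop a DISCONNECT that has already been processed, so the cleanup path (and the dict tear-down it triggers) runs at most once.

/* Hedged sketch of the fix's approach; names are illustrative only. */
#include <stdio.h>

typedef enum {
        PEER_RPC_EVENT_NONE = 0,
        PEER_RPC_EVENT_CONNECT,
        PEER_RPC_EVENT_DISCONNECT,
} peer_rpc_event_t;

typedef struct peer_ctx {
        peer_rpc_event_t last_event;
        /* ... other per-peer state ... */
} peer_ctx_t;

/* Returns 1 if the event should be processed, 0 if it is a duplicate. */
static int peer_rpc_event_should_process(peer_ctx_t *ctx, peer_rpc_event_t event)
{
        if (event == PEER_RPC_EVENT_DISCONNECT &&
            ctx->last_event == PEER_RPC_EVENT_DISCONNECT)
                return 0;            /* already disconnected: discard duplicate */

        ctx->last_event = event;     /* record the event we are about to handle */
        return 1;
}

int main(void)
{
        peer_ctx_t ctx = { PEER_RPC_EVENT_NONE };

        /* First disconnect is processed, the duplicate one is discarded. */
        printf("%d\n", peer_rpc_event_should_process(&ctx, PEER_RPC_EVENT_DISCONNECT)); /* 1 */
        printf("%d\n", peer_rpc_event_should_process(&ctx, PEER_RPC_EVENT_DISCONNECT)); /* 0 */
        return 0;
}

With such a check in place, a notify handler would bail out before calling the disconnect cleanup a second time, which removes the double invocation of glusterd_friend_remove_notify described above.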
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailinglists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user