Description of problem:
When adding a peer to my cluster, which manages disperse volumes of type 2 + 1, the peer appears to be in some kind of bad state that causes the self-heal daemons on every node in the cluster to crash as soon as it joins.

Version-Release number of selected component (if applicable):
[root@codex01 ~]# gluster --version
glusterfs 4.0.1

How reproducible:
With my current cluster state, every time.

Steps to Reproduce:
1. Starting with glusterd off and all volumes stopped, start glusterd on _two_ nodes.

2. State is now:

[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Self-heal Daemon on localhost               N/A       N/A        Y       16873
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        Y       14051
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070

[root@codex01 ~]# gluster v info knox
Volume Name: knox
Type: Disperse
Volume ID: bd295812-4a07-482f-9329-4cafbdf0ad28
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: codex01:/srv/storage/disperse-2_1
Brick2: codex02:/srv/storage/disperse-2_1
Brick3: codex03:/srv/storage/disperse-2_1-fixed
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on

3. Start glusterd on the third, potentially bad node:

ssh root@codex03 systemctl start glusterd

4. This causes the self-heal daemon to crash on all nodes, with the following in /var/log/glusterfs/glustershd.log:

[2018-05-10 18:15:07.811324] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812089] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812822] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-knox-client-2: changing port to 49152 (from 0)
[2018-05-10 18:15:07.820841] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.821667] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.825170] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-knox-client-2: Connected to knox-client-2, attached to remote volume '/srv/storage/disperse-2_1-fixed'.
[2018-05-10 18:15:07.825458] W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace." repeated 50 times between [2018-05-10 18:15:07.825458] and [2018-05-10 18:15:07.925122]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2018-05-10 18:15:07
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.1
---------

5. New volume status:

[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Brick codex03:/srv/storage/disperse-2_1-fixed  49152  0          Y       29985
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        N       N/A
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
Self-heal Daemon on codex03                 N/A       N/A        N       N/A
Bitrot Daemon on codex03                    N/A       N/A        Y       30033
Scrubber Daemon on codex03                  N/A       N/A        Y       30040

Task Status of Volume knox
------------------------------------------------------------------------------
There are no active volume tasks

Actual results:
Self-heal daemon crashes.

Expected results:
Self-heal daemon shouldn't crash.

Additional info:
I understand that this may be hard to reproduce, as it's likely some sort of bad state codex03 got into, but I didn't want to blow away the cluster in case this is a state I couldn't manage to reproduce again. This occurs with _any_ volume - the self-heal daemons survive until this particular node joins, at which point they all come down. I was directed here from IRC, but if this belongs more correctly on the mailing list, I'm happy to move it over there.
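
If a full backtrace from the abort (signal 6) would help, I can try to pull one from the core dump. Roughly what I'd run, assuming systemd-coredump is catching cores on these hosts (the exact entry/PID is whatever coredumpctl lists for the crashed self-heal daemon process):

coredumpctl list                   # locate the crashed self-heal daemon core
coredumpctl gdb <pid-or-match>     # open that core in gdb
(gdb) thread apply all bt full     # dump a full backtrace from all threads

Happy to attach that output, the full glustershd.log, or anything else that's useful.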
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result, this bug is being closed. If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and mark the Version field appropriately.