Bug 1576927

Summary: Self heal daemon crash triggered by particular host
Product: GlusterFS [Community]
Component: selfheal
Version: 4.0
Hardware: armv7l
OS: Linux
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Reporter: langlois.tyler
Assignee: bugs <bugs>
CC: bugs
Type: Bug
Last Closed: 2018-06-20 18:26:50 UTC

Description langlois.tyler 2018-05-10 18:21:06 UTC
Description of problem:

When a particular peer joins my cluster, which manages disperse volumes of type 2 + 1 (two data bricks plus one redundancy brick), some kind of bad state on that peer causes the self-heal daemons to crash across the entire cluster.

Version-Release number of selected component (if applicable):

[root@codex01 ~]# gluster --version
glusterfs 4.0.1

How reproducible:

With my current cluster state, every time.

Steps to Reproduce:

1. Starting with glusterd off and all volumes stopped, start glusterd on _two_ nodes.
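(A minimal sketch of this step, assuming systemd-managed glusterd on each node; the volume is started again afterwards, since the status below shows it online. --mode=script just suppresses the CLI's confirmation prompt:)

[root@codex01 ~]# gluster --mode=script volume stop knox
[root@codex01 ~]# for h in codex01 codex02 codex03; do ssh root@$h systemctl stop glusterd; done
[root@codex01 ~]# systemctl start glusterd
[root@codex01 ~]# ssh root@codex02 systemctl start glusterd
[root@codex01 ~]# gluster volume start knox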

2. State is now:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Self-heal Daemon on localhost               N/A       N/A        Y       16873
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        Y       14051
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
[root@codex01 ~]# gluster v info knox
Volume Name: knox
Type: Disperse
Volume ID: bd295812-4a07-482f-9329-4cafbdf0ad28
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: codex01:/srv/storage/disperse-2_1
Brick2: codex02:/srv/storage/disperse-2_1
Brick3: codex03:/srv/storage/disperse-2_1-fixed
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on


3. Start glusterd on the third, potentially bad node: ssh root@codex03 systemctl start glusterd

4. This causes the self-heal daemon to crash on all nodes with the following in /var/log/glusterfs/glustershd.log:
[2018-05-10 18:15:07.811324] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812089] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812822] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-knox-client-2: changing port to 49152 (from 0)
[2018-05-10 18:15:07.820841] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.821667] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.825170] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-knox-client-2: Connected to knox-client-2, attached to remote volume '/srv/storage/disperse-2_1-fixed'.
[2018-05-10 18:15:07.825458] W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace." repeated 50 times between [2018-05-10 18:15:07.825458] and [2018-05-10 18:15:07.925122]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2018-05-10 18:15:07
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.1
---------
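(Not part of the reproduction, but for triage a full backtrace can be pulled out of the shd core dump; a sketch assuming systemd-coredump is collecting dumps on these hosts, where <match> is a hypothetical placeholder for the PID or executable name that coredumpctl reports:)

[root@codex01 ~]# coredumpctl list | grep gluster     # locate the crashed self-heal daemon dump
[root@codex01 ~]# coredumpctl gdb <match>             # open that dump in gdb
(gdb) thread apply all bt full                        # full backtrace of every thread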

5. New volume status:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Brick codex03:/srv/storage/disperse-2_1-fixed  49152     0          Y       29985
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        N       N/A
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
Self-heal Daemon on codex03                 N/A       N/A        N       N/A
Bitrot Daemon on codex03                    N/A       N/A        Y       30033
Scrubber Daemon on codex03                  N/A       N/A        Y       30040

Task Status of Volume knox
------------------------------------------------------------------------------
There are no active volume tasks
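(As a possible stopgap, the dead self-heal daemons can usually be respawned with a force start, which restarts missing volume daemons without restarting bricks that are already online, though in this cluster's state they may well crash again:)

[root@codex01 ~]# gluster volume start knox force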


Actual results:

Self-heal daemon crashes

Expected results:

Self-heal daemon shouldn't crash

Additional info:

I understand that this may be hard to reproduce, as it is likely some sort of bad state that codex03 got into, but I didn't want to blow away the cluster in case it captures a particular failure mode I couldn't manage to reproduce again. This occurs with _any_ volume: the self-heal daemons survive until that particular node joins, at which point they all come down.
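(If it helps narrow down what state codex03 is in, one way to capture its view of the cluster for comparison against a healthy node, assuming the standard CLI; gluster get-state prints the path of the state file it writes:)

[root@codex03 ~]# gluster peer status     # codex03's view of the peer list
[root@codex03 ~]# gluster get-state       # dump glusterd's local state to a file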

I was directed here from IRC, but if this belongs more correctly in the mailing list, I'm happy to move it over there.

Comment 1 Shyamsundar 2018-06-20 18:26:50 UTC
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result this bug is being closed.

If the bug persists on a maintained version of gluster or against the mainline gluster repository, request that it be reopened and the Version field be marked appropriately.