Bug 1576927 - Self heal daemon crash triggered by particular host
Summary: Self heal daemon crash triggered by particular host
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: selfheal
Version: 4.0
Hardware: armv7l
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-10 18:21 UTC by langlois.tyler
Modified: 2018-06-20 18:26 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-20 18:26:50 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description langlois.tyler 2018-05-10 18:21:06 UTC
Description of problem:

When adding a particular peer to my cluster, which manages disperse volumes of type 2 + 1, some kind of bad state causes the self-heal daemons across the cluster to crash as soon as that peer joins.
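
For reference, the volumes are laid out as 1 x (2 + 1) disperse sets; a create command along these lines is roughly how "knox" was set up (this is my reconstruction from the volume info further down, not a paste from shell history):

[root@codex01 ~]# gluster volume create knox disperse 3 redundancy 1 \
    codex01:/srv/storage/disperse-2_1 \
    codex02:/srv/storage/disperse-2_1 \
    codex03:/srv/storage/disperse-2_1-fixed
[root@codex01 ~]# gluster volume start knox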

Version-Release number of selected component (if applicable):

[root@codex01 ~]# gluster --version
glusterfs 4.0.1

How reproducible:

With my current cluster state, every time.

Steps to Reproduce:

1. Starting with glusterd off and all volumes stopped, start glusterd on _two_ nodes.

2. State is now:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Self-heal Daemon on localhost               N/A       N/A        Y       16873
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        Y       14051
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
[root@codex01 ~]# gluster v info knox
Volume Name: knox
Type: Disperse
Volume ID: bd295812-4a07-482f-9329-4cafbdf0ad28
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: codex01:/srv/storage/disperse-2_1
Brick2: codex02:/srv/storage/disperse-2_1
Brick3: codex03:/srv/storage/disperse-2_1-fixed
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on


3. Start glusterd on the third, potentially bad node: ssh root@codex03 systemctl start glusterd

4. This causes the self-heal daemon to crash on all nodes with the following in /var/log/glusterfs/glustershd.log:
[2018-05-10 18:15:07.811324] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812089] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.812822] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-knox-client-2: changing port to 49152 (from 0)
[2018-05-10 18:15:07.820841] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.821667] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-knox-client-2: error returned while attempting to connect to host:(null), port:0
[2018-05-10 18:15:07.825170] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-knox-client-2: Connected to knox-client-2, attached to remote volume '/srv/storage/disperse-2_1-fixed'.
[2018-05-10 18:15:07.825458] W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the backtrace." repeated 50 times between [2018-05-10 18:15:07.825458] and [2018-05-10 18:15:07.925122]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2018-05-10 18:15:07
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.1
---------

5. New volume status:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Brick codex03:/srv/storage/disperse-2_1-fixed 49152     0          Y       29985
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        N       N/A
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
Self-heal Daemon on codex03                 N/A       N/A        N       N/A
Bitrot Daemon on codex03                    N/A       N/A        Y       30033
Scrubber Daemon on codex03                  N/A       N/A        Y       30040

Task Status of Volume knox
------------------------------------------------------------------------------
There are no active volume tasks


Actual results:

Self-heal daemon crashes

Expected results:

Self-heal daemon shouldn't crash

Additional info:

I understand that this may be hard to reproduce, as it is likely down to some sort of bad state codex03 has gotten into, but I didn't want to blow away the cluster in case this is a corner case I can't manage to reproduce again. This happens with _any_ volume: the self-heal daemons run fine until that particular node joins, at which point they go down across the whole cluster.
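
Since the daemon dies with signal 6, I can try to pull a full backtrace out of the core dump if that would help. Roughly something like this (assuming systemd-coredump is catching cores on these hosts and glusterfs debug symbols are installed; otherwise I'd point gdb at the raw core file directly):

[root@codex01 ~]# coredumpctl list /usr/sbin/glusterfs
[root@codex01 ~]# coredumpctl gdb /usr/sbin/glusterfs
(gdb) thread apply all bt

The self-heal daemon runs as the glusterfs binary, so that's the executable to match on.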

I was directed here from IRC, but if this belongs more correctly in the mailing list, I'm happy to move it over there.
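
In case it helps narrow down what the bad state on codex03 might be, I can also compare basic glusterd state between codex03 and the healthy nodes, for example (just my guess at what's relevant, these are the standard op-version/peer checks):

[root@codex03 ~]# gluster volume get all cluster.op-version
[root@codex03 ~]# cat /var/lib/glusterd/glusterd.info
[root@codex03 ~]# gluster peer status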

Comment 1 Shyamsundar 2018-06-20 18:26:50 UTC
This bug is reported against a version of Gluster that is no longer maintained (it has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained.

As a result, this bug is being closed.

If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and set the Version field appropriately.

