Bug 1785323 - glusterfsd crashes after a few seconds
Summary: glusterfsd crashes after a few seconds
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 7
Hardware: armv7l
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1785611
Blocks:
 
Reported: 2019-12-19 17:20 UTC by Robin van Oosten
Modified: 2020-03-16 08:21 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1785611 (view as bug list)
Environment:
Last Closed: 2020-03-12 14:48:07 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
log files (3.61 MB, application/gzip)
2019-12-19 17:20 UTC, Robin van Oosten
gdb glusterfsd core (12.79 KB, text/plain)
2019-12-19 20:35 UTC, Robin van Oosten
gdb glusterfsd core (21.46 KB, text/plain)
2019-12-19 22:35 UTC, Robin van Oosten
gdb glusterfs core (12.74 KB, text/plain)
2019-12-22 16:43 UTC, Robin van Oosten
gdb glusterfsd core (13.61 KB, text/plain)
2020-01-13 19:57 UTC, Robin van Oosten


Links
System: Gluster.org Gerrit  ID: 24207  Status: Merged  Summary: multiple: fix bad type cast  Last Updated: 2020-03-16 08:21:31 UTC

Description Robin van Oosten 2019-12-19 17:20:39 UTC
Created attachment 1646633 [details]
log files

Description of problem:
glusterfsd crashes after a few seconds

How reproducible:
After running the command "gluster volume start gv0 force", glusterfsd is started but crashes after a few seconds.
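
A minimal reproduction sketch (the brick log file name under /var/log/glusterfs/bricks/ is an assumption derived from the brick path and may differ):

gluster volume start gv0 force           # glusterd spawns the brick process
gluster volume status gv0                # within a few seconds the brick shows Online: N again
tail -n 50 /var/log/glusterfs/bricks/data-brick1-gv0.log   # crash dump at the end of the brick log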

Additional info:

OS:		Armbian 5.95 Odroidxu4 Ubuntu bionic default
Kernel:		Linux 4.14.141
Build date:	02.09.2019
Gluster:	7.0
Hardware:	node1 - node4:	Odroid HC2 + WD RED 10TB
		node5:		Odroid HC2 + Samsung SSD 850 EVO 250GB

root@hc2-1:~# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-12-19 13:32:41 CET; 1s ago
     Docs: man:glusterd(8)
  Process: 12734 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, s
 Main PID: 12735 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─12735 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─12772 /usr/sbin/glusterfsd -s hc2-1 --volfile-id gv0.hc2-1.data-brick1-gv0 -p /var/run/gluster/vols/gv0/hc2-1-data
           └─12794 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/gl

Dec 19 13:32:37 hc2-1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Dec 19 13:32:41 hc2-1 systemd[1]: Started GlusterFS, a clustered file-system server.


root@hc2-1:~# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-12-19 13:32:41 CET; 15s ago
     Docs: man:glusterd(8)
  Process: 12734 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, s
 Main PID: 12735 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─12735 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─12794 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/gl

Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: dlfcn 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: libpthread 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: llistxattr 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: setfsid 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: spinlock 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: epoll.h 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: xattr.h 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: st_atim.tv_nsec 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: package-string: glusterfs 7.0
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: ---------
root@hc2-1:~# 


root@hc2-9:~# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hc2-1:/data/brick1/gv0                N/A       N/A        N       N/A  
Brick hc2-2:/data/brick1/gv0                49152     0          Y       1322 
Brick hc2-5:/data/brick1/gv0                49152     0          Y       1767 
Brick hc2-3:/data/brick1/gv0                49152     0          Y       1474 
Brick hc2-4:/data/brick1/gv0                49152     0          Y       1472 
Brick hc2-5:/data/brick2/gv0                49153     0          Y       1787 
Self-heal Daemon on localhost               N/A       N/A        Y       1314 
Self-heal Daemon on hc2-5                   N/A       N/A        Y       1808 
Self-heal Daemon on hc2-3                   N/A       N/A        Y       1485 
Self-heal Daemon on hc2-4                   N/A       N/A        Y       1486 
Self-heal Daemon on hc2-1                   N/A       N/A        Y       13522
Self-heal Daemon on hc2-2                   N/A       N/A        Y       1348 
 
Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks



root@hc2-9:~# gluster volume heal gv0 info summary
Brick hc2-1:/data/brick1/gv0
Status: Transport endpoint is not connected
Total Number of entries: -
Number of entries in heal pending: -
Number of entries in split-brain: -
Number of entries possibly healing: -

Brick hc2-2:/data/brick1/gv0
Status: Connected
Total Number of entries: 977
Number of entries in heal pending: 977
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-5:/data/brick1/gv0
Status: Connected
Total Number of entries: 977
Number of entries in heal pending: 977
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-3:/data/brick1/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-4:/data/brick1/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-5:/data/brick2/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0


root@hc2-9:~# gluster volume info
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 9fcb6792-3899-4802-828f-84f37c026881
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: hc2-1:/data/brick1/gv0
Brick2: hc2-2:/data/brick1/gv0
Brick3: hc2-5:/data/brick1/gv0 (arbiter)
Brick4: hc2-3:/data/brick1/gv0
Brick5: hc2-4:/data/brick1/gv0
Brick6: hc2-5:/data/brick2/gv0 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet

Comment 1 Xavi Hernandez 2019-12-19 18:22:54 UTC
Currently I can't test this on an ARM machine. Could you open the core dump in gdb with symbols loaded and run this command to get some information about the reason for the crash?

(gdb) t a a bt
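
For reference, a sketch of the full session (assuming the core file was dumped to /core, as in the reporter's transcript below; "t a a bt" is gdb shorthand for "thread apply all bt"):

gdb /usr/sbin/glusterfsd /core
(gdb) thread apply all bt
(gdb) quit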

Comment 2 Robin van Oosten 2019-12-19 19:23:47 UTC
I can open the coredump with gdb but where do I find the symbols file?

gdb /usr/sbin/glusterfs /core
.
.
.
Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done.

Comment 3 Robin van Oosten 2019-12-19 20:35:15 UTC
Created attachment 1646676 [details]
gdb glusterfsd core

Comment 4 Robin van Oosten 2019-12-19 20:39:17 UTC
After "apt install glusterfs-dbg" I was able to load the symbols file.

Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/.build-id/31/453c4877ad5c7f1a2553147feb1c0816f67654.debug...done.

See attachment 1646676 [details].

Comment 5 Xavi Hernandez 2019-12-19 21:08:39 UTC
You will also need to install the debug symbols for libc, because gdb doesn't seem able to correctly decode the backtraces inside that library.
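
On Ubuntu bionic that would be something like the following (package names as they appear elsewhere in this report):

apt install libc6-dbg glusterfs-dbg
gdb /usr/sbin/glusterfsd /core
(gdb) thread apply all bt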

Comment 6 Robin van Oosten 2019-12-19 22:35:19 UTC
Created attachment 1646687 [details]
gdb glusterfsd core

Installed libc6-dbg now.

Comment 7 Xavi Hernandez 2019-12-20 13:28:01 UTC
Thanks for all the information. I think I've identified the issue and have uploaded a patch [1] to solve it. If you could review and/or test it, that would be great.

Once it is merged into the master branch, I'll backport it to the release-7 branch so that it will be available in the next release.

[1] https://review.gluster.org/c/glusterfs/+/23912
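
For readers wondering why a bad type cast would only crash on armv7l, the illustration below is hypothetical and is not taken from the patch; it only shows the general class of bug the patch title refers to: writing through a pointer that was cast to a wider type. On a 64-bit build the extra bytes often land in padding and go unnoticed; on a 32-bit platform they overwrite adjacent memory, which can surface as a delayed crash.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: stores a 64-bit value through a generic pointer. */
static void store_value(void *dst, uint64_t value)
{
    /* Bad cast: nothing guarantees that dst points to 8 bytes of storage. */
    *(uint64_t *)dst = value;
}

int main(void)
{
    uint32_t count = 0;          /* only 4 bytes of storage */
    uint32_t neighbour = 0x1234; /* may sit right next to it on the stack */

    /* Undefined behaviour: an 8-byte store into a 4-byte object. On many
     * 64-bit builds this happens to look harmless; on a 32-bit build it can
     * clobber the neighbouring variable, saved registers or a pointer. */
    store_value(&count, 42);

    printf("count=%" PRIu32 " neighbour=0x%" PRIX32 "\n", count, neighbour);
    return 0;
}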

Comment 8 Robin van Oosten 2019-12-20 18:56:36 UTC
Thanks for the patch. I have never installed Gluster from source but will have a go at it.
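
For anyone in the same position, a rough sketch of building GlusterFS from source with a Gerrit change applied (the fetch ref and patchset number are assumptions; the exact ref is shown in the download box on the review page):

git clone https://github.com/gluster/glusterfs.git
cd glusterfs
# Gerrit ref format: refs/changes/<last two digits>/<change>/<patchset>; the patchset number may differ
git fetch https://review.gluster.org/glusterfs refs/changes/12/23912/1
git checkout FETCH_HEAD
# requires the usual autotools toolchain and the GlusterFS build dependencies
./autogen.sh
./configure
make -j4
make install   # installs under /usr/local by default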

Comment 9 jmilette 2019-12-21 22:38:19 UTC
Just adding that I have the same issue. I was able to compile glusterfs with the included patch, but I couldn't get the peers to probe with the new glusterfs version, so I couldn't test it. Using glusterfs 6.6 right now.

Comment 10 Robin van Oosten 2019-12-22 16:43:07 UTC
Created attachment 1647181 [details]
gdb glusterfs core

I was able to compile from source with the patch. Now the bricks stay up, but the Self-heal Daemon is crashing. Node hc2-9 is still on version 7.0; all other nodes are on version 8dev with the patch.

root@hc2-1:~# gluster --version
glusterfs 8dev

root@hc2-9:~# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hc2-1:/data/brick1/gv0                49152     0          Y       1564 
Brick hc2-2:/data/brick1/gv0                49152     0          Y       1909 
Brick hc2-5:/data/brick1/gv0                49152     0          Y       2067 
Brick hc2-3:/data/brick1/gv0                49152     0          Y       1780 
Brick hc2-4:/data/brick1/gv0                49152     0          Y       2071 
Brick hc2-5:/data/brick2/gv0                49153     0          Y       2074 
Self-heal Daemon on localhost               N/A       N/A        Y       1312 
Self-heal Daemon on hc2-5                   N/A       N/A        N       N/A  
Self-heal Daemon on hc2-4                   N/A       N/A        N       N/A  
Self-heal Daemon on hc2-3                   N/A       N/A        N       N/A  
Self-heal Daemon on hc2-2                   N/A       N/A        N       N/A  
Self-heal Daemon on hc2-1                   N/A       N/A        N       N/A  
 
Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Comment 11 Xavi Hernandez 2020-01-09 10:17:00 UTC
Good to know that one problem has been fixed. Can you provide the backtrace of the self-heal daemon? It might be a similar problem in another place.

Comment 12 Robin van Oosten 2020-01-11 17:02:43 UTC
Have you looked at attachment 1647181 [details]? Do you need more info?

Comment 13 Xavi Hernandez 2020-01-13 09:20:21 UTC
Thanks Robin.

I think it will be enough. It seems to be the same problem but in another place. Now I need to identify exactly where it is, but your backtrace seems enough to know where to start.

Comment 14 Xavi Hernandez 2020-01-13 10:55:45 UTC
Are you using IPv6?

Can you check the output of the following commands in gdb?

(gdb) thread 1
(gdb) frame 7
(gdb) print host
(gdb) print sin_family
(gdb) print server

Comment 15 Robin van Oosten 2020-01-13 18:53:28 UTC
Both IPv4 and IPv6 are enabled. DNS resolution is done via /etc/hosts files with only IPv4 addresses.

The info below is from a new trace file, so it may not be the info you asked for.

(gdb) thread 1
[Switching to thread 1 (Thread 0xb20fe700 (LWP 2613))]
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47	in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S
(gdb) frame 7
#7  0xb6f4ac52 in _gf_event (event=event@entry=EVENT_AFR_SUBVOLS_DOWN, fmt=0xb226fbcc "client-pid=%d; subvol=%s")
    at events.c:151
151	}
(gdb) print host
$1 = 0xb1705d30 ""
(gdb) print sin_family
$2 = <optimized out>
(gdb) print server
$3 = {sin_family = 10, sin_port = 51549, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
(gdb)

Comment 16 Robin van Oosten 2020-01-13 19:57:11 UTC
Created attachment 1651993 [details]
gdb glusterfsd core

Comment 17 Xavi Hernandez 2020-01-14 12:44:53 UTC
Apparently this new crash is fundamentally different from what caused this bug, so I opened a new bug, #1790870, to track it.

I'll upload a fix soon.

Comment 18 Robin van Oosten 2020-01-17 18:53:50 UTC
I tested after patch 24014 was merged, and glusterd does not start from systemd. (I don't mind for the moment and will look into that later.)

When started from the command line all is well; no more crashes.

Thanks for your support and patches.

Comment 19 Xavi Hernandez 2020-01-20 08:08:27 UTC
The patch shouldn't change anything that would prevent glusterd from starting. However, if you have compiled from source, it's possible that the install path is /usr/local/sbin instead of /usr/sbin. Maybe the systemd unit is still looking at /usr/sbin?

Most probably this issue will disappear once the fix is packaged through the regular channels.

Given that it seems to be working, may I close this bug? (If the systemd problem persists, you can open another bug to analyze it.)
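
One way to check this (a sketch, using only the paths already mentioned in this comment):

# which binary does the systemd unit launch?
systemctl cat glusterd.service | grep ExecStart
# where did the source build install its binary?
command -v glusterd
ls -l /usr/sbin/glusterd /usr/local/sbin/glusterd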

Comment 20 Robin van Oosten 2020-01-26 18:02:09 UTC
Yes, no problem if you close this bug. I plan to reinstall when Gluster 8.0 is released, or will you backport the patches to a 7.x version?

Comment 21 Xavi Hernandez 2020-01-27 07:38:08 UTC
Sorry. I forgot to backport it. I'll do so before closing the bug.

Comment 22 Worker Ant 2020-03-09 07:06:22 UTC
REVIEW: https://review.gluster.org/24207 (multiple: fix bad type cast) posted (#1) for review on release-7 by Xavi Hernandez

Comment 23 Worker Ant 2020-03-12 14:48:07 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/1063 and will be tracked there from now on. Visit the GitHub issue URL for further details.

Comment 24 Worker Ant 2020-03-16 08:21:32 UTC
REVIEW: https://review.gluster.org/24207 (multiple: fix bad type cast) merged (#2) on release-7 by Rinku Kothiya

