Created attachment 1646633 [details]
log files

Description of problem:
glusterfsd crashes after a few seconds.

How reproducible:
After the command "gluster volume start gv0 force", glusterfsd is started but crashes after a few seconds.

Additional info:
OS: Armbian 5.95 Odroidxu4 Ubuntu bionic default
Kernel: Linux 4.14.141
Build date: 02.09.2019
Gluster: 7.0
Hardware:
  node1 - node4: Odroid HC2 + WD RED 10TB
  node5: Odroid HC2 + Samsung SSD 850 EVO 250GB

root@hc2-1:~# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-12-19 13:32:41 CET; 1s ago
     Docs: man:glusterd(8)
  Process: 12734 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, s
 Main PID: 12735 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─12735 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─12772 /usr/sbin/glusterfsd -s hc2-1 --volfile-id gv0.hc2-1.data-brick1-gv0 -p /var/run/gluster/vols/gv0/hc2-1-data
           └─12794 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/gl

Dec 19 13:32:37 hc2-1 systemd[1]: Starting GlusterFS, a clustered file-system server...
Dec 19 13:32:41 hc2-1 systemd[1]: Started GlusterFS, a clustered file-system server.

root@hc2-1:~# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/lib/systemd/system/glusterd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-12-19 13:32:41 CET; 15s ago
     Docs: man:glusterd(8)
  Process: 12734 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, s
 Main PID: 12735 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─12735 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─12794 /usr/sbin/glusterfs -s localhost --volfile-id shd/gv0 -p /var/run/gluster/shd/gv0/gv0-shd.pid -l /var/log/gl

Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: dlfcn 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: libpthread 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: llistxattr 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: setfsid 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: spinlock 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: epoll.h 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: xattr.h 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: st_atim.tv_nsec 1
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: package-string: glusterfs 7.0
Dec 19 13:32:45 hc2-1 data-brick1-gv0[12772]: ---------
root@hc2-1:~#

root@hc2-9:~# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hc2-1:/data/brick1/gv0                N/A       N/A        N       N/A
Brick hc2-2:/data/brick1/gv0                49152     0          Y       1322
Brick hc2-5:/data/brick1/gv0                49152     0          Y       1767
Brick hc2-3:/data/brick1/gv0                49152     0          Y       1474
Brick hc2-4:/data/brick1/gv0                49152     0          Y       1472
Brick hc2-5:/data/brick2/gv0                49153     0          Y       1787
Self-heal Daemon on localhost               N/A       N/A        Y       1314
Self-heal Daemon on hc2-5                   N/A       N/A        Y       1808
Self-heal Daemon on hc2-3                   N/A       N/A        Y       1485
Self-heal Daemon on hc2-4                   N/A       N/A        Y       1486
Self-heal Daemon on hc2-1                   N/A       N/A        Y       13522
Self-heal Daemon on hc2-2                   N/A       N/A        Y       1348

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

root@hc2-9:~# gluster volume heal gv0 info summary
Brick hc2-1:/data/brick1/gv0
Status: Transport endpoint is not connected
Total Number of entries: -
Number of entries in heal pending: -
Number of entries in split-brain: -
Number of entries possibly healing: -

Brick hc2-2:/data/brick1/gv0
Status: Connected
Total Number of entries: 977
Number of entries in heal pending: 977
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-5:/data/brick1/gv0
Status: Connected
Total Number of entries: 977
Number of entries in heal pending: 977
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-3:/data/brick1/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-4:/data/brick1/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick hc2-5:/data/brick2/gv0
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

root@hc2-9:~# gluster volume info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 9fcb6792-3899-4802-828f-84f37c026881
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: hc2-1:/data/brick1/gv0
Brick2: hc2-2:/data/brick1/gv0
Brick3: hc2-5:/data/brick1/gv0 (arbiter)
Brick4: hc2-3:/data/brick1/gv0
Brick5: hc2-4:/data/brick1/gv0
Brick6: hc2-5:/data/brick2/gv0 (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
Currently I can't test this on an ARM machine. Could you open the core dump with gdb, with symbols loaded, and run the following command to get some information about the cause of the crash?

(gdb) t a a bt
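If it's more convenient, the same information can also be captured non-interactively. A rough sketch (the core file location is just an example; adjust it to wherever your system writes core dumps):

# Dump backtraces of all threads from the core in one shot
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "thread apply all bt full" > glusterfsd-backtrace.txt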
I can open the core dump with gdb, but where do I find the symbols file?

gdb /usr/sbin/glusterfs /core
. . .
Reading symbols from /usr/sbin/glusterfs...(no debugging symbols found)...done.
Created attachment 1646676 [details]
gdb glusterfsd core
After "apt install glusterfs-dbg" I was able to load the symbols file. Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/.build-id/31/453c4877ad5c7f1a2553147feb1c0816f67654.debug...done. See attachment 1646676 [details].
You will also need to install the debug symbols for libc, because gdb doesn't seem able to correctly decode the backtraces inside that library.
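On Ubuntu/Debian that should be something along the lines of:

# Debug symbols for glibc, so gdb can decode frames inside libc
apt install libc6-dbg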
Created attachment 1646687 [details]
gdb glusterfsd core

Installed libc6-dbg now.
Thanks for all the information. I think I've identified the issue, and I've uploaded a patch [1] to solve it. If you can review and/or test it, that would be great. Once it's merged into the master branch, I'll backport it to the release-7 branch so that it can be available in the next release.

[1] https://review.gluster.org/c/glusterfs/+/23912
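In case it helps with testing, here is a rough sketch of how the change could be fetched from Gerrit and built from source. The patch set number in the change ref is an assumption; the exact fetch command is shown on the change's "Download" tab:

# Clone the sources and fetch the change under review
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git fetch https://review.gluster.org/glusterfs refs/changes/12/23912/1
git checkout FETCH_HEAD

# Standard autotools build; note that "make install" defaults to /usr/local
./autogen.sh
./configure
make -j$(nproc)
sudo make install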
Thanks for the patch. I have never installed Gluster from source but will have a go at it.
Just adding that I have the same issue. I was able to compile glusterfs with the included patch, but I couldn't get the peers to probe with the new glusterfs version, so I couldn't test it. I'm using glusterfs 6.6 right now.
Created attachment 1647181 [details]
gdb glusterfs core

I was able to compile from source with the patch. Now the bricks stay up, but the Self-heal Daemon is crashing. Node hc2-9 is still on version 7.0; all other nodes are on version 8dev with the patch.

root@hc2-1:~# gluster --version
glusterfs 8dev

root@hc2-9:~# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hc2-1:/data/brick1/gv0                49152     0          Y       1564
Brick hc2-2:/data/brick1/gv0                49152     0          Y       1909
Brick hc2-5:/data/brick1/gv0                49152     0          Y       2067
Brick hc2-3:/data/brick1/gv0                49152     0          Y       1780
Brick hc2-4:/data/brick1/gv0                49152     0          Y       2071
Brick hc2-5:/data/brick2/gv0                49153     0          Y       2074
Self-heal Daemon on localhost               N/A       N/A        Y       1312
Self-heal Daemon on hc2-5                   N/A       N/A        N       N/A
Self-heal Daemon on hc2-4                   N/A       N/A        N       N/A
Self-heal Daemon on hc2-3                   N/A       N/A        N       N/A
Self-heal Daemon on hc2-2                   N/A       N/A        N       N/A
Self-heal Daemon on hc2-1                   N/A       N/A        N       N/A

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks
Good to know that one problem has been fixed. Can you provide the backtrace of the self-heal daemon? It might be a similar problem in another place.
Have you looked at attachment 1647181 [details]? Do you need more info?
Thanks, Robin. I think that will be enough. It seems to be the same problem, but in another place. Now I need to identify exactly where it is, but your backtrace seems enough to know where to start.
Are you using IPv6? Can you check the output of the following commands from gdb?

(gdb) thread 1
(gdb) frame 7
(gdb) print host
(gdb) print sin_family
(gdb) print server
Both IPv4 and IPv6 are enabled. DNS resolution is done via /etc/hosts files with only IPv4 addresses.

The info below is from a new trace file, so it may not be exactly the info you asked for.

(gdb) thread 1
[Switching to thread 1 (Thread 0xb20fe700 (LWP 2613))]
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47      in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S
(gdb) frame 7
#7  0xb6f4ac52 in _gf_event (event=event@entry=EVENT_AFR_SUBVOLS_DOWN, fmt=0xb226fbcc "client-pid=%d; subvol=%s") at events.c:151
151     }
(gdb) print host
$1 = 0xb1705d30 ""
(gdb) print sin_family
$2 = <optimized out>
(gdb) print server
$3 = {sin_family = 10, sin_port = 51549, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
(gdb)
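For reference, address family 10 is AF_INET6 on Linux, so the sockaddr printed above appears to hold an IPv6 address even though /etc/hosts only contains IPv4 entries. A quick way to see everything the resolver actually returns for a host (hostnames are just examples):

# List all addresses (IPv4 and IPv6) the resolver returns for a host
getent ahosts hc2-1
# localhost usually also resolves to ::1, which brings AF_INET6 into play
getent ahosts localhost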
Created attachment 1651993 [details]
gdb glusterfsd core
Apparently this new crash is fundamentally different from what caused this bug, so I opened a new bug #1790870 to track the new problem. I'll upload a fix soon.
I tested after patch 24014 was merged, and glusterd does not start from systemd. (I don't mind for the moment and will look into it later.) When started from the command line, all is well: no more crashes. Thanks for your support and the patches.
The patch shouldn't change anything that would prevent glusterd from starting. However, if you have compiled from source, it's possible that the install path is /usr/local/sbin instead of /usr/sbin. Maybe the systemd unit is still looking at /usr/sbin? Most probably this issue will disappear once the fix is packaged through the regular channels.

Given that it seems to be working, may I close this bug? (If the systemd problem persists, you can open another bug to analyze it.)
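A couple of quick checks along those lines (commands are purely illustrative):

# Compare where the systemd unit looks for glusterd with where the build installed it
grep ExecStart /lib/systemd/system/glusterd.service
command -v glusterd
ls -l /usr/sbin/glusterd /usr/local/sbin/glusterd

# Rebuilding with the distro layout keeps the existing unit file working
./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var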
Yes, no problem if you close this bug. I plan to reinstall when Gluster 8.0 is released, or will you backport the patches to a 7.x version?
Sorry. I forgot to backport it. I'll do so before closing the bug.
REVIEW: https://review.gluster.org/24207 (multiple: fix bad type cast) posted (#1) for review on release-7 by Xavi Hernandez
This bug is moved to https://github.com/gluster/glusterfs/issues/1063, and will be tracked there from now on. Visit the GitHub issue URL for further details.
REVIEW: https://review.gluster.org/24207 (multiple: fix bad type cast) merged (#2) on release-7 by Rinku Kothiya