Created attachment 1656556 [details]
Crash log

Description of problem:
The gluster volume becomes inaccessible ("Transport endpoint is not connected"), which seems to be due to glusterfsd crashing.

Version-Release number of selected component (if applicable):
7.2

How reproducible:
Happens regularly (about once a day), but I cannot figure out how to trigger it.

Additional info:
apport dump: https://drive.google.com/open?id=1zElM6I6HNE7V_WU_SQH5-emlPpdcRd6e
mnt-{volume}.log attached (my volume is called 'gfs')

My cluster is composed of 3 nodes:
mars:   192.168.4.132
venus:  192.168.5.196
saturn: 192.168.4.146

Each node has 2 bricks, with replica set to 3 (so 2 x 3). All bricks are on XFS, except for the 2 bricks on mars, which are on a single ZFS volume (the crash does not happen only on mars).

Extra:

~> sudo gluster peer status
Number of Peers: 2

Hostname: mars
Uuid: 53e473df-d8e9-4d0d-b753-ccfff5c5097c
State: Peer in Cluster (Connected)

Hostname: venus.sarbakaninc.local
Uuid: 4aa987f2-924b-4a2c-b441-ff1b0b1cbb86
State: Peer in Cluster (Connected)
Other names:
venus.sarbakaninc.local
venus

~> sudo gluster volume info

Volume Name: gfs
Type: Distributed-Replicate
Volume ID: 3f451b61-e48b-4be4-92ed-e509271d0284
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: saturn:/gluster/bricks/1/brick
Brick2: venus:/gluster/bricks/2/brick
Brick3: mars:/gluster/bricks/3/brick
Brick4: venus:/gluster/bricks/5/brick
Brick5: saturn:/gluster/bricks/6/brick
Brick6: mars:/gluster/bricks/4/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
server.event-threads: 4
changelog.changelog: on
geo-replication.ignore-pid-check: off
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
auth.allow: 192.168.5.222,192.168.5.196,192.168.4.132,192.168.4.133,192.168.5.195,192.168.4.146,192.168.5.55
performance.cache-size: 1GB
cluster.enable-shared-storage: disable

~> sudo gluster volume status
Status of volume: gfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick saturn:/gluster/bricks/1/brick        49152     0          Y       21533
Brick venus:/gluster/bricks/2/brick         49152     0          Y       4590
Brick mars:/gluster/bricks/3/brick          49154     0          Y       30419
Brick venus:/gluster/bricks/5/brick         49153     0          Y       4591
Brick saturn:/gluster/bricks/6/brick        49153     0          Y       21534
Brick mars:/gluster/bricks/4/brick          49155     0          Y       30447
Self-heal Daemon on localhost               N/A       N/A        Y       21564
Self-heal Daemon on venus.sarbakaninc.local N/A       N/A        Y       4610
Self-heal Daemon on mars                    N/A       N/A        Y       3640

Task Status of Volume gfs
------------------------------------------------------------------------------
There are no active volume tasks
I have not seen this bug reoccur since I removed the ZFS bricks and replaced them with XFS bricks (so that all bricks are on XFS).
(In reply to gagnon.pierluc from comment #1)
> I have not seen this bug reoccur since I removed the ZFS bricks and replaced
> them with XFS bricks (so that all bricks are on XFS).

Thank you for the update. I'll recommend that the maintainer/assignee close this report; it can be reopened if we see this happen again. To my knowledge, there is no specific focus on testing with ZFS as the underlying filesystem, and this is likely a topic that needs close attention if we are to make the ZFS experience better.
Sounds fair to me. I'd rather have Gluster not crash, obviously, but at the very least this might provide insight to others having a similar issue.
Closing based on comments #2 and #3. Please feel free to re-open if the crash occurs with XFS.
In a weird coincidence, the issue has re-occurred today. Re-opening. (For the record, this has re-occurred with all bricks on XFS.)
Can you attach gdb to the core file and share what it prints?

# gdb /usr/local/sbin-or-wherever-it-is-installed/glusterfs /path/to/core.file

Also share the backtrace of all the threads in the core:

(gdb) thread apply all bt

Also share the core file and the `uname -a` output of the machine if possible.
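For reference, the two steps above can also be combined into a single non-interactive gdb invocation; the paths below are illustrative placeholders, so substitute the actual binary and core locations on your system:

```shell
# Sketch only -- BIN and CORE are placeholders, adjust to your installation.
BIN=/path/to/glusterfs                 # the binary that produced the core
CORE=/path/to/core.file                # the collected core dump
gdb -batch -ex "thread apply all bt" "$BIN" "$CORE" > all-threads.txt
```

`-batch` makes gdb exit after running the `-ex` command, which is convenient for capturing the full backtrace to a file for attaching here.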
Created attachment 1664054 [details] gdb thread apply all bt
Created attachment 1664055 [details]
gdb core dump analysis

I've also tried with the glusterfsd binary since I was getting no symbols, with a similar result.
uname output:
Linux mars 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(sorry about the noise, I did not know each attachment would create a comment)

Core dump: https://drive.google.com/file/d/18g3FhIYj5BpvUUoJgDYyN2-KnBuWYOry/view?usp=sharing
(let me know if you prefer another way to share the file)
Kindly try to attach a core with gdb after installing the glusterfs-debug package.
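In case it helps, installing the debug symbols and re-running gdb might look roughly like the following; the package names are an assumption on my part and vary by distribution, so please verify against your repositories:

```shell
# Assumed package names -- check your distribution's repositories.
sudo apt-get install glusterfs-dbg        # hypothetical Debian/Ubuntu debug package
# or, on RPM-based systems:
# sudo yum install glusterfs-debuginfo

# Then attach gdb again so it can resolve symbols (paths are placeholders):
gdb /path/to/glusterfsd /path/to/core.file
```

With symbols available, `thread apply all bt` should show function names and line numbers instead of raw addresses.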
My apologies for the delay, an intermittent bug is hard to catch! Here's a core dump with glusterfs-debug installed: https://drive.google.com/open?id=1PcszgKX2AL-MH_U2gMLbFO4GKnVnvpPM

I'll attach the requested information from gdb separately.
Created attachment 1668657 [details] Thread apply all bt (with gluster-debug)
Created attachment 1668658 [details] GDB attach output (with gluster-debug)
This bug is moved to https://github.com/gluster/glusterfs/issues/875, and will be tracked there from now on. Visit GitHub issues URL for further details