Bug 1796609 - Random glusterfsd crashes
Summary: Random glusterfsd crashes
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-30 18:43 UTC by gagnon.pierluc
Modified: 2020-03-12 12:20 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:20:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Crash log (11.44 MB, text/plain)
2020-01-30 18:43 UTC, gagnon.pierluc
gdb thread apply all bt (15.05 KB, text/plain)
2020-02-19 14:29 UTC, gagnon.pierluc
gdb core dump analysis (1.70 KB, text/plain)
2020-02-19 14:30 UTC, gagnon.pierluc
Thread apply all bt (with gluster-debug) (24.66 KB, text/plain)
2020-03-09 13:01 UTC, gagnon.pierluc
GDB attach output (with gluster-debug) (1.91 KB, text/plain)
2020-03-09 13:02 UTC, gagnon.pierluc

Description gagnon.pierluc 2020-01-30 18:43:46 UTC
Created attachment 1656556 [details]
Crash log

Description of problem:
The Gluster volume becomes inaccessible ("Transport endpoint is not connected"), which appears to be caused by glusterfsd crashing.

Version-Release number of selected component (if applicable):
7.2

How reproducible:
Happens regularly (about once a day), but I cannot figure out how to trigger it.


Additional info:

apport dump: https://drive.google.com/open?id=1zElM6I6HNE7V_WU_SQH5-emlPpdcRd6e

mnt-{volume}.log attached (my volume is called 'gfs')

My cluster is composed of 3 nodes:
mars: 192.168.4.132
venus: 192.168.5.196
saturn: 192.168.4.146

Each node has 2 bricks, with replica count set to 3 (so a 2 x 3 distributed-replicate layout).

All bricks are on XFS, except for the 2 bricks on mars, which are on a single ZFS volume (note that the crash is not limited to mars).
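
For reference, a 2 x 3 layout like this one would be created roughly as sketched below (reconstructed from the volume info further down; not necessarily the exact commands I ran):

~> sudo gluster volume create gfs replica 3 \
     saturn:/gluster/bricks/1/brick venus:/gluster/bricks/2/brick mars:/gluster/bricks/3/brick \
     venus:/gluster/bricks/5/brick saturn:/gluster/bricks/6/brick mars:/gluster/bricks/4/brick
~> sudo gluster volume start gfs

With replica 3, each consecutive group of 3 bricks forms one replica set, giving 2 distribute subvolumes of 3 replicas each.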

Extra:

~> sudo gluster peer status
Number of Peers: 2

Hostname: mars
Uuid: 53e473df-d8e9-4d0d-b753-ccfff5c5097c
State: Peer in Cluster (Connected)

Hostname: venus.sarbakaninc.local
Uuid: 4aa987f2-924b-4a2c-b441-ff1b0b1cbb86
State: Peer in Cluster (Connected)
Other names:
venus.sarbakaninc.local
venus


~> sudo gluster volume info
 
Volume Name: gfs
Type: Distributed-Replicate
Volume ID: 3f451b61-e48b-4be4-92ed-e509271d0284
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: saturn:/gluster/bricks/1/brick
Brick2: venus:/gluster/bricks/2/brick
Brick3: mars:/gluster/bricks/3/brick
Brick4: venus:/gluster/bricks/5/brick
Brick5: saturn:/gluster/bricks/6/brick
Brick6: mars:/gluster/bricks/4/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
server.event-threads: 4
changelog.changelog: on
geo-replication.ignore-pid-check: off
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
auth.allow: 192.168.5.222,192.168.5.196,192.168.4.132,192.168.4.133,192.168.5.195,192.168.4.146,192.168.5.55
performance.cache-size: 1GB
cluster.enable-shared-storage: disable



~> sudo gluster volume status
Status of volume: gfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick saturn:/gluster/bricks/1/brick        49152     0          Y       21533
Brick venus:/gluster/bricks/2/brick         49152     0          Y       4590 
Brick mars:/gluster/bricks/3/brick          49154     0          Y       30419
Brick venus:/gluster/bricks/5/brick         49153     0          Y       4591 
Brick saturn:/gluster/bricks/6/brick        49153     0          Y       21534
Brick mars:/gluster/bricks/4/brick          49155     0          Y       30447
Self-heal Daemon on localhost               N/A       N/A        Y       21564
Self-heal Daemon on venus.sarbakaninc.local N/A       N/A        Y       4610 
Self-heal Daemon on mars                    N/A       N/A        Y       3640 
 
Task Status of Volume gfs
------------------------------------------------------------------------------
There are no active volume tasks

Comment 1 gagnon.pierluc 2020-02-07 16:24:12 UTC
I have not seen this bug reoccur since I removed the ZFS bricks and replaced them with XFS bricks (so that all bricks are on XFS).
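
For anyone with a similar setup: bricks can be migrated off ZFS one at a time with replace-brick, roughly as sketched here (the new XFS brick path is hypothetical, and this is an illustration rather than the exact commands I used):

~> sudo gluster volume replace-brick gfs mars:/gluster/bricks/3/brick mars:/gluster/bricks/3-xfs/brick commit force   # new path hypothetical
~> sudo gluster volume heal gfs info   # wait for pending heals to drain before moving the next brick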

Comment 2 sankarshan 2020-02-10 01:53:34 UTC
(In reply to gagnon.pierluc from comment #1)
> I have not seen this bug reoccur since I removed the ZFS bricks and replaced
> them with XFS bricks (so that all bricks are on XFS).

Thank you for the update. I'd recommend that the maintainer/assignee close this report; it can be reopened if we see this happen again. To my knowledge, there is no specific focus on testing ZFS as the underlying filesystem, and this is likely a topic that needs close attention if we are to make the ZFS experience better.

Comment 3 gagnon.pierluc 2020-02-11 14:36:56 UTC
Sounds fair to me.

I'd rather have Gluster not crash, obviously, but at the very least this might provide insight to others having a similar issue.

Comment 4 Ravishankar N 2020-02-18 06:54:10 UTC
Closing based on comments #2 and #3. Please feel free to re-open if the crash occurs with XFS.

Comment 5 gagnon.pierluc 2020-02-18 14:25:21 UTC
In a weird coincidence, the issue has re-occurred today. Re-opening.

(For the record, this re-occurred with all bricks on XFS.)

Comment 6 Ravishankar N 2020-02-19 04:47:32 UTC
Can you open the core file with gdb and share what it prints?
# gdb /usr/local/sbin-or-wherever-it-is-installed/glusterfs /path/to/core.file


Also share the backtrace of all the threads in the core:
(gdb) thread apply all bt

Also share the core file and the `uname -a` output of the machine if possible.
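
If it is easier, all of the above can be captured non-interactively in one go, something like this (a sketch; binary and core paths are placeholders):

# gdb -batch -ex 'bt' -ex 'thread apply all bt full' /path/to/glusterfs /path/to/core.file > backtrace.txt 2>&1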

Comment 7 gagnon.pierluc 2020-02-19 14:29:50 UTC
Created attachment 1664054 [details]
gdb thread apply all bt

Comment 8 gagnon.pierluc 2020-02-19 14:30:49 UTC
Created attachment 1664055 [details]
gdb core dump analysis

Since I was getting no symbols, I also tried with the glusterfsd binary, with a similar result.

Comment 9 gagnon.pierluc 2020-02-19 14:34:14 UTC
uname output:
  Linux mars 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

(sorry about the noise, I did not know each attachment would create a comment)

Core dump: https://drive.google.com/file/d/18g3FhIYj5BpvUUoJgDYyN2-KnBuWYOry/view?usp=sharing (let me know if you prefer another way to share the file)

Comment 10 Mohit Agrawal 2020-02-19 14:36:02 UTC
Kindly try to open the core with gdb again after installing the glusterfs-debug package.
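
For example (a sketch only; the exact package names are an assumption and vary by distribution):

# yum install glusterfs-debuginfo           # RPM-based distributions
# apt-get install glusterfs-server-dbgsym   # Ubuntu, assuming the ddebs.ubuntu.com repository is enabled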

Comment 11 gagnon.pierluc 2020-03-09 13:01:05 UTC
My apologies for the delay; an intermittent bug is hard to catch! Here's a core dump with gluster-debug installed: https://drive.google.com/open?id=1PcszgKX2AL-MH_U2gMLbFO4GKnVnvpPM

I'll attach the requested information from gdb separately.

Comment 12 gagnon.pierluc 2020-03-09 13:01:37 UTC
Created attachment 1668657 [details]
Thread apply all bt (with gluster-debug)

Comment 13 gagnon.pierluc 2020-03-09 13:02:21 UTC
Created attachment 1668658 [details]
GDB attach output (with gluster-debug)

Comment 14 Worker Ant 2020-03-12 12:20:43 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/875 and will be tracked there from now on. Visit the GitHub issue URL for further details.

