Bug 1705351

Summary: glusterfsd crash after days of running
Product: [Community] GlusterFS
Component: HDFS
Version: mainline
Status: CLOSED EOL
Severity: urgent
Priority: medium
Hardware: x86_64
OS: Linux
Reporter: waza123
Assignee: bugs <bugs>
CC: atumball, bugs, jahernan, nchilaka, pasik
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2019-07-15 05:16:10 UTC

Description waza123 2019-05-02 07:10:11 UTC
The glusterfsd process of one of the bricks just crashed and it can't be started again.
What can I do to start it again?

gdb backtrace from the crash dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
239 local = upcall_local_init (frame, this, NULL, NULL, fd->inode, NULL);
[Current thread is 1 (Thread 0x7feb0031e700 (LWP 12319))]
(gdb) bt
#0 up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
#1 0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
#2 0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
#3 0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
#4 0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
#5 0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) bt full
#0 up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
op_errno = -1
local = 0x0
__FUNCTION__ = "up_lk" 
#1 0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
_new = 0x7fea88193f30
old_THIS = 0x7feb3401e060
tmp_cbk = 0x7feb3e1bafa0 <default_lk_cbk>
__FUNCTION__ = "default_lk_resume" 
#2 0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
old_THIS = 0x7feb3401e060
__FUNCTION__ = "call_resume" 
#3 0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
conf = 0x7feb34058480
this = <optimized out>
stub = 0x7feb0d174bf0
sleep_till = {tv_sec = 1556637893, tv_nsec = 0}
ret = <optimized out>
pri = 1
bye = _gf_false
__FUNCTION__ = "iot_worker" 
#4 0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
__res = <optimized out>
pd = 0x7feb0031e700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647297312512, 5756482990956014801, 0, 140648089937359, 140647297313216, 140648166818944, -5749651260269466415,
-5749590536105693999}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread" 
#5 0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
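
The frame #0 arguments show fd=0x0, and the source line gdb prints for upcall.c:239 passes fd->inode to upcall_local_init(), so the SIGSEGV is a NULL fd being dereferenced in the upcall translator while a queued lock request is resumed by an io-threads worker. Below is a minimal standalone sketch of that failure pattern and of a hypothetical NULL guard; the struct and function names are simplified stand-ins rather than the actual GlusterFS sources, and this is not the upstream fix (the real question is why fd is NULL at that point).

/* Standalone illustration of the crash pattern seen in frame #0.
 * inode_t/fd_t here are simplified stand-ins for the GlusterFS types. */
#include <stdio.h>

typedef struct inode { int dummy; } inode_t;
typedef struct fd    { inode_t *inode; } fd_t;

/* Mirrors upcall.c:239: the inode is read from fd without checking fd. */
static int up_lk_unguarded(fd_t *fd)
{
    inode_t *ino = fd->inode;                 /* SIGSEGV when fd == NULL */
    printf("inode = %p\n", (void *)ino);
    return 0;
}

/* Hypothetical defensive variant: reject a NULL fd instead of crashing. */
static int up_lk_guarded(fd_t *fd)
{
    if (fd == NULL || fd->inode == NULL) {
        fprintf(stderr, "up_lk: NULL fd, returning an error instead\n");
        return -1;                            /* caller would unwind with an errno */
    }
    printf("inode = %p\n", (void *)fd->inode);
    return 0;
}

int main(void)
{
    inode_t ino = {0};
    fd_t good = { .inode = &ino };

    up_lk_guarded(&good);                     /* succeeds */
    up_lk_guarded(NULL);                      /* rejected, no crash */
    /* up_lk_unguarded(NULL) would segfault exactly like the backtrace. */
    (void)up_lk_unguarded;                    /* referenced only to avoid an unused warning */
    return 0;
}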





# config

# gluster volume info

Volume Name: hadoop_volume
Type: Disperse
Volume ID: f13b43b0-ff9e-429b-81ed-15c92cdd1181
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: hdd1:/hadoop
Brick2: hdd2:/hadoop
Brick3: hdd3:/hadoop
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
server.statedump-path: /tmp
performance.client-io-threads: on
server.event-threads: 16
client.event-threads: 16
cluster.lookup-optimize: on
performance.parallel-readdir: on
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 500000
features.lock-heal: on



# status

# gluster volume status
Status of volume: hadoop_volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hdd1:/hadoop                          49152     0          Y       5085
Brick hdd2:/hadoop                          49152     0          Y       4044
Self-heal Daemon on localhost               N/A       N/A        Y       2383
Self-heal Daemon on serv3                   N/A       N/A        Y       2423
Self-heal Daemon on serv2                   N/A       N/A        Y       3429
Self-heal Daemon on hdd2                    N/A       N/A        Y       4035
Self-heal Daemon on hdd1                    N/A       N/A        Y       5076

Task Status of Volume hadoop_volume
------------------------------------------------------------------------------
There are no active volume tasks

Comment 1 Xavi Hernandez 2019-05-02 12:13:45 UTC
Can you upload the coredump so that I can analyze it?
I will also need to know the exact version of gluster and the operating system you are using.

To restart the crashed brick, the following command should help:

   # gluster volume start hadoop_volume force

Comment 3 Xavi Hernandez 2019-05-06 07:07:21 UTC
Thanks for sharing the coredump. I'll take a look as soon as I can.

Comment 5 Xavi Hernandez 2019-06-04 10:54:31 UTC
Sorry for the late answer. I've checked the core dump and it seems to belong to glusterfs 3.10.10, which is a very old version and already EOL. Is it possible to upgrade to a newer supported version and check whether the issue still occurs?

At first sight I don't see a similar bug, but many things have changed since then.

If you are unable to upgrade, let me know which operating system version you are using and which source you used to install the gluster packages, so that I can find the appropriate symbols to analyze the core.

Comment 6 Amar Tumballi 2019-07-15 05:16:10 UTC
Hi waza123,

Did you get a chance to upgrade? We have fixed many issues since 3.10.10; we are already at version 6.3 and about to release glusterfs-7.0. I will be closing this issue as EOL (the reported version is no longer supported). Please re-open if the issue persists on newer versions.

Comment 7 Red Hat Bugzilla 2023-09-14 05:27:52 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.