Bug 1508817

Summary: [Ganesha] : Ganesha crashed while restarting Ganesha post vol stop/deletes in loop.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: libgfapi
Assignee: Soumya Koduri <skoduri>
Status: CLOSED CURRENTRELEASE
QA Contact: Vivek Das <vdas>
Severity: high
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2018-11-09 14:49:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Story Points: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1350191, 1509189, 1565590, 1568373, 1568374

Description Ambarish 2017-11-02 09:49:59 UTC
Description of problem:
-----------------------

Created 50 volumes.
Exported them via Ganesha.
Stopped them, deleted them - both in a loop.
Restarted Ganesha service.

Ganesha dumped a core on one of the nodes:

Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d80f1009d in __gf_free (free_ptr=free_ptr@entry=0x7f0a200caaa0) at mem-pool.c:314
314	        GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
(gdb) bt
#0  0x00007f0d80f1009d in __gf_free (free_ptr=free_ptr@entry=0x7f0a200caaa0) at mem-pool.c:314
#1  0x00007f0d80ef6f05 in gf_timer_proc (data=0x5618681c5640) at timer.c:168
#2  0x00007f0d84551e25 in start_thread (arg=0x7f099773a700) at pthread_create.c:308
#3  0x00007f0d83c1f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 



Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.8.4-50.el7rhgs.x86_64
nfs-ganesha-2.4.4-17.el7rhgs.x86_64


How reproducible:
----------------

1/1

Comment 4 Daniel Gryniewicz 2017-11-02 15:03:19 UTC
It looks like maybe there can be a race between gf_timer_proc(), gf_timer_registry_destroy(), and gf_timer_call_cancel().  Here's what I see.

gf_timer_proc() is called, locks reg, and gets an event.  It unlocks reg, and calls the callback.

Now, gf_timer_registry_destroy() is called, and removes reg from ctx, and joins on gf_timer_proc().

Now, gf_timer_call_cancel() is called on the event being processed.  It cannot find reg (since it's been removed from ctx), so it frees event.

Now the callback returns into gf_timer_proc(), and it tries to free event, but it has already been freed, so we get a double free.

I don't know if this is actually possible, since I didn't track down all the places that cancel events, but it seems that it could result in the backtrace here.
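
A minimal, self-contained model of that interleaving (the struct and function names below are simplified stand-ins for illustration, not the actual glusterfs timer code):

/*
 * Timeline of the suspected race:
 *
 *   timer thread (gf_timer_proc)        other threads
 *   ----------------------------        ------------------------------------
 *   lock reg, pop event, unlock reg
 *   run event->callbk(...)          ->  gf_timer_registry_destroy() drops
 *                                       reg from ctx
 *                                       gf_timer_call_cancel(event) cannot
 *                                       find reg, so it frees event
 *   free(event)                     <-  second free of the same pointer,
 *                                       caught by __gf_free()'s trailer check
 */
#include <pthread.h>
#include <stdlib.h>

struct event {
    void (*callbk)(void *data);
    void *data;
    struct event *next;
};

struct registry {
    pthread_mutex_t lock;
    struct event *active;
};

static struct registry *reg;        /* stand-in for ctx->timer */

/* timer thread: pop an event, run its callback, then free the event */
static void *timer_proc(void *arg)
{
    (void)arg;

    pthread_mutex_lock(&reg->lock);
    struct event *ev = reg->active;
    if (ev)
        reg->active = ev->next;
    pthread_mutex_unlock(&reg->lock);

    if (ev) {
        ev->callbk(ev->data);       /* registry can be torn down while this runs */
        free(ev);                   /* second free if cancel already freed ev */
    }
    return NULL;
}

/* teardown path: the registry disappears from the ctx */
static void registry_destroy(void)
{
    reg = NULL;
}

/* cancel path: with no registry to unlink from, the event is freed here */
static void call_cancel(struct event *ev)
{
    if (!reg) {
        free(ev);                   /* first free of the same pointer */
        return;
    }
    /* normal path: unlink ev from reg->active under reg->lock */
}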

Comment 5 Soumya Koduri 2017-11-03 03:14:44 UTC
That's right, Dan. I too suspected this race may have happened. But then gf_timer_call_cancel() would have logged an error message when it cannot find reg. In the gfapi.log provided on Ambarish's setup, I couldn't find such a message. That's when I realized there is probably another race lurking, and that the code differs between the current upstream and the downstream bits.

Nevertheless, this is one race which we need to fix upstream as well.
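
Reusing the simplified model from comment 4, the point is roughly this (illustrative only - the exact message text and ID live in libglusterfs's timer.c and are not reproduced here):

#include <stdio.h>
#include <stdlib.h>

struct event;                        /* same simplified event as in the earlier sketch */
struct registry;
static struct registry *reg;         /* stand-in for ctx->timer, as above */

static void call_cancel_with_log(struct event *ev)
{
    if (!reg) {
        /* The real gf_timer_call_cancel() logs an error on this path; that
         * is the message looked for (and not found) in the gfapi.log here. */
        fprintf(stderr, "timer: registry not found, freeing event\n");
        free(ev);
        return;
    }
    /* normal path: unlink ev from the registry's list */
}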

Comment 6 Soumya Koduri 2017-11-03 10:44:25 UTC
Filed bug 1509189 to address the above issue mentioned by Dan. Though it's not yet clear exactly which race was hit, we need to backport/cherry-pick both of these fixes for sure -

https://review.gluster.org/#/c/14800/ (bug1350191)
https://review.gluster.org/18652 (bug1509189)