1508817 – [Ganesha] : Ganesha crashed while restarting Ganesha post vol stop/deletes in loop.

Bug 1508817 - [Ganesha] : Ganesha crashed while restarting Ganesha post vol stop/deletes in loop.

Summary: [Ganesha] : Ganesha crashed while restarting Ganesha post vol stop/deletes in...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	libgfapi
Sub Component:
Version:	rhgs-3.3
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Soumya Koduri
QA Contact:	Vivek Das
Docs Contact:
URL:
Whiteboard:
Depends On:	1350191 1509189 1565590 1568373 1568374
Blocks:
TreeView+	depends on / blocked

Reported:	2017-11-02 09:49 UTC by Ambarish
Modified:	2018-11-09 14:49 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-09 14:49:45 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Ambarish 2017-11-02 09:49:59 UTC

Description of problem:
-----------------------

Created 50 volumes.
Exported them via Ganesha.
Stopped them,deleted them - both in loop.
Restarted Ganesha service.

Ganesha dumped a core on one of the nodes :

Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d80f1009d in __gf_free (free_ptr=free_ptr@entry=0x7f0a200caaa0) at mem-pool.c:314
314	        GF_ASSERT (GF_MEM_TRAILER_MAGIC ==
(gdb) bt
#0  0x00007f0d80f1009d in __gf_free (free_ptr=free_ptr@entry=0x7f0a200caaa0) at mem-pool.c:314
#1  0x00007f0d80ef6f05 in gf_timer_proc (data=0x5618681c5640) at timer.c:168
#2  0x00007f0d84551e25 in start_thread (arg=0x7f099773a700) at pthread_create.c:308
#3  0x00007f0d83c1f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 



Version-Release number of selected component (if applicable):
--------------------------------------------------------------

glusterfs-ganesha-3.8.4-50.el7rhgs.x86_64
nfs-ganesha-2.4.4-17.el7rhgs.x86_64


How reproducible:
----------------

1/1

Comment 4 Daniel Gryniewicz 2017-11-02 15:03:19 UTC

It looks like maybe there can be a race beween gf_time_proc(), gf_timer_registry_destroy(), and gf_timer_call_cancel().  Here's what I see.

gf_timer_proc() is called, locks reg, and gets an event.  It unlocks reg, and calls the callback.

Now, gf_timer_registry_destroy() is called, and removes reg from ctx, and joins on gf_timer_proc().

Now, gf_timer_call_cancel() is called on the event being processed.  It cannot find reg (since it's been removed from reg), so it frees event.

Now the callback returns into gf_timer_proc(), and it tries to free event, but it's already free, so double free.

I don't know if this is actually possible, since I didn't track down all the places that cancel events, but it seems that it could result in the backtrace here.

Comment 5 Soumya Koduri 2017-11-03 03:14:44 UTC

Thats right Dan. I too suspected this race may have happened.. But then gf_timer_call_cancel() would have logged the error message when it cannot find reg. In the gfapi.log provided on Ambarish setup, I couldn't find such message. Thats when I realized probably there is another race lurking and the code differs from current upstream to downstream bits.

Neverthless, this is one race which we need to fix in the upstream as well.

Comment 6 Soumya Koduri 2017-11-03 10:44:25 UTC

Filed bug1509189 to address the above issue mentioned by Dan. Though its not clear yet which race exactly was hit, we need to backport/cherry-pick both these fixes for sure -

https://review.gluster.org/#/c/14800/ (bug1350191)
https://review.gluster.org/18652 (bug1509189)

Note You need to log in before you can comment on or make changes to this bug.