Bug 1508817
Summary: | [Ganesha] : Ganesha crashed while restarting Ganesha post vol stop/deletes in loop. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
Component: | libgfapi | Assignee: | Soumya Koduri <skoduri> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Vivek Das <vdas> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.3 | CC: | amukherj, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, rhinduja, rhs-bugs, skoduri, storage-qa-internal |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-11-09 14:49:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1350191, 1509189, 1565590, 1568373, 1568374 | ||
Bug Blocks: |
Description
Ambarish
2017-11-02 09:49:59 UTC
It looks like there may be a race between gf_timer_proc(), gf_timer_registry_destroy(), and gf_timer_call_cancel(). Here's what I see. gf_timer_proc() is called, locks reg, and gets an event. It unlocks reg and calls the callback. Now gf_timer_registry_destroy() is called, removes reg from ctx, and joins on gf_timer_proc(). Now gf_timer_call_cancel() is called on the event being processed. It cannot find reg (since it has been removed from ctx), so it frees the event. Now the callback returns into gf_timer_proc(), which tries to free the event, but it has already been freed, so we get a double free. I don't know if this is actually possible, since I didn't track down all the places that cancel events, but it seems that it could result in the backtrace here.

That's right, Dan. I too suspected this race may have happened. But then gf_timer_call_cancel() would have logged an error message when it could not find reg, and I couldn't find such a message in the gfapi.log from Ambarish's setup. That's when I realized there is probably another race lurking, and that the code differs between current upstream and the downstream bits. Nevertheless, this is a race we need to fix upstream as well. Filed bug 1509189 to address the issue Dan described above.

Though it's not clear yet which race exactly was hit, we need to backport/cherry-pick both of these fixes for sure:
https://review.gluster.org/#/c/14800/ (bug 1350191)
https://review.gluster.org/18652 (bug 1509189)
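For reference, the suspected interleaving between the timer thread, registry destroy, and cancel can be replayed deterministically in a small C model. This is only a sketch, not the GlusterFS code: every name in it (sketch_event, sketch_ctx, cancel_racy, cancel_guarded, replay) is hypothetical, the threads are replaced by a fixed sequence of steps, and frees are counted rather than performed so that the double free is observable. The guarded variant assumes the timer thread is the one that cleans up any event it still holds once the registry is gone; whether the actual patches take this approach is not confirmed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are NOT the GlusterFS types. */
struct sketch_event {
    bool fired;      /* set once the timer thread has picked the event */
    int  free_count; /* counts releases instead of calling free(), so a
                        double release shows up as a count of 2 */
};

struct sketch_ctx {
    bool has_registry; /* false once destroy has detached the registry */
};

static void sketch_release(struct sketch_event *ev)
{
    ev->free_count++;
}

/* Racy variant: releases the event whenever the registry cannot be found. */
static int cancel_racy(struct sketch_ctx *ctx, struct sketch_event *ev)
{
    if (!ctx->has_registry || !ev->fired) {
        sketch_release(ev);
        return 0;
    }
    return -1; /* event already picked up by the timer thread */
}

/* Guarded variant (an assumption, one possible approach): when the registry
 * is gone, leave cleanup to the timer thread and release nothing here. */
static int cancel_guarded(struct sketch_ctx *ctx, struct sketch_event *ev)
{
    if (!ctx->has_registry)
        return -1;
    if (ev->fired)
        return -1;
    sketch_release(ev);
    return 0;
}

typedef int (*cancel_fn)(struct sketch_ctx *, struct sketch_event *);

/* Replays the interleaving from the comment above as four fixed steps and
 * returns how many times the event ended up being released. */
static int replay(cancel_fn cancel)
{
    struct sketch_ctx   ctx = { true };
    struct sketch_event ev  = { false, 0 };

    ev.fired = true;          /* 1. timer thread picks the event, drops lock */
    ctx.has_registry = false; /* 2. destroy detaches the registry from ctx   */
    (void)cancel(&ctx, &ev);  /* 3. cancel arrives while the callback runs   */
    sketch_release(&ev);      /* 4. callback returns; timer thread releases  */

    return ev.free_count;
}
```

In this model, replay(cancel_racy) returns 2 (the event is released twice, which corresponds to the double free), while replay(cancel_guarded) returns 1. A real fix would also need the fired check and the list removal to happen under the registry lock, which this single-threaded replay does not model.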