Bug 1159269 - Random crashes when generating an internal state dump with signal USR1
Summary: Random crashes when generating an internal state dump with signal USR1
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1159284
Reported: 2014-10-31 10:47 UTC by Xavi Hernandez
Modified: 2016-09-13 06:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1159284
Environment:
Last Closed: 2016-09-13 06:31:53 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Xavi Hernandez 2014-10-31 10:47:54 UTC
Description of problem:

Sometimes a segmentation fault is generated while dumping internal state. An analysis of the core dump seems to indicate that the bug is caused by an unaligned structure:

In gf_proc_dump_call_frame() a copy of the frame is made inside a locked region:

        ret = TRY_LOCK(&call_frame->lock);
        if (ret)
                goto out;

        memcpy(&my_frame, call_frame, sizeof(my_frame));
        UNLOCK(&call_frame->lock);

call_frame->lock does not protect most of the updates to fields inside the call_frame_t structure, especially the wind_from, wind_to, unwind_from and unwind_to pointers, which are modified by the STACK_WIND and STACK_UNWIND macros without taking the lock.
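For illustration only, a lock protects a snapshot only when writers take the same lock. The following is a minimal sketch using pthreads and a hypothetical, simplified demo_frame_t. It is not GlusterFS code, and it is not what the proposed patch does. All names prefixed with demo_ are assumptions made for the example.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for call_frame_t (not the GlusterFS type). */
typedef struct demo_frame {
    pthread_mutex_t lock;
    const char *wind_from;
    const char *unwind_from;
} demo_frame_t;

/* Plain-data snapshot that a dumper could print after releasing the lock. */
typedef struct demo_snapshot {
    const char *wind_from;
    const char *unwind_from;
} demo_snapshot_t;

static void demo_frame_init(demo_frame_t *f)
{
    pthread_mutex_init(&f->lock, NULL);
    f->wind_from = NULL;
    f->unwind_from = NULL;
}

/* Writer side: the update happens under the same lock the dumper takes. */
static void demo_set_unwind_from(demo_frame_t *f, const char *name)
{
    pthread_mutex_lock(&f->lock);
    f->unwind_from = name;
    pthread_mutex_unlock(&f->lock);
}

/* Reader side: returns 0 on success, nonzero if the lock is busy,
 * mirroring the TRY_LOCK / goto out pattern quoted in the report. */
static int demo_snapshot_frame(demo_frame_t *f, demo_snapshot_t *out)
{
    int ret = pthread_mutex_trylock(&f->lock);
    if (ret != 0)
        return ret;
    out->wind_from = f->wind_from;
    out->unwind_from = f->unwind_from;
    pthread_mutex_unlock(&f->lock);
    return 0;
}
```

With this pattern the reader can never observe a half-written field, regardless of alignment, at the cost of taking a lock on every wind and unwind.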

This would not be a problem if all these updates were atomic. However, it seems that the memory pool framework can return unaligned pointers (at least on 64-bit architectures):

(gdb) print call_frame
$19 = (call_frame_t *) 0x7f4609a141c4

This means that all pointers inside the structure can be unaligned:

(gdb) print &call_frame->unwind_from
$20 = (const char **) 0x7f4609a14244

At the microprocessor level, this means that a modification of the unwind_from field can require two memory accesses, which makes the update non-atomic and prone to partial reads by other threads.
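The alignment of the two addresses printed by gdb can be checked with a small helper. This is a minimal sketch, and the function names is_aligned and straddles_word are made up for the example. It shows that both addresses are 4-byte aligned but not 8-byte aligned, so the 8-byte pointer spans two naturally aligned 8-byte words.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uint64_t addr, size_t align)
{
    return (addr & (uint64_t)(align - 1)) == 0;
}

/* Returns 1 if an object of `size` bytes starting at `addr` spans more than
 * one naturally aligned `word`-byte unit (word must be a power of two). */
static int straddles_word(uint64_t addr, size_t size, size_t word)
{
    uint64_t first = addr & ~(uint64_t)(word - 1);
    uint64_t last = (addr + size - 1) & ~(uint64_t)(word - 1);
    return first != last;
}
```

For &call_frame->unwind_from (0x7f4609a14244), is_aligned(addr, 8) is false and straddles_word(addr, 8, 8) is true: the low 4 bytes live in the word at 0x...240 and the high 4 bytes in the word at 0x...248.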

In fact this seems to be what happened:

(gdb) print *call_frame
$21 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0x0,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 1, cookie = 0x9, complete = _gf_true, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk",
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}
(gdb) print my_frame
$22 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0xb6a0b4,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 0, cookie = 0x9, complete = _gf_false, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000>,
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}

The copy in my_frame contains only half of the unwind_from pointer because the field was being updated by another thread while it was copied. The current contents of call_frame show that the update completed before the crash, but the copy in my_frame remains incorrect:

(gdb) print call_frame->unwind_from
$23 = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk"
(gdb) print my_frame.unwind_from
$24 = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000>
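The bad value is consistent with a torn read in which only the upper 32 bits of the new pointer were visible. The sketch below models that arithmetic. It assumes a little-endian 64-bit machine and assumes the previous value of unwind_from was NULL, which the report does not state but which matches the zeroed lower half. The function name is made up for the example.

```c
#include <stdint.h>

/* Combine the upper 32 bits of `new_val` with the lower 32 bits of `old_val`,
 * modelling a reader that saw only the high half of an in-progress store. */
static uint64_t torn_high_new_low_old(uint64_t old_val, uint64_t new_val)
{
    return (new_val & 0xFFFFFFFF00000000ULL) |
           (old_val & 0x00000000FFFFFFFFULL);
}
```

Calling torn_high_new_low_old(0, 0x7f45fef26c80) yields 0x7f4500000000, the value seen in my_frame.unwind_from.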

Version-Release number of selected component (if applicable): master



Comment 1 Anand Avati 2014-10-31 11:33:40 UTC
REVIEW: http://review.gluster.org/9031 (mem-pool: Fix memory block alignments) posted (#1) for review on master by Xavier Hernandez (xhernandez)
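The contents of the patch are not shown in this report. As a general illustration only, a pool that carves fixed-size blocks out of a larger buffer can keep every block aligned by rounding the per-block size up to a multiple of the required alignment. The helper below is a minimal sketch of that rounding. The name align_up is hypothetical and is not taken from the GlusterFS sources.

```c
#include <stddef.h>

/* Round `size` up to the next multiple of `align` (align must be a power
 * of two). If the pool base is aligned and every block size is a multiple
 * of `align`, each block start is aligned as well. */
static size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}
```

For example, a block size of 0x1c4 bytes rounds up to 0x1c8 for 8-byte alignment.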

Comment 2 Mike McCune 2016-03-28 23:46:11 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 3 Xavi Hernandez 2016-09-13 06:31:53 UTC
It seems to not happen anymore.

