Bug 1159284 - Random crashes when generating an internal state dump with signal USR1
Summary: Random crashes when generating an internal state dump with signal USR1
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1159269
Blocks: glusterfs-3.6.1 glusterfs-3.6.2
 
Reported: 2014-10-31 11:41 UTC by Xavi Hernandez
Modified: 2016-08-16 13:02 UTC
CC: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1159269
Environment:
Last Closed: 2016-08-16 13:02:00 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Xavi Hernandez 2014-10-31 11:41:39 UTC
+++ This bug was initially created as a clone of Bug #1159269 +++

Description of problem:

Sometimes a segmentation fault is generated while dumping internal state. An analysis of the core dump seems to indicate that the bug is caused by an unaligned structure:

In gf_proc_dump_call_frame() a copy of the frame is made inside a locked region:

        ret = TRY_LOCK(&call_frame->lock);
        if (ret)
                goto out;

        memcpy(&my_frame, call_frame, sizeof(my_frame));
        UNLOCK(&call_frame->lock);

call_frame->lock does not protect most of the updates to fields inside the call_frame_t structure, especially the wind_from, wind_to, unwind_from and unwind_to pointers modified by the STACK_WIND and STACK_UNWIND macros.
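
For illustration, a simplified sketch of the kind of bookkeeping these macros perform (paraphrased, not the exact GlusterFS source) shows why the TRY_LOCK in the dump code does not help:

        /* Illustrative sketch only -- not the actual GlusterFS macro. The
         * point is that these bookkeeping pointers are written with plain
         * stores, without holding call_frame->lock. */
        #define STACK_UNWIND_SKETCH(frame, fn)                            \
            do {                                                          \
                (frame)->unwind_from = __FUNCTION__; /* unlocked write */ \
                (frame)->unwind_to   = #fn;          /* unlocked write */ \
                /* ... then invoke the parent frame's callback ... */     \
            } while (0)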

This wouldn't be a problem if all these updates were atomic; however, it seems that the memory pool framework can return unaligned pointers (at least on 64-bit architectures):

(gdb) print call_frame
$19 = (call_frame_t *) 0x7f4609a141c4

This means that all pointers inside the structure can be unaligned:

(gdb) print &call_frame->unwind_from
$20 = (const char **) 0x7f4609a14244

At the processor level, this means that modifying the unwind_from field takes two memory access cycles, making the update non-atomic and prone to partial reads by other threads.
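
The misalignment can be verified directly from the addresses in the gdb session above; a minimal standalone check:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            /* Addresses taken from the core dump above. */
            uintptr_t frame_addr = 0x7f4609a141c4;  /* call_frame */
            uintptr_t field_addr = 0x7f4609a14244;  /* &call_frame->unwind_from */

            /* Both are only 4-byte aligned (addr % 8 == 4), so every 8-byte
             * pointer member straddles two naturally aligned words. */
            printf("frame %% 8 = %zu, field %% 8 = %zu\n",
                   (size_t)(frame_addr % 8), (size_t)(field_addr % 8));
            return 0;
        }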

In fact this seems to be what happened:

(gdb) print *call_frame
$21 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0x0,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 1, cookie = 0x9, complete = _gf_true, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk",
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}
(gdb) print my_frame
$22 = {root = 0x7f460984a280, parent = 0x7f460984a8e8,
next = 0x7f4609a13454, prev = 0x7f4609a15540, local = 0xb6a0b4,
this = 0xae2470, ret = 0x7f45fec75311 <ec_lookup_cbk>, ref_count = 0,
lock = 0, cookie = 0x9, complete = _gf_false, op = GF_FOP_NULL,
begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0},
wind_from = 0x7f45fecdc082 <__FUNCTION__.13893> "ec_wind_lookup",
wind_to = 0x7f45fecdbd20 "ec->xl_list[idx]->fops->lookup",
unwind_from = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000>,
unwind_to = 0x7f45fecdbd3f "ec_lookup_cbk"}

The copy into my_frame captured only half of the unwind_from pointer because another thread was updating it at the same time. Checking the current contents of call_frame shows that the update completed before the crash, but the copy in my_frame remains incorrect:

(gdb) print call_frame->unwind_from
$23 = 0x7f45fef26c80 <__FUNCTION__.19453> "client3_3_lookup_cbk"
(gdb) print my_frame.unwind_from
$24 = 0x7f4500000000 <error: Cannot access memory at address 0x7f4500000000> 
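
The observed garbage value is consistent with an 8-byte store torn into two 4-byte halves. Assuming the field's previous value was NULL (an assumption; the old value is not visible in the dump), the copy picked up the new upper half combined with the old lower half:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t old_val = 0x0000000000000000ULL;  /* assumed previous value (NULL) */
            uint64_t new_val = 0x00007f45fef26c80ULL;  /* "client3_3_lookup_cbk" */

            /* If the memcpy runs between the two 4-byte accesses of the
             * unaligned store, it sees the new upper half and the old
             * lower half. */
            uint64_t torn = (new_val & 0xffffffff00000000ULL) | (old_val & 0xffffffffULL);
            printf("torn value: 0x%016llx\n", (unsigned long long)torn);
            /* Prints 0x00007f4500000000 -- exactly the value seen in my_frame. */
            return 0;
        }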

Version-Release number of selected component (if applicable): master


--- Additional comment from Anand Avati on 2014-10-31 12:33:40 CET ---

REVIEW: http://review.gluster.org/9031 (mem-pool: Fix memory block alignments) posted (#1) for review on master by Xavier Hernandez (xhernandez)

Comment 1 Anand Avati 2014-10-31 11:43:58 UTC
REVIEW: http://review.gluster.org/9032 (mem-pool: Fix memory block alignments) posted (#1) for review on release-3.6 by Xavier Hernandez (xhernandez)
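
The patch adjusts the memory pool's block layout. A minimal sketch of the general technique (illustrative only, with a hypothetical header size; not the actual patch) is to round the per-block header up to the platform's strictest alignment so the pointer handed back to callers stays naturally aligned:

        #include <stddef.h>
        #include <stdalign.h>

        /* Illustrative only -- not the actual mem-pool code. POOL_HDR_SIZE
         * is a hypothetical unpadded per-block header size. */
        #define POOL_HDR_SIZE 20
        #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

        static void *pool_block_to_user_ptr(void *block)
        {
            /* Pad the header to a multiple of max_align_t so the returned
             * address is suitably aligned for any object, including 8-byte
             * pointers on 64-bit platforms. */
            size_t hdr = ALIGN_UP(POOL_HDR_SIZE, alignof(max_align_t));
            return (char *)block + hdr;
        }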

Comment 2 Mike McCune 2016-03-28 23:46:11 UTC
This bug was accidentally moved from POST to MODIFIED due to an error in automation; please contact mmccune with any questions.

Comment 3 Niels de Vos 2016-08-16 13:02:00 UTC
This bug is being closed because GlusterFS 3.6 is nearing its End-Of-Life and only important security bugs will still be fixed. This bug has been fixed in more recent GlusterFS releases. If you still face this bug with a newer GlusterFS version, please open a new bug.

