1465861 – Removal of io threads from graph causes segfault in quota enable volume

Summary: Removal of io threads from graph causes segfault in quota enable volume

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	quota
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-06-28 11:43 UTC by Sanoj Unnikrishnan
Modified:	2018-11-21 03:15 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-11-21 03:15:58 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Sanoj Unnikrishnan 2017-06-28 11:43:34 UTC

If io-threads is removed from the graph on a quota-enabled volume, the brick process segfaults on writes.

Steps to reproduce:
1) Create a volume and enable quota on it.
2) Stop the volume, remove io-threads from the graph, and restart the volume.
3) Mount the volume and do I/O.
4) The brick process crashes.

The following mail thread has a related discussion:
http://www.spinics.net/lists/gluster-devel/msg23111.html


The issue is that marker allocates only 16 KB of stack space for its syncop tasks, and this seems to be insufficient.
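
For readers unfamiliar with the mechanism, here is a minimal sketch of running a function on a small heap-allocated stack via ucontext. It is not GlusterFS code: `run_on_small_stack`, `task_fn` and `TASK_STACK_SIZE` are hypothetical names chosen for illustration, and the only detail taken from this report is the 16 KB size and the fact that the synctask code sets `ctx.uc_stack.ss_size` on a calloc'd buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define TASK_STACK_SIZE (16 * 1024) /* the 16 KB figure from this report */

static ucontext_t sched_ctx; /* context resumed when the task returns */
static int task_ran;

/* Hypothetical task body: everything it calls must fit in TASK_STACK_SIZE. */
static void task_fn(void)
{
    task_ran = 1;
}

/* Run task_fn on a heap-allocated stack. Returns 0 on success, -1 on error. */
int run_on_small_stack(void)
{
    ucontext_t task_ctx;
    void *stack = calloc(1, TASK_STACK_SIZE);

    if (stack == NULL)
        return -1;
    if (getcontext(&task_ctx) != 0) {
        free(stack);
        return -1;
    }

    task_ctx.uc_stack.ss_sp = stack;            /* lowest address of the stack */
    task_ctx.uc_stack.ss_size = TASK_STACK_SIZE;
    task_ctx.uc_link = &sched_ctx;              /* where to go when task_fn ends */
    makecontext(&task_ctx, task_fn, 0);

    /* Switch to the task; control comes back here once task_fn returns. */
    if (swapcontext(&sched_ctx, &task_ctx) != 0) {
        free(stack);
        return -1;
    }

    free(stack);
    return task_ran ? 0 : -1;
}
```

Nothing checks the stack bounds at run time in this arrangement: if the task and its callees need more than `ss_size` bytes, the writes simply land in whatever memory sits below the buffer.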

Comment 1 Sanoj Unnikrishnan 2017-06-28 11:44:44 UTC


While trying to reproduce this, I found that on each run some random structures were getting corrupted, leading to a segfault.
To confirm that the stack was indeed growing through the allocated space and beyond, I set a guard page near the end of the allocated stack space (so that we hit a segfault before overusing the space).
Below are the code changes.

@@ -443,6 +443,8 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
         struct synctask *newtask = NULL;
         xlator_t        *this    = THIS;
         int             destroymode = 0;
+        int                     r=0;
+        char                    *v;
 
         VALIDATE_OR_GOTO (env, err);
         VALIDATE_OR_GOTO (fn, err);
@@ -498,9 +500,15 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = env->stacksize;
         } else {
-                newtask->stack = GF_CALLOC (1, stacksize,
+               newtask->stack = GF_CALLOC (1, stacksize,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = stacksize;
+                if (stacksize == 16*1024) {
+                        v = (unsigned long)((char *)(newtask->stack) + 4095) & (~4095);
+                        r = mprotect(v, 4096, PROT_NONE);
+                       gf_msg ("syncop", GF_LOG_ERROR, errno,
+                                LG_MSG_GETCONTEXT_FAILED, "SKU: using 16k stack starting at %p, mprotect returned %d, guard page: %p", newtask->stack, r, v);
+               }
         }
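
The guard-page technique used in the patch above can be tried in isolation. The following is a sketch under stated assumptions, not the patch itself: it assumes Linux, uses `mmap` rather than `GF_CALLOC` for the buffer (POSIX only specifies `mprotect` for memory obtained from `mmap`), and the names `page_align_up` and `guard_page_demo` are made up for this example. A forked child touches the guard page so that the caller can observe the resulting signal.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Round addr up to the next multiple of page (page must be a power of two). */
static uintptr_t page_align_up(uintptr_t addr, uintptr_t page)
{
    return (addr + page - 1) & ~(page - 1);
}

/*
 * Map a stack-sized region, protect its lowest page-aligned page, then let a
 * child process write to that page. Returns the signal that killed the child,
 * 0 if the child survived, or -1 on a setup error.
 */
int guard_page_demo(size_t stacksize)
{
    long page = sysconf(_SC_PAGESIZE);
    char *stack, *guard;
    pid_t pid;
    int status = 0;

    if (page <= 0 || stacksize < 2 * (size_t)page)
        return -1;

    stack = mmap(NULL, stacksize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* mmap memory is already page aligned, so this is a no-op here; it mirrors
     * the rounding the patch needs for an unaligned heap buffer. */
    guard = (char *)page_align_up((uintptr_t)stack, (uintptr_t)page);
    if (mprotect(guard, (size_t)page, PROT_NONE) != 0) {
        munmap(stack, stacksize);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(stack, stacksize);
        return -1;
    }
    if (pid == 0) {
        *(volatile char *)guard = 1; /* stands in for a push into the guard */
        _exit(0);                    /* only reached if the write succeeded */
    }

    if (waitpid(pid, &status, 0) < 0) {
        munmap(stack, stacksize);
        return -1;
    }
    munmap(stack, stacksize);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the child is expected to die with SIGSEGV, which is the same kind of early, deterministic failure the patch relies on instead of silent corruption.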
 
(gdb) where
#0  0x00007f8a92c51204 in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#1  0x00007f8a92c561e3 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f8a92c5dd33 in _dl_runtime_resolve_avx () from /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000000 in ?? ()


(gdb) info reg

rdi            0x7f8a92946188	140233141412232
rbp            0x7f8a800b4000	0x7f8a800b4000
rsp            0x7f8a800b4000	0x7f8a800b4000
r8             0x7f8a92e4ba60	140233146677856

(gdb) layout asm

  >│0x7f8a92c51204 <_dl_lookup_symbol_x+4>          push   %r15                   <== push on stack at the guarded page caused segfault

From the brick log we have,

[syncop.c:515:synctask_create] 0-syncop: SKU: using 16k stack starting at 0x7f8a800b28f0, mprotect returned 0, guard page: 0x7f8a800b3000 [No data available]

The stack grows downward from 0x7f8a800b68f0 to 0x7f8a800b28f0, and the page 0x7f8a800b3000 - 0x7f8a800b4000 is guarded, which is where the segfault hit, as seen in gdb.
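
The address arithmetic can be checked with a few lines of C. This is only a sketch for verifying the numbers quoted from the log and gdb; `stack_layout`, `compute_layout` and `push_hits_guard` are hypothetical helpers, and a 4096-byte page and an 8-byte push (x86-64) are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define GUARD_PAGE_SIZE 4096ULL /* assumed page size, as in the patch */

struct stack_layout {
    uint64_t base;        /* lowest address of the allocation */
    uint64_t top;         /* base + size, where the stack starts growing down */
    uint64_t guard_start; /* base rounded up to a page boundary */
    uint64_t guard_end;   /* one page above guard_start */
};

/* Derive the stack and guard-page bounds from the allocation base and size. */
struct stack_layout compute_layout(uint64_t base, uint64_t size)
{
    struct stack_layout l;

    l.base = base;
    l.top = base + size;
    l.guard_start = (base + GUARD_PAGE_SIZE - 1) & ~(GUARD_PAGE_SIZE - 1);
    l.guard_end = l.guard_start + GUARD_PAGE_SIZE;
    return l;
}

/* A push writes 8 bytes just below rsp; report whether that lands in the guard. */
int push_hits_guard(const struct stack_layout *l, uint64_t rsp)
{
    uint64_t addr = rsp - 8;

    return addr >= l->guard_start && addr < l->guard_end;
}
```

Calling `compute_layout(0x7f8a800b28f0, 16 * 1024)` reproduces the top 0x7f8a800b68f0 and the guard page 0x7f8a800b3000 - 0x7f8a800b4000, and `push_hits_guard` with rsp = 0x7f8a800b4000 reports a hit. With this guard placement, the region between the guard page and the top of the stack is 0x28f0 (10480) bytes.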

This confirms that the stack space is not sufficient and is overflowing.
I am not sure why we don't hit this in the presence of io-threads, though. It may just be that with io-threads in the graph there is some allocated but unused memory that our stack freely grows into.
That is, it may just be a silent, undetected reuse of some memory.

Comment 4 hari gowtham 2018-11-21 03:15:58 UTC

As quota is not being actively developed, we are closing this bug.
