Bug 1465861 - Removal of io threads from graph causes segfault in quota enable volume
Status: ASSIGNED
Product: GlusterFS
Classification: Community
Component: quota
Version: mainline
Severity: unspecified
Assigned To: Sanoj Unnikrishnan
QA Contact: Rahul Hinduja
Reported: 2017-06-28 07:43 EDT by Sanoj Unnikrishnan
Modified: 2018-02-06 17:34 EST
Type: Bug
Description Sanoj Unnikrishnan 2017-06-28 07:43:34 EDT
If io-threads is removed from the graph on a quota-enabled volume,
the brick process segfaults on writes.

Steps to reproduce:
1) Create a volume and enable quota on it.
2) Stop the volume, remove io-threads from the graph, and restart the volume.
3) Mount the volume and do I/O.
4) The brick process crashes.

The below mail thread has related discussion:
http://www.spinics.net/lists/gluster-devel/msg23111.html


The issue is that marker allocates only 16k of stack space for the syncop threads, and this seems to be insufficient.
Comment 1 Sanoj Unnikrishnan 2017-06-28 07:44:44 EDT

While trying to reproduce this, I found that on each run some random structures were getting corrupted, which led to the segfault.
To confirm that the stack was indeed growing through all of the allocated space and beyond, I set a guard page at the end of the allocated stack space (so that we hit a segfault before overusing the space).
Below are the code changes.

@@ -443,6 +443,8 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
         struct synctask *newtask = NULL;
         xlator_t        *this    = THIS;
         int             destroymode = 0;
+        int                     r=0;
+        char                    *v;
 
         VALIDATE_OR_GOTO (env, err);
         VALIDATE_OR_GOTO (fn, err);
@@ -498,9 +500,15 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = env->stacksize;
         } else {
-                newtask->stack = GF_CALLOC (1, stacksize,
+               newtask->stack = GF_CALLOC (1, stacksize,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = stacksize;
+                if (stacksize == 16*1024) {
+                        v = (unsigned long)((char *)(newtask->stack) + 4095) & (~4095);
+                        r = mprotect(v, 4096, PROT_NONE);
+                       gf_msg ("syncop", GF_LOG_ERROR, errno,
+                                LG_MSG_GETCONTEXT_FAILED, "SKU: using 16k stack starting at %p, mprotect returned %d, guard page: %p", newtask->stack, r, v);
+               }
         }
 
(gdb) where
#0  0x00007f8a92c51204 in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#1  0x00007f8a92c561e3 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f8a92c5dd33 in _dl_runtime_resolve_avx () from /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000000 in ?? ()


(gdb) info reg

rdi            0x7f8a92946188	140233141412232
rbp            0x7f8a800b4000	0x7f8a800b4000
rsp            0x7f8a800b4000	0x7f8a800b4000
r8             0x7f8a92e4ba60	140233146677856

(gdb) layout asm

  >│0x7f8a92c51204 <_dl_lookup_symbol_x+4>          push   %r15                   <== push on stack at the guarded page caused segfault

From the brick log we have,

[syncop.c:515:synctask_create] 0-syncop: SKU: using 16k stack starting at 0x7f8a800b28f0, mprotect returned 0, guard page: 0x7f8a800b3000 [No data available]

The stack grows downward from 0x7f8a800b68f0 to 0x7f8a800b28f0, and the page 0x7f8a800b3000 - 0x7f8a800b4000 is guarded, which is where the segfault hit, as seen in gdb.
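
The address arithmetic above can be checked with a small standalone sketch. The helper names below are hypothetical (they are not GlusterFS functions), and the 4096-byte page size is the same assumption the debug patch hard-codes. The rounding in guard_page_start() mirrors the patch's (addr + 4095) & ~4095 expression.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as hard-coded in the debug patch. */
#define ASSUMED_PAGE_SIZE ((uint64_t)4096)

/* First page-aligned address at or above the start of the allocation.
 * Same rounding as the patch: (addr + 4095) & ~4095. */
static uint64_t guard_page_start(uint64_t stack_base)
{
    return (stack_base + ASSUMED_PAGE_SIZE - 1) & ~(ASSUMED_PAGE_SIZE - 1);
}

/* Initial top of stack: the stack grows down from base + size. */
static uint64_t stack_top(uint64_t stack_base, uint64_t size)
{
    return stack_base + size;
}

/* Nonzero if an 8-byte push executed with the given rsp writes
 * into the guard page starting at `guard`. */
static int push_hits_guard(uint64_t rsp, uint64_t guard)
{
    uint64_t target = rsp - 8; /* push decrements rsp, then stores */
    return target >= guard && target < guard + ASSUMED_PAGE_SIZE;
}

/* Bytes between the initial top of stack and the upper edge of the
 * guard page, i.e. how far the stack can grow before faulting. */
static uint64_t bytes_before_guard(uint64_t stack_base, uint64_t size)
{
    return stack_top(stack_base, size)
           - (guard_page_start(stack_base) + ASSUMED_PAGE_SIZE);
}
```

With the logged base 0x7f8a800b28f0 and a 16k stack, guard_page_start() gives 0x7f8a800b3000 and stack_top() gives 0x7f8a800b68f0, matching the log line and the range quoted above. With rsp = 0x7f8a800b4000 from gdb, the push targets 0x7f8a800b3ff8, which lies inside the guarded page. The same arithmetic gives 0x68f0 - 0x4000 = 10480 bytes between the initial top of stack and the upper edge of the guard page.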

This confirms that the stack space is insufficient and is overflowing.
I am not sure why we don't hit this in the presence of io-threads, though. It may just be that with io-threads in the graph there is some allocated but unused memory which our stack freely grows into.
That would be a silent, undetected reuse of some memory.
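
For reference, the guard-page idea used in this comment can be tried outside GlusterFS with a minimal sketch like the one below. It is not the patch itself: it swaps GF_CALLOC for an anonymous mmap so that the region is page-aligned and mprotect is applied to mapped memory, and both helper names are hypothetical. The point it illustrates is that a write below the usable stack faults immediately instead of silently landing in neighbouring memory. It assumes Linux, where a write to a PROT_NONE page raises SIGSEGV.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not GlusterFS API): map `size` bytes for a task
 * stack and make the lowest page inaccessible. Returns the base of the
 * mapping, which is also the start of the guard page, or NULL on error. */
static void *alloc_guarded_stack(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || size < 2 * (size_t)page)
        return NULL;

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* A downward-growing stack reaches the lowest page last. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, size);
        return NULL;
    }
    return base;
}

/* Hypothetical helper: returns 1 if writing one byte to `addr` kills a
 * forked child with SIGSEGV, 0 if the write succeeds, -1 on error. */
static int write_faults(volatile char *addr)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        *addr = 1; /* faults here if addr is inside the guard page */
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Calling write_faults() on the base address returned by alloc_guarded_stack() should report a fault, while a write near the top of the mapping should succeed. The trade-off is the same as in the patch: one page of the allocation is given up in exchange for a deterministic crash at the point of overflow.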
