If io-threads is removed from the graph on a quota-enabled volume, the brick process segfaults on writes.

Steps to reproduce:
1) Create a volume and enable quota on it.
2) Stop the volume, remove io-threads from the graph, and restart the volume.
3) Mount the volume and do I/O.
4) The brick process crashes.

The mail thread below has a related discussion:
http://www.spinics.net/lists/gluster-devel/msg23111.html

The issue is that marker allocates only 16k of stack space for the syncop threads, and this seems to be insufficient.
While trying to reproduce this, I found that on each run some random structures were getting corrupted, which led to a segfault. To confirm that the stack was indeed growing through all of the allocated space and beyond, I set a guard page near the end of the allocated stack space (so that we hit a segfault before overrunning it). The code changes are below.

@@ -443,6 +443,8 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
         struct synctask *newtask = NULL;
         xlator_t *this = THIS;
         int destroymode = 0;
+        int r=0;
+        char *v;
 
         VALIDATE_OR_GOTO (env, err);
         VALIDATE_OR_GOTO (fn, err);
@@ -498,9 +500,15 @@ synctask_create (struct syncenv *env, size_t stacksize, synctask_fn_t fn,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = env->stacksize;
         } else {
-                newtask->stack = GF_CALLOC (1, stacksize,
+                newtask->stack = GF_CALLOC (1, stacksize,
                                             gf_common_mt_syncstack);
                 newtask->ctx.uc_stack.ss_size = stacksize;
+                if (stacksize == 16*1024) {
+                        v = (unsigned long)((char *)(newtask->stack) + 4095) & (~4095);
+                        r = mprotect(v, 4096, PROT_NONE);
+                        gf_msg ("syncop", GF_LOG_ERROR, errno,
+                                LG_MSG_GETCONTEXT_FAILED, "SKU: using 16k stack starting at %p, mprotect returned %d, guard page: %p", newtask->stack, r, v);
+                }
         }

(gdb) where
#0  0x00007f8a92c51204 in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#1  0x00007f8a92c561e3 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f8a92c5dd33 in _dl_runtime_resolve_avx () from /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000000 in ?? ()

(gdb) info reg
rdi            0x7f8a92946188   140233141412232
rbp            0x7f8a800b4000   0x7f8a800b4000
rsp            0x7f8a800b4000   0x7f8a800b4000
r8             0x7f8a92e4ba60   140233146677856

(gdb) layout asm
>│0x7f8a92c51204 <_dl_lookup_symbol_x+4> push %r15    <== push on stack at the guarded page caused segfault

From the brick log we have:

[syncop.c:515:synctask_create] 0-syncop: SKU: using 16k stack starting at 0x7f8a800b28f0, mprotect returned 0, guard page: 0x7f8a800b3000 [No data available]

The stack grows downward from 0x7f8a800b68f0 to 0x7f8a800b28f0, and the page 0x7f8a800b3000 - 0x7f8a800b4000 is guarded, which is where the segfault hit, as seen in gdb. This confirms that the stack space is insufficient and the stack is overflowing. I am not sure why we don't hit this when io-threads is present, though. It may just be that with io-threads in the graph there is some allocated but unused memory that our stack freely grows into. It may just be a silent, undetected reuse of some memory.
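
The address arithmetic behind the guard-page observation can be checked mechanically. The following is a minimal C sketch, not code from GlusterFS: the helper names (guard_page_start, stack_top, usable_before_guard, in_guard_page) are hypothetical, and the 4096-byte page size is an assumption taken from the value hard-coded in the debugging patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the 4095/4096 constants in the patch. */
#define GUARD_PAGE_SIZE UINT64_C(4096)

/* First page-aligned address at or above the stack base.
 * Same rounding as the patch: (base + 4095) & ~4095. */
static inline uint64_t guard_page_start(uint64_t stack_base)
{
    return (stack_base + GUARD_PAGE_SIZE - 1) & ~(GUARD_PAGE_SIZE - 1);
}

/* End of the allocation (exclusive). The stack grows down from here. */
static inline uint64_t stack_top(uint64_t stack_base, size_t stack_size)
{
    return stack_base + (uint64_t)stack_size;
}

/* Bytes between the top of the stack and the upper edge of the guard page,
 * i.e. how much the task can use before a write lands in the guard page. */
static inline uint64_t usable_before_guard(uint64_t stack_base, size_t stack_size)
{
    return stack_top(stack_base, stack_size)
           - (guard_page_start(stack_base) + GUARD_PAGE_SIZE);
}

/* Non-zero if addr falls inside the guarded page. */
static inline int in_guard_page(uint64_t stack_base, uint64_t addr)
{
    uint64_t start = guard_page_start(stack_base);
    return addr >= start && addr < start + GUARD_PAGE_SIZE;
}
```

Fed the values from the brick log (base 0x7f8a800b28f0, size 16*1024), guard_page_start gives 0x7f8a800b3000 and stack_top gives 0x7f8a800b68f0. The faulting push moves rsp from 0x7f8a800b4000 to 0x7f8a800b3ff8, which in_guard_page reports as inside the guarded page. usable_before_guard gives 0x28f0 (10480) bytes, so with the guard page placed this way the fault fires once the task has used a little over 10 KB of its 16 KB.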
As quota is not being actively developed, we are closing this bug.