Bug 824646

Summary: sge_execd segfaults in ptmalloc_lock_all during fork
Product: [Fedora] Fedora
Component: jemalloc
Version: 17
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Orion Poplawski <orion>
Assignee: Ingvar Hagelund <ingvar>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: ingvar
Type: Bug
Doc Type: Bug Fix
Last Closed: 2012-06-05 23:11:34 UTC

Description Orion Poplawski 2012-05-23 21:36:23 UTC
Description of problem:

With jemalloc 3.0.0 I'm seeing segfaults in fork() in sge_execd on both i686 and x86_64.

If I run it under valgrind it runs fine and no errors are reported.  That the crash disappears seems to confirm a memory issue, but it doesn't help locate it.

Temporary breakpoint 6, main (argc=1, argv=0x7fffffffe3f8) at ../daemons/execd/execd.c:150
150     {
(gdb) print __malloc_hook
$16 = (void *(* const)(size_t)) 0x321e605070 <malloc>

Program received signal SIGSEGV, Segmentation fault.
0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
251       __malloc_hook = malloc_atfork;

(gdb) thread apply all bt

Thread 5 (Thread 0x7fffeefff700 (LWP 32735)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742cc20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cc20, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f6a0, sec=<optimized out>, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050f952 in cl_com_handle_write_thread (t_conf=0x7ffff742f6a0)
    at ../libs/comm/cl_commlib.c:7989
#5  0x0000003ea9407d14 in start_thread (arg=0x7fffeefff700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fffefdc9700 (LWP 32734)):
#0  0x0000003ea8ce8eef in __GI___poll (fds=fds@entry=0x7fffef00e000, nfds=nfds@entry=2, 
    timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x000000000051ed02 in cl_com_tcp_open_connection_request_handler (
    poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, 
    connection_list=<optimized out>, 
    service_connection=service_connection@entry=0x7ffff74be300, 
    timeout_val_sec=<optimized out>, timeout_val_usec=<optimized out>, 
    select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_tcp_framework.c:1862
#2  0x00000000004f8582 in cl_com_open_connection_request_handler (
    poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, 
    timeout_val_sec=1, timeout_val_usec=<optimized out>, 
    select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_communication.c:3466
#3  0x000000000051044c in cl_com_handle_read_thread (t_conf=0x7ffff742f650)
    at ../libs/comm/cl_commlib.c:7381
#4  0x0000003ea9407d14 in start_thread (arg=0x7fffefdc9700) at pthread_create.c:309
#5  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ffff05ca700 (LWP 32733)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742cb20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cb20, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f600, sec=<optimized out>, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050aff4 in cl_com_handle_service_thread (t_conf=0x7ffff742f600)
    at ../libs/comm/cl_commlib.c:7256
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff05ca700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7ffff0dcb700 (LWP 32732)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742c660) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742c660, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f5b0, sec=sec@entry=1, 
    micro_sec=micro_sec@entry=0) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050733f in cl_com_trigger_thread (t_conf=0x7ffff742f5b0)
    at ../libs/comm/cl_commlib.c:7174
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff0dcb700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ffff7fee780 (LWP 32731)):
#0  0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
#1  0x0000003ea8cbaa8a in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:96
#2  0x0000003ea94102d5 in __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
#3  0x0000000000432732 in sge_exec_job (ctx=ctx@entry=0x7ffff740e300, 
    jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0, 
    err_str=err_str@entry=0x7fffffffddd0 "\030", err_length=err_length@entry=256)
    at ../daemons/execd/exec_job.c:1866
#4  0x000000000043413c in exec_job_or_task (ctx=ctx@entry=0x7ffff740e300, 
    jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0)
    at ../daemons/execd/execd_ck_to_do.c:774
#5  0x0000000000435238 in sge_start_jobs (ctx=0x7ffff740e300)
    at ../daemons/execd/execd_ck_to_do.c:661
#6  do_ck_to_do (ctx=ctx@entry=0x7ffff740e300, is_qmaster_down=is_qmaster_down@entry=false)
    at ../daemons/execd/execd_ck_to_do.c:387
#7  0x000000000042b5eb in sge_execd_process_messages (ctx=0x7ffff740e300)
    at ../daemons/execd/dispatcher.c:332
#8  0x0000000000427ab1 in main (argc=1, argv=<optimized out>) at ../daemons/execd/execd.c:380

(gdb) list
246         ar_ptr = ar_ptr->next;
247         if(ar_ptr == &main_arena) break;
248       }
249       save_malloc_hook = __malloc_hook;
250       save_free_hook = __free_hook;
251       __malloc_hook = malloc_atfork;
252       __free_hook = free_atfork;
253       /* Only the current thread may perform malloc/free calls now. */
254       tsd_getspecific(arena_key, save_arena);
255       tsd_setspecific(arena_key, ATFORK_ARENA_PTR);
(gdb) print __malloc_hook
$8 = (void *(* const)(size_t)) 0x321e605070 <malloc>
(gdb) print malloc_atfork
$11 = {void *(size_t, const void *)} 0x3ea8c7fe20 <malloc_atfork>
(gdb) set __malloc_hook = malloc_atfork
(gdb) print __malloc_hook
$18 = (void *(* const)(size_t)) 0x3ea8c7fe20 <malloc_atfork>
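
For readers unfamiliar with this code path: frames #0 and #1 of thread 1 show ptmalloc_lock_all() being called from inside fork(), and lines 249-252 of the listing save the current hooks and install fork-time replacements. The following is a minimal sketch of that save-and-replace pattern using pthread_atfork(). It is not glibc's code: alloc_hook, saved_hook, normal_alloc, atfork_alloc, prepare_calls and fork_with_hooks are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for glibc's __malloc_hook and save_malloc_hook.
   None of these names are real glibc or jemalloc symbols. */
typedef void *(*alloc_hook_t)(size_t);

static void *normal_alloc(size_t n) { return malloc(n); }
static void *atfork_alloc(size_t n) { return calloc(1, n); }

static alloc_hook_t alloc_hook = normal_alloc; /* must live in writable memory */
static alloc_hook_t saved_hook;
int prepare_calls = 0;

/* Prepare handler: runs in the parent, inside fork(), before the child exists. */
static void prepare(void) {
    saved_hook = alloc_hook;     /* analogue of arena.c:249 */
    alloc_hook = atfork_alloc;   /* analogue of arena.c:251, the faulting store */
    prepare_calls++;
}

/* Parent and child handler: undo what prepare() did. */
static void restore(void) {
    assert(saved_hook != NULL);  /* restore() only makes sense after prepare() */
    alloc_hook = saved_hook;
}

int hook_is_restored(void) { return alloc_hook == normal_alloc; }

/* Forks once with the handlers installed. Returns the child's exit status
   (0 if the child saw the hook restored), or -1 on any failure. */
int fork_with_hooks(void) {
    static int registered = 0;
    if (!registered) {
        if (pthread_atfork(prepare, restore, restore) != 0) return -1;
        registered = 1;
    }
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(hook_is_restored() ? 0 : 1);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling fork_with_hooks() from a small main() exercises the whole cycle. The point of the sketch is that the store in prepare() only works if the hook variable is writable, which is the store that faults in the backtrace above. Older glibc versions may need -pthread at link time.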

On F16 with jemalloc 2.2.5 I see:
(gdb) print __malloc_hook
$1 = (void *(*)(size_t, const void *)) 0x33230828c0 <malloc_hook_ini>
(gdb) print malloc_atfork
$2 = {void *(size_t, const void *)} 0x33230827a0 <malloc_atfork>

Running under electricfence shows no errors.

I set a watchpoint on __malloc_hook, but it doesn't trigger before the segfault.  The value doesn't appear to change, so I don't know why the assignment at arena.c:251 segfaults.
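
One detail in the gdb output may be relevant, though nothing in this report confirms it as the cause. On F17, __malloc_hook is printed with a const-qualified type, (void *(* const)(size_t)), and points at a malloc outside libc's address range. On F16 it has the non-const type (void *(*)(size_t, const void *)) and points at malloc_hook_ini. A store to a const object can fault if the toolchain placed that object in a read-only mapping, and a store that faults never changes the value, which would also fit a watchpoint that never fires. The sketch below is Linux-specific and assumes /proc/self/maps is readable. It reports whether the page holding a given object is mapped writable. mutable_hook, const_hook and is_writable are hypothetical names, not glibc or jemalloc symbols.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook variables, one writable and one const-qualified. */
void *(*mutable_hook)(size_t) = malloc;
void *(*const const_hook)(size_t) = malloc;

/* Returns 1 if addr lies in a writable mapping of this process, 0 if the
   mapping is not writable, and -1 if no mapping was found or
   /proc/self/maps could not be read. */
int is_writable(const void *addr) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int result = -1;

    while (fgets(line, sizeof line, maps) != NULL) {
        uintptr_t lo, hi;
        char perms[5];
        /* Each entry starts with "start-end perms", e.g. "00400000-00452000 r-xp". */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s", &lo, &hi, perms) != 3)
            continue;
        if (target >= lo && target < hi) {
            result = (perms[1] == 'w');
            break;
        }
    }
    fclose(maps);
    return result;
}
```

Comparing is_writable(&mutable_hook) with is_writable(&const_hook) shows where each object landed. Whether the const object ends up in read-only memory depends on the compiler and linker settings, so the result for const_hook should be checked on the affected system, not assumed.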

Version-Release number of selected component (if applicable):
jemalloc-3.0.0-1.fc17
jemalloc-3.0.0-1.fc16

How reproducible:
Nearly every time, though the crash has occasionally gone away.

Comment 1 Ingvar Hagelund 2012-05-24 07:49:18 UTC
Orion,
can you please confirm that this scratch build fixes the problem?

http://koji.fedoraproject.org/koji/taskinfo?taskID=4097345

The only change from the 3.0.0-1 release is the patch mentioned on the list:
http://www.canonware.com/pipermail/jemalloc-discuss/2012-May/000420.html

http://www.canonware.com/cgi-bin/gitweb.cgi?p=jemalloc.git;a=patch;h=5c710cee783a44061fa2c467ffd8984b8047b90e

Ingvar

Comment 2 Orion Poplawski 2012-05-24 15:08:26 UTC
That appears to work.  I'll try to keep testing but so far so good.

Comment 3 Fedora Update System 2012-05-25 11:04:21 UTC
jemalloc-3.0.0-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/jemalloc-3.0.0-2.fc17

Comment 4 Fedora Update System 2012-05-26 07:43:37 UTC
Package jemalloc-3.0.0-2.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing jemalloc-3.0.0-2.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8409/jemalloc-3.0.0-2.fc17
then log in and leave karma (feedback).

Comment 5 Fedora Update System 2012-06-05 23:11:34 UTC
jemalloc-3.0.0-2.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.