Bug 824646 - sge_execd segfaults in ptmalloc_lock_all during fork
Summary: sge_execd segfaults in ptmalloc_lock_all during fork
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: jemalloc
Version: 17
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ingvar Hagelund
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-05-23 21:36 UTC by Orion Poplawski
Modified: 2012-06-05 23:11 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2012-06-05 23:11:34 UTC
Type: Bug
Embargoed:



Description Orion Poplawski 2012-05-23 21:36:23 UTC
Description of problem:

With jemalloc 3.0.0 I'm seeing segfaults in fork() in sge_execd on both i686 and x86_64.

If I run it under valgrind it runs fine and no errors are reported.  That the crash disappears under valgrind seems to point to a memory issue, but it doesn't help narrow it down.

Temporary breakpoint 6, main (argc=1, argv=0x7fffffffe3f8) at ../daemons/execd/execd.c:150
150     {
(gdb) print __malloc_hook
$16 = (void *(* const)(size_t)) 0x321e605070 <malloc>

Program received signal SIGSEGV, Segmentation fault.
0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
251       __malloc_hook = malloc_atfork;

(gdb) thread apply all bt

Thread 5 (Thread 0x7fffeefff700 (LWP 32735)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742cc20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cc20, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f6a0, sec=<optimized out>, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050f952 in cl_com_handle_write_thread (t_conf=0x7ffff742f6a0)
    at ../libs/comm/cl_commlib.c:7989
#5  0x0000003ea9407d14 in start_thread (arg=0x7fffeefff700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fffefdc9700 (LWP 32734)):
#0  0x0000003ea8ce8eef in __GI___poll (fds=fds@entry=0x7fffef00e000, nfds=nfds@entry=2, 
    timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x000000000051ed02 in cl_com_tcp_open_connection_request_handler (
    poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, 
    connection_list=<optimized out>, 
    service_connection=service_connection@entry=0x7ffff74be300, 
    timeout_val_sec=<optimized out>, timeout_val_usec=<optimized out>, 
    select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_tcp_framework.c:1862
#2  0x00000000004f8582 in cl_com_open_connection_request_handler (
    poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, 
    timeout_val_sec=1, timeout_val_usec=<optimized out>, 
    select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_communication.c:3466
#3  0x000000000051044c in cl_com_handle_read_thread (t_conf=0x7ffff742f650)
    at ../libs/comm/cl_commlib.c:7381
#4  0x0000003ea9407d14 in start_thread (arg=0x7fffefdc9700) at pthread_create.c:309
#5  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ffff05ca700 (LWP 32733)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742cb20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cb20, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f600, sec=<optimized out>, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050aff4 in cl_com_handle_service_thread (t_conf=0x7ffff742f600)
    at ../libs/comm/cl_commlib.c:7256
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff05ca700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7ffff0dcb700 (LWP 32732)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, 
    sec=<optimized out>, condition=0x7ffff742c660) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742c660, sec=1, 
    micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (
    thread_config=thread_config@entry=0x7ffff742f5b0, sec=sec@entry=1, 
    micro_sec=micro_sec@entry=0) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050733f in cl_com_trigger_thread (t_conf=0x7ffff742f5b0)
    at ../libs/comm/cl_commlib.c:7174
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff0dcb700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ffff7fee780 (LWP 32731)):
#0  0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
#1  0x0000003ea8cbaa8a in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:96
#2  0x0000003ea94102d5 in __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
#3  0x0000000000432732 in sge_exec_job (ctx=ctx@entry=0x7ffff740e300, 
    jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0, 
    err_str=err_str@entry=0x7fffffffddd0 "\030", err_length=err_length@entry=256)
    at ../daemons/execd/exec_job.c:1866
#4  0x000000000043413c in exec_job_or_task (ctx=ctx@entry=0x7ffff740e300, 
    jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0)
    at ../daemons/execd/execd_ck_to_do.c:774
#5  0x0000000000435238 in sge_start_jobs (ctx=0x7ffff740e300)
    at ../daemons/execd/execd_ck_to_do.c:661
#6  do_ck_to_do (ctx=ctx@entry=0x7ffff740e300, is_qmaster_down=is_qmaster_down@entry=false)
    at ../daemons/execd/execd_ck_to_do.c:387
#7  0x000000000042b5eb in sge_execd_process_messages (ctx=0x7ffff740e300)
    at ../daemons/execd/dispatcher.c:332
#8  0x0000000000427ab1 in main (argc=1, argv=<optimized out>) at ../daemons/execd/execd.c:380

(gdb) list
246         ar_ptr = ar_ptr->next;
247         if(ar_ptr == &main_arena) break;
248       }
249       save_malloc_hook = __malloc_hook;
250       save_free_hook = __free_hook;
251       __malloc_hook = malloc_atfork;
252       __free_hook = free_atfork;
253       /* Only the current thread may perform malloc/free calls now. */
254       tsd_getspecific(arena_key, save_arena);
255       tsd_setspecific(arena_key, ATFORK_ARENA_PTR);
(gdb) print __malloc_hook
$8 = (void *(* const)(size_t)) 0x321e605070 <malloc>
(gdb) print malloc_atfork
$11 = {void *(size_t, const void *)} 0x3ea8c7fe20 <malloc_atfork>
(gdb) set __malloc_hook = malloc_atfork
(gdb) print __malloc_hook
$18 = (void *(* const)(size_t)) 0x3ea8c7fe20 <malloc_atfork>
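
The listing above shows ptmalloc_lock_all saving the current hooks and installing fork-safe replacements before the fork proceeds. A minimal sketch of that save/swap/restore pattern follows, assuming a writable hook. The names demo_hook, normal_alloc, atfork_alloc and fork_with_hook_swap are hypothetical, and pthread_atfork stands in for glibc's internal fork path; this is an illustration, not glibc's code.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void *(*hook_fn)(size_t);

/* Hypothetical stand-ins for the normal and fork-time allocators. */
static void *normal_alloc(size_t n) { return malloc(n); }
static void *atfork_alloc(size_t n) { return malloc(n); }

/* Writable (non-const) hook, so the prepare handler may store to it. */
static hook_fn demo_hook = normal_alloc;
static hook_fn saved_hook;
static int swapped_in_prepare;

/* Runs before fork(): save the hook, then install the fork-time one. */
static void prepare(void)
{
    saved_hook = demo_hook;
    demo_hook = atfork_alloc;
    swapped_in_prepare = (demo_hook == atfork_alloc);
}

/* Runs after fork() in both parent and child: put the saved hook back. */
static void restore(void)
{
    demo_hook = saved_hook;
}

/* Returns 1 if the hook was swapped during fork and restored on both
 * sides afterwards, 0 if not, -1 on error. */
int fork_with_hook_swap(void)
{
    static int registered;
    if (!registered) {
        if (pthread_atfork(prepare, restore, restore) != 0)
            return -1;
        registered = 1;
    }
    swapped_in_prepare = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(demo_hook == normal_alloc ? 0 : 1); /* child: hook restored? */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return swapped_in_prepare && demo_hook == normal_alloc && child_ok;
}
```

The pattern only works if the store in the prepare step is allowed to succeed, which is the step that faults at arena.c:251 in the trace above.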

On F16 with jemalloc 2.2.5 I see:
(gdb) print __malloc_hook
$1 = (void *(*)(size_t, const void *)) 0x33230828c0 <malloc_hook_ini>
(gdb) print malloc_atfork
$2 = {void *(size_t, const void *)} 0x33230827a0 <malloc_atfork>

Running under electricfence shows no errors.

I set a watchpoint on __malloc_hook but it doesn't trigger before the segfault.  It doesn't appear to change so I don't know why I get the segfault.
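
One possible explanation, offered as an assumption rather than a confirmed diagnosis: gdb prints __malloc_hook with a const-qualified type under jemalloc 3.0.0, and a const object with static storage may be placed in read-only memory. In that case the store at arena.c:251 would fault before the value ever changes, which would also fit a watchpoint that never triggers. The sketch below probes that behaviour in a child process. The names demo_hook, demo_alloc, demo_atfork_alloc and try_hook_write are hypothetical, and writing through a cast-away const is undefined behaviour, so the outcome depends on the toolchain and linker settings.

```c
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void *(*hook_fn)(size_t);

/* Hypothetical stand-ins for the two allocator entry points. */
static void *demo_alloc(size_t n) { return malloc(n); }
static void *demo_atfork_alloc(size_t n) { return malloc(n); }

/* const-qualified hook, mirroring the type gdb printed for __malloc_hook.
 * The toolchain is free to place this object in read-only memory. */
hook_fn const demo_hook = demo_alloc;

/* Tries to overwrite the const hook in a child process.
 * Returns the number of the signal that killed the child, 0 if the
 * write went through, or -1 on error. */
int try_hook_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Undefined behaviour on purpose: cast away const and store. */
        hook_fn volatile *p = (hook_fn volatile *)&demo_hook;
        *p = demo_atfork_alloc;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return WIFEXITED(status) ? 0 : -1;
}
```

A result of SIGSEGV (or SIGBUS on some systems) would be consistent with the const object living in read-only memory; a result of 0 only means this particular build kept it writable.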

Version-Release number of selected component (if applicable):
jemalloc-3.0.0-1.fc17
jemalloc-3.0.0-1.fc16

How reproducible:
Nearly every time.  It has gone away at times though.

Comment 1 Ingvar Hagelund 2012-05-24 07:49:18 UTC
Orion,
Can you please confirm that this scratch build fixes the problem?

http://koji.fedoraproject.org/koji/taskinfo?taskID=4097345

The only change from the 3.0.0-1 release is the patch mentioned on the list, 
http://www.canonware.com/pipermail/jemalloc-discuss/2012-May/000420.html

http://www.canonware.com/cgi-bin/gitweb.cgi?p=jemalloc.git;a=patch;h=5c710cee783a44061fa2c467ffd8984b8047b90e

Ingvar

Comment 2 Orion Poplawski 2012-05-24 15:08:26 UTC
That appears to work.  I'll try to keep testing but so far so good.

Comment 3 Fedora Update System 2012-05-25 11:04:21 UTC
jemalloc-3.0.0-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/jemalloc-3.0.0-2.fc17

Comment 4 Fedora Update System 2012-05-26 07:43:37 UTC
Package jemalloc-3.0.0-2.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing jemalloc-3.0.0-2.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8409/jemalloc-3.0.0-2.fc17
then log in and leave karma (feedback).

Comment 5 Fedora Update System 2012-06-05 23:11:34 UTC
jemalloc-3.0.0-2.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

