Description of problem:

With jemalloc 3.0.0 I'm seeing segfaults in fork() in sge_execd on both i686 and x86_64. If I run it under valgrind it runs fine and no errors are reported. This seems to confirm memory issues, but isn't helpful.

Temporary breakpoint 6, main (argc=1, argv=0x7fffffffe3f8) at ../daemons/execd/execd.c:150
150     {
(gdb) print __malloc_hook
$16 = (void *(* const)(size_t)) 0x321e605070 <malloc>

Program received signal SIGSEGV, Segmentation fault.
0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
251       __malloc_hook = malloc_atfork;
(gdb) thread apply all bt

Thread 5 (Thread 0x7fffeefff700 (LWP 32735)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, sec=<optimized out>, condition=0x7ffff742cc20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cc20, sec=1, micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (thread_config=thread_config@entry=0x7ffff742f6a0, sec=<optimized out>, micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050f952 in cl_com_handle_write_thread (t_conf=0x7ffff742f6a0) at ../libs/comm/cl_commlib.c:7989
#5  0x0000003ea9407d14 in start_thread (arg=0x7fffeefff700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fffefdc9700 (LWP 32734)):
#0  0x0000003ea8ce8eef in __GI___poll (fds=fds@entry=0x7fffef00e000, nfds=nfds@entry=2, timeout=timeout@entry=1000) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x000000000051ed02 in cl_com_tcp_open_connection_request_handler (poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, connection_list=<optimized out>, service_connection=service_connection@entry=0x7ffff74be300, timeout_val_sec=<optimized out>, timeout_val_usec=<optimized out>, select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_tcp_framework.c:1862
#2  0x00000000004f8582 in cl_com_open_connection_request_handler (poll_handle=poll_handle@entry=0x7ffff742cb80, handle=handle@entry=0x7ffff74be180, timeout_val_sec=1, timeout_val_usec=<optimized out>, select_mode=select_mode@entry=CL_R_SELECT) at ../libs/comm/cl_communication.c:3466
#3  0x000000000051044c in cl_com_handle_read_thread (t_conf=0x7ffff742f650) at ../libs/comm/cl_commlib.c:7381
#4  0x0000003ea9407d14 in start_thread (arg=0x7fffefdc9700) at pthread_create.c:309
#5  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ffff05ca700 (LWP 32733)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, sec=<optimized out>, condition=0x7ffff742cb20) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742cb20, sec=1, micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (thread_config=thread_config@entry=0x7ffff742f600, sec=<optimized out>, micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050aff4 in cl_com_handle_service_thread (t_conf=0x7ffff742f600) at ../libs/comm/cl_commlib.c:7256
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff05ca700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7ffff0dcb700 (LWP 32732)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:218
#1  0x0000000000523f9c in cl_thread_wait_for_thread_condition (micro_sec=<optimized out>, sec=<optimized out>, condition=0x7ffff742c660) at ../libs/comm/lists/cl_thread.c:259
#2  cl_thread_wait_for_thread_condition (condition=0x7ffff742c660, sec=1, micro_sec=<optimized out>) at ../libs/comm/lists/cl_thread.c:191
#3  0x00000000005243d3 in cl_thread_wait_for_event (thread_config=thread_config@entry=0x7ffff742f5b0, sec=sec@entry=1, micro_sec=micro_sec@entry=0) at ../libs/comm/lists/cl_thread.c:613
#4  0x000000000050733f in cl_com_trigger_thread (t_conf=0x7ffff742f5b0) at ../libs/comm/cl_commlib.c:7174
#5  0x0000003ea9407d14 in start_thread (arg=0x7ffff0dcb700) at pthread_create.c:309
#6  0x0000003ea8cf199d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ffff7fee780 (LWP 32731)):
#0  0x0000003ea8c7a981 in ptmalloc_lock_all () at arena.c:251
#1  0x0000003ea8cbaa8a in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:96
#2  0x0000003ea94102d5 in __fork () at ../nptl/sysdeps/unix/sysv/linux/pt-fork.c:26
#3  0x0000000000432732 in sge_exec_job (ctx=ctx@entry=0x7ffff740e300, jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0, err_str=err_str@entry=0x7fffffffddd0 "\030", err_length=err_length@entry=256) at ../daemons/execd/exec_job.c:1866
#4  0x000000000043413c in exec_job_or_task (ctx=ctx@entry=0x7ffff740e300, jep=jep@entry=0x7fffef03e180, jatep=jatep@entry=0x7ffff74f2280, petep=petep@entry=0x0) at ../daemons/execd/execd_ck_to_do.c:774
#5  0x0000000000435238 in sge_start_jobs (ctx=0x7ffff740e300) at ../daemons/execd/execd_ck_to_do.c:661
#6  do_ck_to_do (ctx=ctx@entry=0x7ffff740e300, is_qmaster_down=is_qmaster_down@entry=false) at ../daemons/execd/execd_ck_to_do.c:387
#7  0x000000000042b5eb in sge_execd_process_messages (ctx=0x7ffff740e300) at ../daemons/execd/dispatcher.c:332
#8  0x0000000000427ab1 in main (argc=1, argv=<optimized out>) at ../daemons/execd/execd.c:380

(gdb) list
246         ar_ptr = ar_ptr->next;
247         if(ar_ptr == &main_arena) break;
248       }
249       save_malloc_hook = __malloc_hook;
250       save_free_hook = __free_hook;
251       __malloc_hook = malloc_atfork;
252       __free_hook = free_atfork;
253       /* Only the current thread may perform malloc/free calls now. */
254       tsd_getspecific(arena_key, save_arena);
255       tsd_setspecific(arena_key, ATFORK_ARENA_PTR);
(gdb) print __malloc_hook
$8 = (void *(* const)(size_t)) 0x321e605070 <malloc>
(gdb) print malloc_atfork
$11 = {void *(size_t, const void *)} 0x3ea8c7fe20 <malloc_atfork>
(gdb) set __malloc_hook = malloc_atfork
(gdb) print __malloc_hook
$18 = (void *(* const)(size_t)) 0x3ea8c7fe20 <malloc_atfork>

On F16 with jemalloc 2.2.5 I see:

(gdb) print __malloc_hook
$1 = (void *(*)(size_t, const void *)) 0x33230828c0 <malloc_hook_ini>
(gdb) print malloc_atfork
$2 = {void *(size_t, const void *)} 0x33230827a0 <malloc_atfork>

Running under electricfence shows no errors. I set a watchpoint on __malloc_hook but it doesn't trigger before the segfault. The value doesn't appear to change, so I don't know why I get the segfault.

Version-Release number of selected component (if applicable):
jemalloc-3.0.0-1.fc17
jemalloc-3.0.0-1.fc16

How reproducible:
Nearly every time. It has gone away at times though.
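
The arena.c listing above shows glibc saving the current hook and then storing malloc_atfork into __malloc_hook at line 251, which is the store that faults. One possible reading of the gdb output (an assumption, not something confirmed in this report) is that the const in the type printed for __malloc_hook under jemalloc 3.0.0 puts the variable in read-only memory, so the store faults before the value ever changes; that would be consistent with the watchpoint not triggering. The sketch below only illustrates the save/replace/restore pattern on a writable hook. The names demo_hook, normal_alloc, atfork_alloc and swap_and_restore are hypothetical and are not glibc or jemalloc symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook type, modelled on the one-argument type gdb printed. */
typedef void *(*alloc_hook_t)(size_t);

static int atfork_calls = 0;

/* Stand-in for the allocator that is normally installed in the hook. */
static void *normal_alloc(size_t n) { return malloc(n); }

/* Stand-in for the temporary allocator installed around fork(). */
static void *atfork_alloc(size_t n) { atfork_calls++; return malloc(n); }

/*
 * Writable hook. If this were declared `alloc_hook_t const demo_hook`,
 * the assignments in swap_and_restore() would be rejected by the compiler
 * in this file. A separately compiled writer that sees a non-const
 * declaration (assumed here to be the glibc/jemalloc situation) would
 * compile, but its store could fault at run time.
 */
static alloc_hook_t demo_hook = normal_alloc;

/* Save the hook, swap in the fork-time allocator, use it, then restore.
 * Returns 1 if the swap took effect and the original hook was restored. */
int swap_and_restore(void) {
    alloc_hook_t saved = demo_hook;      /* like save_malloc_hook = __malloc_hook */
    int before = atfork_calls;
    assert(saved != NULL);

    demo_hook = atfork_alloc;            /* like __malloc_hook = malloc_atfork */
    void *p = demo_hook(16);
    int swapped = (p != NULL) && (atfork_calls == before + 1);
    free(p);

    demo_hook = saved;                   /* restore after the fork-time window */
    return swapped && (demo_hook == normal_alloc);
}
```

Calling swap_and_restore() from a small main() and checking that it returns 1 is enough to see the pattern work when the hook is writable.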
Orion, can you please confirm that this scratch build fixes the problem?

http://koji.fedoraproject.org/koji/taskinfo?taskID=4097345

The only change from the 3.0.0-1 release is the patch mentioned on the list:

http://www.canonware.com/pipermail/jemalloc-discuss/2012-May/000420.html
http://www.canonware.com/cgi-bin/gitweb.cgi?p=jemalloc.git;a=patch;h=5c710cee783a44061fa2c467ffd8984b8047b90e

Ingvar
That appears to work. I'll try to keep testing but so far so good.
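
For continued testing, a small stand-alone function that calls fork() after an allocation exercises the same entry point seen in the backtrace (fork() called from the main thread) without running sge_execd. This is only a sketch: fork_and_wait is a hypothetical name, it does not start the extra commlib threads that sge_execd has, and whether it actually reproduces the crash with jemalloc 3.0.0-1 loaded is an assumption that has not been checked here.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate, fork, and wait for the child.
 * Returns the child's exit status (0 on success), or -1 if fork() or
 * waitpid() failed or the child did not exit normally. A crash inside
 * fork() itself would kill the calling process instead of returning. */
int fork_and_wait(void) {
    int status = 0;

    /* Touch the allocator before forking so it is initialised. */
    void *p = malloc(64);
    if (p == NULL)
        return -1;

    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                /* child: exit at once, no atexit handlers */

    free(p);
    if (pid < 0)
        return -1;               /* fork() failed */

    if (waitpid(pid, &status, 0) != pid)
        return -1;               /* could not collect the child */

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Built into a small program whose main() returns fork_and_wait(), it can be run with the jemalloc build under test loaded (for example via LD_PRELOAD, with the library path depending on the system) and should exit with status 0 if fork() completes.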
jemalloc-3.0.0-2.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/jemalloc-3.0.0-2.fc17
Package jemalloc-3.0.0-2.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing jemalloc-3.0.0-2.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-8409/jemalloc-3.0.0-2.fc17
then log in and leave karma (feedback).
jemalloc-3.0.0-2.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.