The crashing application is a feature-rich scheduler tightly integrated with Oracle (8.1.5). The problem does not occur on linux-2.0.35 with glibc-2.0.7 (Oracle 8.0.5), nor on 8 other Unix flavours.

The scheduler crashes when it fork/execs two agents. These agents connect to the Oracle database by means of the BEQ protocol (a helper process is fork/execed by the Oracle OCI library). The last fork() call returns the child pid to the parent, but the child process never reaches the code after the fork. The child dumps core before that.

Here's a typescript of a gdb session opening the core file:

(gdb) GNU gdb 4.17.0.11 with Linux support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `jcs master agent LX05'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /home/oracle/product/lx815/lib/libskgxp8.so...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libpthread.so.0...done.
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_nisplus.so.2...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libnss_files.so.2...done.
Reading symbols from /lib/libnss_nis.so.2...done.
Reading symbols from /lib/libnss_dns.so.2...done.
Reading symbols from /lib/libresolv.so.2...done.
#0  __pthread_mutex_init (mutex=0x0, mutex_attr=0xbfff8024) at spinlock.h:59
spinlock.h:59: No such file or directory.
(gdb) info stack
#0  __pthread_mutex_init (mutex=0x0, mutex_attr=0xbfff8024) at spinlock.h:59
#1  0x40041950 in __fresetlockfiles () at lockfile.c:83
#2  0x4003fbc8 in fork () at ptfork.c:92
#3  0x80f47cd in rsdmfor ()
#4  0x80ee315 in rsssrsc ()
#5  0x804c129 in main ()
#6  0x40090cb3 in __libc_start_main (main=0x804c040 <main>, argc=2, argv=0xbffffc84,
    init=0x804afdc <_init>, fini=0x849879c <_fini>, rtld_fini=0x4000a350 <_dl_fini>,
    stack_end=0xbffffc7c) at ../sysdeps/generic/libc-start.c:78
(gdb)

The machine involved is a dual Pentium III (2x450MHz) with 512 MB RAM. I've tried the SMP and non-SMP kernels as well as Linux-2.2.10ac12. The "dmesg" utility shows no related messages.

Please adjust the priority/severity to your taste ;-)
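For readers following along, here is a minimal sketch of how a parent process can observe the symptom described above: fork() hands back a valid child pid, yet the child is killed by a signal. It is not a reproduction of this bug (in the report the child dies inside fork() itself, which this sketch does not trigger). The names demo_child_signal, demo_crash and demo_ok are hypothetical and exist only for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that runs child_fn, then report how the
 * child ended. Returns the terminating signal number, 0 if the child exited
 * normally, or -1 if fork() or waitpid() failed. */
int demo_child_signal(void (*child_fn)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: run the payload and leave without flushing parent buffers. */
        child_fn();
        _exit(0);
    }

    /* Parent: fork() succeeded from our point of view, so the only way to
     * learn about a crash in the child is to collect its exit status. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for a child that segfaults (the real crash happens elsewhere). */
void demo_crash(void) { raise(SIGSEGV); }

/* Stand-in for a child that finishes cleanly. */
void demo_ok(void) { }
```

Calling demo_child_signal(demo_crash) yields SIGSEGV (signal 11, matching the gdb output), while demo_child_signal(demo_ok) yields 0.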
I will need a small example that I can use to test for the glibc problem. Also, have you tried the newer glibc packages available from rawhide?
I tried glibc from rawhide as you proposed, but it shows the same problem. Here's a patch I applied to glibc as a workaround:

==== CUT HERE ====
*** glibc.old/linuxthreads/lockfile.c   Wed Aug 11 19:52:27 1999
--- glibc/linuxthreads/lockfile.c       Wed Aug 11 19:48:52 1999
***************
*** 80,86 ****
--- 80,90 ----
    __pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_RECURSIVE_NP);
  
    for (fp = _IO_list_all; fp != NULL; fp = fp->_chain)
+ #if 0
      __pthread_mutex_init (fp->_lock, &attr);
+ #else
+     if (fp->_lock) __pthread_mutex_init (fp->_lock, &attr);
+ #endif
  
    __pthread_mutexattr_destroy (&attr);
  #endif
==== CUT HERE ====

I still do not know why this problem occurs. Somehow the _lock member of a stream gets set to 0, causing __pthread_mutex_init to segfault when fork() resets the stream locks in the child. I have been unable to isolate the problem in a small piece of sample code. Any ideas?

BTW, there's a core-file distributed with the glibc source tree.
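To make the idea behind the workaround easier to see outside the glibc tree, here is a standalone sketch of the same guard: walk a chain of stream-like records and reinitialize each lock, skipping entries whose lock pointer is NULL. struct demo_stream and demo_reset_locks are hypothetical stand-ins, not glibc internals, and the sketch uses the public pthread_* calls rather than the internal __pthread_* ones.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_RECURSIVE */
#endif
#include <stddef.h>
#include <pthread.h>

/* Hypothetical stand-in for a stdio stream: only the two fields the loop
 * needs. The real structure in glibc is different. */
struct demo_stream {
    pthread_mutex_t *lock;      /* may be NULL, as seen in the core dump */
    struct demo_stream *chain;  /* next stream in the list */
};

/* Reinitialize every non-NULL lock as a recursive mutex.
 * Returns the number of streams skipped because their lock was NULL. */
int demo_reset_locks(struct demo_stream *head)
{
    pthread_mutexattr_t attr;
    int skipped = 0;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);

    for (struct demo_stream *fp = head; fp != NULL; fp = fp->chain) {
        if (fp->lock == NULL) {
            /* Without this guard the init call would dereference NULL. */
            skipped++;
            continue;
        }
        pthread_mutex_init(fp->lock, &attr);
    }

    pthread_mutexattr_destroy(&attr);
    return skipped;
}
```

Note that the guard only avoids the crash; it does not explain how a stream ends up with a NULL lock in the first place.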
patch applied in glibc-2.1.2-5 and later
Commit pushed to master at https://github.com/openshift/origin https://github.com/openshift/origin/commit/16c9e6aa7306fac2458923dc422c5f0b5682c43a Fix for issue #4437 - restarting the haproxy router still dispatches connections to a downed backend.