From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)

Description of problem:
While testing SAP's latest Business Application Server on Red Hat 7.1, the application sometimes aborts with a segmentation fault. The executable was compiled and linked on RH 6.1.

How reproducible:
Sometimes

Steps to Reproduce:
Core was generated by `dw.sapWAS_DVEBMGS18 pf=/usr/sap/WAS/SYS/profile/WAS_DVEBMGS18_lwsp700'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libpthread.so.0...done.
warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 11580)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 11580): generic error
Reading symbols from /usr/sap/WAS/SYS/exe/run/dw_xml.so...done.
Loaded symbols for /usr/sap/WAS/SYS/exe/run/dw_xml.so
Reading symbols from /usr/sap/WAS/SYS/exe/run/dw_xtc.so...done.
Loaded symbols for /usr/sap/WAS/SYS/exe/run/dw_xtc.so
Reading symbols from /usr/sap/WAS/SYS/exe/run/dw_stl.so...done.
Loaded symbols for /usr/sap/WAS/SYS/exe/run/dw_stl.so
Reading symbols from /usr/lib/libstdc++-libc6.1-2.so.3...done.
Loaded symbols for /usr/lib/libstdc++-libc6.1-2.so.3
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 __pthread_alt_lock (lock=0x40869124, self=0x0) at spinlock.c:407
407 spinlock.c: No such file or directory.
in spinlock.c
(gdb) bt
#0 __pthread_alt_lock (lock=0x40869124, self=0x0) at spinlock.c:407
#1 0x40028c86 in __pthread_mutex_lock (mutex=0x40869114) at mutex.c:120
#2 0x40862e31 in __register_frame_info (begin=0x402a3d80, ob=0x402d5cc0) at ../../gcc/frame.c:627
#3 0x400c52d2 in _init () from /usr/sap/WAS/SYS/exe/run/dw_xml.so
#4 0x400c0131 in _init () from /usr/sap/WAS/SYS/exe/run/dw_xml.so
#5 0x4000df57 in _dl_init () at eval.c:41
(gdb)
[wasadm@lwsp700 work]$

Additional info:
The same executable runs without problems on SuSE 7.2. Both distributions claim to have glibc 2.2.2.
Can you try running it with LD_ASSUME_KERNEL=2.2.5 in the environment? AFAIK SuSE does not enable floating thread stacks (the above environment variable disables them in RHL 7.1). Can you attach an strace log? Also, can you run it under a debugger and see what the %gs register contains when it crashes?
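The suggestion above amounts to setting one environment variable for the affected process only, leaving the rest of the system untouched. A minimal sketch in Python of how a start wrapper might do that; the helper names `env_with_assumed_kernel` and `launch` are hypothetical and not part of any SAP or Red Hat tooling, and the value 2.2.5 is simply the one proposed in this report:

```python
import os
import subprocess


def env_with_assumed_kernel(base_env=None, assumed="2.2.5"):
    """Return a copy of the environment with LD_ASSUME_KERNEL set.

    The original mapping is not modified, so only child processes
    started with the returned mapping see the variable.
    """
    base = os.environ if base_env is None else base_env
    env = dict(base)
    env["LD_ASSUME_KERNEL"] = assumed
    return env


def launch(cmd, assumed="2.2.5"):
    """Start cmd (a list of arguments) with LD_ASSUME_KERNEL set.

    Hypothetical wrapper: the caller supplies the real command line.
    """
    return subprocess.Popen(cmd, env=env_with_assumed_kernel(assumed=assumed))
```

Building the environment separately from launching keeps the setting scoped to the one process tree being tested, which makes it easy to compare runs with and without the variable.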
I started the problematic executable with strace -ff -o trace <command> but got no output files (I noticed strange thread behaviour that I cannot explain). Under gdb the segmentation fault does not occur. Setting the environment variable LD_ASSUME_KERNEL seems to be a workaround: I have not seen the segmentation fault since. Thanks for the workaround.
Can you try LD_DEBUG=all to see whether it really crashes before the "transferring control: foobarbaz" line? If yes, would it be possible to pack the main binary and its DT_NEEDED libraries somewhere, so that I could check it out myself?
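Since LD_DEBUG=all output can be very large, the check asked for above can be done mechanically on a saved log. A rough sketch in Python, assuming the loader output was captured to a text file (for example by redirecting stderr); the function names are hypothetical, and the only log text relied on is the "transferring control:" marker mentioned in the question:

```python
def control_transferred(log_lines, marker="transferring control:"):
    """Return True if any log line contains the hand-off marker.

    If the marker is absent from the log of a crashed run, the crash
    happened while the dynamic loader was still running initializers.
    """
    return any(marker in line for line in log_lines)


def last_nonempty_line(log_lines):
    """Return the last non-blank log line, or None for an empty log.

    This is the last thing the loader reported before the process
    stopped writing.
    """
    last = None
    for line in log_lines:
        if line.strip():
            last = line.rstrip("\n")
    return last
```

Both functions accept any iterable of lines, so an open file object can be passed directly and a log of 100 MB or more does not have to be loaded into memory at once.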
I started the application with LD_DEBUG=all (without LD_ASSUME_KERNEL); the segmentation fault does not occur. About 120MB of output was generated (I do not think you are interested in it). To reproduce the error yourself, it is necessary to install the SAP Application Server software (several executables need to run in parallel) and the SAPDB 7.3 database software, and to create an initial database with the SAP Business Software in it (because the executables do not run without a db connection). During the initialization of the SAP system, and without LD_ASSUME_KERNEL in the environment, the application server stops with a segmentation fault (reproducible, but only in the initialization phase). To replay this you need about 750MB of compressed apps and db data. I cannot imagine you are really interested in that stuff. I downloaded 7.1.93 (roswell) and replayed the initialization without LD_ASSUME_KERNEL: no segmentation fault. There seems to be a change from glibc 2.2.2 to glibc 2.2.3 that makes the problem disappear.
I just tried to create a db2 database (DB/2 Version 7.1). The database tools crashed with the same behaviour:

[root@lwsp700 install]# gdb /usr/IBMdb2/V7.1/instance/db2icknm core
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/usr/IBMdb2/V7.1/instance/db2icknm db2xxx dbxxxadm'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/IBMdb2/V7.1/lib/libdb2.so.1...done.
Loaded symbols for /usr/IBMdb2/V7.1/lib/libdb2.so.1
Reading symbols from /usr/lib/libstdc++-libc6.1-1.so.2...done.
Loaded symbols for /usr/lib/libstdc++-libc6.1-1.so.2
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libpthread.so.0...done.
warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 17864)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 17864): generic error
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 __pthread_alt_lock (lock=0x40a58e58, self=0x0) at spinlock.c:407
407 spinlock.c: No such file or directory.
in spinlock.c
(gdb) bt
#0 __pthread_alt_lock (lock=0x40a58e58, self=0x0) at spinlock.c:407
#1 0x40a4ec86 in __pthread_mutex_lock (mutex=0x40a58e48) at mutex.c:120
#2 0x40a4f162 in __pthread_atfork (prepare=0x40962864 <ptmalloc_lock_all>, parent=0x40966f80 <ptmalloc_unlock_all>, child=0x40967070 <ptmalloc_init_all>) at ptfork.c:60
#3 0x40962a6d in ptmalloc_init () at malloc.c:1716
#4 0x40967184 in malloc_hook_ini (sz=24, caller=0x40a4f13e) at malloc.c:1765
#5 0x4096309d in __libc_malloc (bytes=24) at malloc.c:2701
#6 0x40a4f13e in __pthread_atfork (prepare=0, parent=0, child=0x404eafc0 <sqlo_child_reset_pid(void)>) at ptfork.c:57
#7 0x404eb012 in sqlo_init_pid () at eval.c:41
#8 0x404eb054 in global constructors keyed to waste_time () at eval.c:41
#9 0x4055ee34 in __do_global_ctors_aux () at eval.c:41
#10 0x400e4dc6 in _init () at eval.c:41
#11 0x4000df57 in _dl_init () at eval.c:41

It is probably easier for you to reproduce the error here, since this environment is less complex than a SAP system.
The workaround with LD_ASSUME_KERNEL=2.2.5 works.
Can you try 2.2.8 kernel or at least the changes to arch/i386/kernel/{ldt,process}.c and include/asm-i386/{mmu,mmu_context}.h? It contains an important SMP LDT handling fix.
I suppose you mean kernel 2.4.8 (as far as I know, no 2.2.x kernel is supported on RH 7.1). I downloaded 2.4.8 (Linus version) and installed it. The problem did not occur.
Typo, sorry. If you could test ftp://people.redhat.com/jakub/kernel/ 2.4.7-2 too, it would be great, but I hope it is this patch which matters.
I upgraded my RH 7.1 to kernel 2.4.7-2enterprise (including the required mkinitrd and e2fsprogs, which I got with roswell). I reran a complete SAP installation without problems. The core dump problem did not occur.