From Bugzilla Helper:
User-Agent: Mozilla/4.7 [en]C-AOLNSCP (WinNT; U)

Description of problem:
I filed this bug as libc/2450 at http://bugs.gnu.org/cgi-bin/gnatsweb.pl. Since we are using a Red Hat product, I thought I should also file a bug here. The text below is copied from the libc/2450 bug report.

Our project is a set of C libraries and command-line tools. Although we do not use the -pthread flag, all the C files are compiled with -D_REENTRANT and all the executables are linked with -lpthread before any other system libraries.

We are seeing intermittent segmentation faults when running our command-line tools or test programs on Red Hat Linux 7.1. (Both of our Red Hat Linux 7.1 systems are dual-processor machines.) A core dump occurs in roughly 1 to 2 out of every 10 runs. Almost all of the core files have a stack trace that looks like this:

(gdb) where
#0  __errno_location () at errno.c:25
#1  0x4016a484 in __socket () from /lib/i686/libc.so.6
#2  0x4002f5de in _pr_init_ipv6 () at pripv6.c:330
#3  0x4003e11f in _PR_InitStuff () at prinit.c:244
#4  0x4003e137 in _PR_ImplicitInitialization () at prinit.c:252
#5  0x4003e174 in PR_Init (type=PR_SYSTEM_THREAD, priority=PR_PRIORITY_NORMAL,
    maxPTDs=1) at prinit.c:303
#6  0x0805188a in main (argc=6, argv=0xbffff54c) at certutil.c:2468
#7  0x4009f177 in __libc_start_main (main=0x8050ba8 <main>, argc=6,
    ubp_av=0xbffff54c, init=0x804be8c <_init>, fini=0x80eb140 <_fini>,
    rtld_fini=0x4000e184 <_dl_fini>, stack_end=0xbffff53c)
    at ../sysdeps/generic/libc-start.c:129
(gdb)

A note on the validity of the core files: although the command-line tools that crashed were compiled with -D_REENTRANT and linked with -lpthread, they do not create any threads.
The output of the 'ldd' command on one of the executables that crashed shows that we link with -lpthread before -lc:

% ldd certutil
        libplc4.so => not found
        libplds4.so => not found
        libnspr4.so => not found
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40021000)
        libdl.so.2 => /lib/libdl.so.2 (0x40036000)
        libc.so.6 => /lib/i686/libc.so.6 (0x4003a000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

How reproducible:
Sometimes

Steps to Reproduce:
I don't have a small test program that reproduces the problem, so I am afraid you will need to build our project and run our tests. Our source code can be downloaded as a gzipped tar file from ftp://ftp.mozilla.org/pub/security/nss/releases/NSS_3_3_RTM/src/nss-3.3.tar.gz.

To build it, unpack the tar file and do:
% cd mozilla/security/nss
% make nss_build_all

To run the tests, follow these steps:
% cd mozilla/security/nss/tests
% ./all.sh

The test results will be in mozilla/tests_results/security/<host>.<n>, where <host> is the host name and <n> is 1, 2, ... indicating the n-th run of ./all.sh. There are two files in the <host>.<n> directory that you need to look at: results.html and output.log. If any of the tests crashes, the core file will be found under a subdirectory of <host>.<n>. Because the crash is intermittent, you may need to run ./all.sh repeatedly to trigger a segmentation fault. The test executables are in mozilla/dist/Linux2.4_x86_glibc_PTH_DBG.OBJ/bin.

Actual Results:
Intermittent segmentation faults from command-line tools such as certutil or pk12util.

Expected Results:
All tests should pass. The results.html files under mozilla/tests_results/security/<host>.<n> should show all "Passed" in green.

Additional info:
OS: Red Hat Linux 7.1
Kernel: 2.4.2-2smp on a 2-processor i686
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-81)
glibc version 2.2.2
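Since the crash only happens in roughly 1-2 out of 10 runs, rerunning the test driver by hand gets tedious. A small driver loop like the following can automate it (a sketch of my own; the function name and the 20-run cap are made up, and ./all.sh is the NSS test driver mentioned above):

```shell
#!/bin/sh
# Hypothetical helper (not part of NSS): rerun a command until it dies
# on a signal (shell exit status >= 128, e.g. 139 = 128 + SIGSEGV) or
# until max_runs attempts have been made.
run_until_crash() {
    cmd="$1"
    max_runs="$2"
    i=1
    while [ "$i" -le "$max_runs" ]; do
        eval "$cmd"
        status=$?
        if [ "$status" -ge 128 ]; then
            echo "crashed on run $i (exit status $status)"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no crash in $max_runs runs"
    return 1
}

# Example use against the NSS test suite:
# run_until_crash ./all.sh 20
```

With ulimit -c unlimited set beforehand, the core file from the crashing run is left under the corresponding <host>.<n> results directory.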
__errno_location should fail only if %gs is mucked with.

a) Can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in the environment?
b) Does the program ever do anything with the %gs register, or call the modify_ldt syscall?
c) When it crashes, what value do you see in %gs?
Jakub wrote:
> a) can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in environment?

I'll give this a try and let you know.

> b) does that program ever do anything with %gs register, or call modify_ldt
> syscall?

No.

> c) when it crashes, what value do you see in %gs?

0x2b. I have not tried to run it under the debugger, so I don't know whether it would crash that way too. Here is the output of the 'info reg' command on two of the core files:

(gdb) info reg
eax            0xffffff9f       -97
ecx            0xbffff370       -1073745040
edx            0x61     97
ebx            0x401a16d8       1075451608
esp            0xbffff354       0xbffff354
ebp            0xbffff35c       0xbffff35c
esi            0x40016b64       1073834852
edi            0xbffff4dc       -1073744676
eip            0x40196e16       0x40196e16
eflags         0x10206  66054
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x2b     43
gs             0x2b     43
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0

(gdb) info reg
eax            0xffffff9f       -97
ecx            0xbffff500       -1073744640
edx            0x61     97
ebx            0x4007e6d8       1074259672
esp            0xbffff4e4       0xbffff4e4
ebp            0xbffff4ec       0xbffff4ec
esi            0x40016b64       1073834852
edi            0xbffff6ac       -1073744212
eip            0x40073e16       0x40073e16
eflags         0x10206  66054
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x2b     43
gs             0x2b     43
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
Then the question is where %gs got that value. If you start a program linked against -lpthread, %gs should contain 0x7 (the value for the initial thread); later on it can contain other values for other threads, but never 0x2b.
It seems that the 0x2b value in %gs in the core files cannot be trusted. For example, I have a simple test program that saves the value of %gs in __gs and then dereferences a null pointer:

% cat foo.c
int main()
{
    int *p = 0;
    unsigned int __gs;
    asm ("mov %%gs, %0" : "=r" (__gs));
    *p = 1; /* crash */
    return 0;
}

I build it with -pthread and confirm that it is linked with -lpthread:

% gcc -g -pthread foo.c -o foo
% ldd foo
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40021000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40036000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
% ./foo
Segmentation fault (core dumped)

Invoking the debugger on the core file, I see 0x7 in __gs but 0x2b in %gs:

% gdb foo core
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `./foo'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/i686/libpthread.so.0...done.
warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 31416)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 31416): generic error
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x08048457 in main () at foo.c:7
7           *p = 1; /* crash */
(gdb) p/x __gs
$1 = 0x7
(gdb) p/x $gs
$2 = 0x2b
(gdb)

However, if I run the test program inside the debugger, I see 0x7 in both __gs and %gs:

% gdb foo
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) run
Starting program: /u/wtc/nss-3.3/box-build-all/foo
[New Thread 1024 (LWP 31421)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 31421)]
0x08048457 in main () at foo.c:7
7           *p = 1; /* crash */
(gdb) p/x __gs
$1 = 0x7
(gdb) p/x $gs
$2 = 0x7
(gdb)
I noticed that the top of the stack trace in the core files doesn't look right. The five function calls at the top are:

#0  __errno_location () at errno.c:25
#1  0x4016a484 in __socket () from /lib/i686/libc.so.6
#2  0x4002f5de in _pr_init_ipv6 () at pripv6.c:330
#3  0x4003e11f in _PR_InitStuff () at prinit.c:244
#4  0x4003e137 in _PR_ImplicitInitialization () at prinit.c:252

Between __socket () and _pr_init_ipv6 (), we should see socket () and _pr_test_ipv6_socket (). I reproduce the relevant source files here:

--- pripv6.c ---
329
330         _pr_ipv6_is_present = _pr_test_ipv6_socket();
331         if (PR_TRUE == _pr_ipv6_is_present)
332             return PR_SUCCESS;

--- ptio.c ---
3103    PR_IMPLEMENT(PRBool) _pr_test_ipv6_socket()
3104    {
3105        PRInt32 osfd;
3106
3107        osfd = socket(AF_INET6, SOCK_STREAM, 0);
3108        if (osfd != -1) {
3109            close(osfd);
3110            return PR_TRUE;
3111        }
3112        return PR_FALSE;
3113    }

We are crashing intermittently inside the socket () call in _pr_test_ipv6_socket () at ptio.c:3107, but socket () and _pr_test_ipv6_socket () do not appear in the stack trace of the core files. This is weird.

As you suggested, I added the following assertion before and after that socket () call at ptio.c:3107:

    {
        unsigned int __gs;
        __asm ("mov %%gs, %0" : "=r" (__gs));
        assert(__gs == 0x7);
    }

Our tests still crash intermittently inside that socket () call at ptio.c:3107, but the assertions never fail. This means %gs is always 0x7 before we enter that socket () call. Therefore, if __errno_location () can only crash because of a bad value in %gs, it must be socket () or some other function in libc.so.6 called by socket () that mucks with %gs. (The tests that crash this way do not create any threads, even though they are linked with -lpthread.)
Setting LD_ASSUME_KERNEL to 2.2.5 works: I ran our tests over a hundred times without any core dumps. I will look into trying the latest kernel. Is there anything else I can do to help investigate this bug?
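For anyone else hitting this, the workaround can be applied to a whole test run like so (a sketch; 2.2.5 is the value Jakub suggested, which makes the dynamic linker behave as on a 2.2.5 kernel and thereby avoid the /lib/i686 optimized libpthread/libc variants):

```shell
# Workaround sketch: export LD_ASSUME_KERNEL before starting the tests.
LD_ASSUME_KERNEL=2.2.5
export LD_ASSUME_KERNEL
echo "LD_ASSUME_KERNEL=$LD_ASSUME_KERNEL"
# ./all.sh    # the NSS test driver would now run with the workaround active
```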
Can you try the 2.4.7-2 kernels from rawhide? I believe they should fix the bug.
Jakub, I still haven't been able to try the 2.4.7-2 kernel that you suggested. Red Hat Linux 7.2 is now released. I learned from Red Hat's web site that Red Hat Linux 7.2 uses the 2.4.7 kernel. What is the relation between the 2.4.7 and 2.4.7-2 kernels? In particular, does the kernel in Red Hat Linux 7.2 fix this bug?
Jakub, I finally have a dual-CPU PC at my disposal. I've put stock Red Hat Linux 7.1 on it and reproduced the core dump. Now I'm ready to try a new kernel to see if it fixes this bug. Could you tell me which kernel I should try (on a 7.1 machine) and where I can download the RPM from? Thanks.
Try the ones in http://www.redhat.com/support/errata/RHSA-2001-142.html for 7.1.
I have good news to report. After installing kernel-smp-2.4.9-12 on my Red Hat Linux 7.1 box (2-processor i686), I haven't been able to reproduce this crash.

Jakub, what is the kernel bug that causes these crashes? In http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.8, I found this item:

pre1:
...
 - James Washer: LDT loading SMP bug fix

Is this the one? Is this bug fixed in the 2.4.7-2 kernels (which you originally told me to try)?