Bug 50661
Summary: | Intermittent crashes in __errno_location() on Red Hat Linux 7.1. | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Wan-Teh Chang <wtc> |
Component: | kernel | Assignee: | Jakub Jelinek <jakub> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Aaron Brown <abrown> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.1 | CC: | mcs |
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2002-12-15 17:22:35 UTC | Type: | --- |
Description
Wan-Teh Chang
2001-08-01 22:17:37 UTC
__errno_location should fail only if %gs is mucked with.
a) can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in environment?
b) does that program ever do anything with %gs register, or call modify_ldt syscall?
c) when it crashes, what value do you see in %gs?

Jakub wrote:
>
> a) can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in environment?

I'll give this a try and let you know.

> b) does that program ever do anything with %gs register, or call modify_ldt
> syscall?

No.

> c) when it crashes, what value do you see in %gs?

0x2b. I have not tried to run it under the debugger, so I don't know if it would crash that way too. Here is the output of the 'info reg' command on two of the core files:

    (gdb) info reg
    eax            0xffffff9f       -97
    ecx            0xbffff370       -1073745040
    edx            0x61             97
    ebx            0x401a16d8       1075451608
    esp            0xbffff354       0xbffff354
    ebp            0xbffff35c       0xbffff35c
    esi            0x40016b64       1073834852
    edi            0xbffff4dc       -1073744676
    eip            0x40196e16       0x40196e16
    eflags         0x10206          66054
    cs             0x23             35
    ss             0x2b             43
    ds             0x2b             43
    es             0x2b             43
    fs             0x2b             43
    gs             0x2b             43
    fctrl          0x0              0
    fstat          0x0              0
    ftag           0x0              0
    fiseg          0x0              0
    fioff          0x0              0
    foseg          0x0              0
    fooff          0x0              0
    fop            0x0              0

    (gdb) info reg
    eax            0xffffff9f       -97
    ecx            0xbffff500       -1073744640
    edx            0x61             97
    ebx            0x4007e6d8       1074259672
    esp            0xbffff4e4       0xbffff4e4
    ebp            0xbffff4ec       0xbffff4ec
    esi            0x40016b64       1073834852
    edi            0xbffff6ac       -1073744212
    eip            0x40073e16       0x40073e16
    eflags         0x10206          66054
    cs             0x23             35
    ss             0x2b             43
    ds             0x2b             43
    es             0x2b             43
    fs             0x2b             43
    gs             0x2b             43
    fctrl          0x0              0
    fstat          0x0              0
    ftag           0x0              0
    fiseg          0x0              0
    fioff          0x0              0
    foseg          0x0              0
    fooff          0x0              0
    fop            0x0              0

Then the thing is where %gs got this value from. If you start a program linked against -lpthread, it should have 0x7 in %gs (that's for the initial thread), later on it can contain other values for other threads, but never 0x2b.

It seems that the 0x2b value in %gs in the core files cannot be trusted.
For example, I have a simple test program that saves the value of %gs in __gs and then dereferences a null pointer:

    % cat foo.c
    int main()
    {
        int *p = 0;
        unsigned int __gs;

        asm ("mov %%gs, %0" : "=r" (__gs));  /* save %gs so the debugger can print it */
        *p = 1; /* crash */
        return 0;
    }

I build it with -pthread and confirm that it is linked with -lpthread:

    % gcc -g -pthread foo.c -o foo
    % ldd foo
            libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40021000)
            libc.so.6 => /lib/i686/libc.so.6 (0x40036000)
            /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
    % ./foo
    Segmentation fault (core dumped)

Invoking the debugger on the core file, I see 0x7 in __gs but 0x2b in %gs:

    % gdb foo core
    GNU gdb 5.0rh-5 Red Hat Linux 7.1
    Copyright 2001 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-redhat-linux"...
    Core was generated by `./foo'.
    Program terminated with signal 11, Segmentation fault.
    Reading symbols from /lib/i686/libpthread.so.0...done.
    warning: Unable to set global thread event mask: generic error
    [New Thread 1024 (LWP 31416)]
    Error while reading shared library symbols:
    Cannot enable thread event reporting for Thread 1024 (LWP 31416): generic error
    Reading symbols from /lib/i686/libc.so.6...done.
    Loaded symbols for /lib/i686/libc.so.6
    Reading symbols from /lib/ld-linux.so.2...done.
    Loaded symbols for /lib/ld-linux.so.2
    #0  0x08048457 in main () at foo.c:7
    7           *p = 1; /* crash */
    (gdb) p/x __gs
    $1 = 0x7
    (gdb) p/x $gs
    $2 = 0x2b
    (gdb)

However, if I run the test program inside the debugger, I see 0x7 in both __gs and %gs:

    % gdb foo
    GNU gdb 5.0rh-5 Red Hat Linux 7.1
    Copyright 2001 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-redhat-linux"...
    (gdb) run
    Starting program: /u/wtc/nss-3.3/box-build-all/foo
    [New Thread 1024 (LWP 31421)]

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 1024 (LWP 31421)]
    0x08048457 in main () at foo.c:7
    7           *p = 1; /* crash */
    (gdb) p/x __gs
    $1 = 0x7
    (gdb) p/x $gs
    $2 = 0x7
    (gdb)

I noticed that the top of the stack trace of the core files doesn't look right. The five function calls at the top are listed here.

    #0  __errno_location () at errno.c:25
    #1  0x4016a484 in __socket () from /lib/i686/libc.so.6
    #2  0x4002f5de in _pr_init_ipv6 () at pripv6.c:330
    #3  0x4003e11f in _PR_InitStuff () at prinit.c:244
    #4  0x4003e137 in _PR_ImplicitInitialization () at prinit.c:252

Between __socket () and _pr_init_ipv6 (), we should have socket () and _pr_test_ipv6_socket (). I reproduce the relevant source files here.

    --- pripv6.c ---
    329
    330     _pr_ipv6_is_present = _pr_test_ipv6_socket();
    331     if (PR_TRUE == _pr_ipv6_is_present)
    332         return PR_SUCCESS;

    --- ptio.c ---
    3103 PR_IMPLEMENT(PRBool) _pr_test_ipv6_socket()
    3104 {
    3105     PRInt32 osfd;
    3106
    3107     osfd = socket(AF_INET6, SOCK_STREAM, 0);
    3108     if (osfd != -1) {
    3109         close(osfd);
    3110         return PR_TRUE;
    3111     }
    3112     return PR_FALSE;
    3113 }

We are crashing intermittently inside the socket () call in the _pr_test_ipv6_socket () function at ptio.c:3107, but socket () and _pr_test_ipv6_socket () are not in the stack trace of the core files. This is weird.

As you suggested, I added the following assertion before and after that socket () call at ptio.c:3107.
    {
        unsigned int __gs;
        __asm ("mov %%gs, %0" : "=r" (__gs));
        assert(__gs == 0x7);
    }

Our tests still crash intermittently inside that socket () call at ptio.c:3107, but the assertions never fail. This means %gs is always 0x7 before we enter that socket () call. Therefore, if __errno_location () can only crash because of a bad value in %gs, I conclude that it must be socket () or some other function in libc.so.6 called by socket () that mucks with %gs. (The tests that crash this way do not create any threads even though they are linked with -lpthread.)

Setting LD_ASSUME_KERNEL to 2.2.5 works. I ran our tests over a hundred times without any core dumps. I will look into trying the latest kernel. Is there anything else I can do to help investigate this bug?

Can you try 2.4.7-2 kernels from rawhide? I believe it should fix the bug.

Jakub, I still haven't been able to try the 2.4.7-2 kernel that you suggested. Red Hat Linux 7.2 is now released. I learned from Red Hat's web site that Red Hat Linux 7.2 uses the 2.4.7 kernel. What is the relation between the 2.4.7 and 2.4.7-2 kernels? In particular, does the kernel in Red Hat Linux 7.2 fix this bug?

Jakub, I finally have a dual-CPU PC at my disposal. I've put stock Red Hat Linux 7.1 on it and reproduced the core dump. Now I'm ready to try a new kernel to see if it fixes this bug. Could you tell me which kernel I should try (on a 7.1 machine) and where I can download the RPM from? Thanks.

Try the ones in http://www.redhat.com/support/errata/RHSA-2001-142.html for 7.1.

I have good news to report. After installing kernel-smp-2.4.9-12 on my Red Hat Linux 7.1 box (2-processor i686), I haven't been able to reproduce this crash.

Jakub, what is the kernel bug that causes these crashes? In http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.8, I found this item:

    pre1:
    ...
     - James Washer: LDT loading SMP bug fix

Is this the one? Is this bug fixed in the 2.4.7-2 kernels (which you told me to try originally)?
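
For anyone who wants to repeat the %gs check used in this report outside of NSPR, the following is a minimal sketch of it as a standalone helper. It is not the code that was added to ptio.c. The names read_gs and gs_survives_socket_probe are hypothetical, and the helper compares %gs before and after the socket () call instead of asserting the value 0x7, on the assumption that the expected selector differs between systems and thread libraries. It assumes GCC-style inline asm on x86 or x86-64.

```c
#include <sys/socket.h>
#include <unistd.h>

#if !defined(__i386__) && !defined(__x86_64__)
#error "This sketch reads the x86 %gs segment register."
#endif

/* Hypothetical helper: return the current %gs segment selector. */
static unsigned int read_gs(void)
{
    unsigned int gs;

    /* "memory" keeps the compiler from moving the read across the
       socket() call in the caller. */
    __asm__ volatile ("mov %%gs, %0" : "=r" (gs) : : "memory");
    return gs & 0xffffu;    /* a selector is 16 bits wide */
}

/*
 * Hypothetical helper: issue the same socket() probe that
 * _pr_test_ipv6_socket() makes and report whether %gs held the same
 * selector before and after the call.
 * Returns 1 if %gs is unchanged, 0 if it changed.
 */
static int gs_survives_socket_probe(void)
{
    unsigned int before = read_gs();
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    unsigned int after = read_gs();

    if (fd != -1)
        close(fd);
    return before == after;
}
```

Calling gs_survives_socket_probe () in a loop from a small main () gives a rough way to check the selector around the call. As the assertions in this report show, a check of this kind can pass on every run while the crash still occurs inside socket (), so a passing result does not rule out the problem.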