From Bugzilla Helper:
User-Agent: Mozilla/4.7 [en]C-AOLNSCP (WinNT; U)

Description of problem:
I filed this bug as libc/2450 at http://bugs.gnu.org/cgi-bin/gnatsweb.pl. Since we are using a Red Hat product, I thought I should also file a bug here. The text below is copied from the libc/2450 bug report.

Our project is a set of C libraries and command-line tools. Although we do not use the -pthread flag, all the C files are compiled with -D_REENTRANT and all the executables are linked with -lpthread before any other system libraries.

We are seeing intermittent segmentation faults when running our command-line tools or test programs on Red Hat Linux 7.1. (Both of our Red Hat Linux 7.1 systems are dual-processor machines.) A core dump occurs in roughly 1 to 2 out of every 10 runs. Almost all of the core files have a stack trace that looks like this:

(gdb) where
#0  __errno_location () at errno.c:25
#1  0x4016a484 in __socket () from /lib/i686/libc.so.6
#2  0x4002f5de in _pr_init_ipv6 () at pripv6.c:330
#3  0x4003e11f in _PR_InitStuff () at prinit.c:244
#4  0x4003e137 in _PR_ImplicitInitialization () at prinit.c:252
#5  0x4003e174 in PR_Init (type=PR_SYSTEM_THREAD, priority=PR_PRIORITY_NORMAL,
    maxPTDs=1) at prinit.c:303
#6  0x0805188a in main (argc=6, argv=0xbffff54c) at certutil.c:2468
#7  0x4009f177 in __libc_start_main (main=0x8050ba8 <main>, argc=6,
    ubp_av=0xbffff54c, init=0x804be8c <_init>, fini=0x80eb140 <_fini>,
    rtld_fini=0x4000e184 <_dl_fini>, stack_end=0xbffff53c)
    at ../sysdeps/generic/libc-start.c:129
(gdb)

A note on the validity of the core files: although the command-line tools that crashed were compiled with -D_REENTRANT and linked with -lpthread, they do not create any threads.
The output of the 'ldd' command on one of the executables that crashed shows that we link with -lpthread before -lc:

% ldd certutil
        libplc4.so => not found
        libplds4.so => not found
        libnspr4.so => not found
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40021000)
        libdl.so.2 => /lib/libdl.so.2 (0x40036000)
        libc.so.6 => /lib/i686/libc.so.6 (0x4003a000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

How reproducible:
Sometimes

Steps to Reproduce:
I don't have a small test program that reproduces the problem, so I am afraid you will need to build our project and run our tests. Our source code can be downloaded as a gzipped tar file from ftp://ftp.mozilla.org/pub/security/nss/releases/NSS_3_3_RTM/src/nss-3.3.tar.gz.

To build it, unpack the tar file and do:
% cd mozilla/security/nss
% make nss_build_all

To run the tests, follow these steps:
% cd mozilla/security/nss/tests
% ./all.sh

The test results will be in mozilla/tests_results/security/<host>.<n>, where <host> is the host name and <n> is 1, 2, ... indicating the n-th run of ./all.sh. There are two files in the <host>.<n> directory that you need to look at: results.html and output.log. If any of the tests crashes, the core file will be found under a subdirectory of <host>.<n>. Because the crash is intermittent, you may need to run ./all.sh repeatedly to trigger a segmentation fault. The test executables are in mozilla/dist/Linux2.4_x86_glibc_PTH_DBG.OBJ/bin.

Actual Results:
Intermittent segmentation faults from command-line tools such as certutil or pk12util.

Expected Results:
All tests should pass. The results.html files under mozilla/tests_results/security/<host>.<n> should show all "Passed" in green.

Additional info:
OS: Red Hat Linux 7.1
Kernel: 2.4.2-2smp on a 2-processor i686
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-81)
glibc version 2.2.2
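Since the crash only happens in roughly 1-2 out of 10 runs, rerunning the test driver by hand gets tedious. A small driver loop like the following can automate it (a sketch of my own; the function name and the 20-run cap are made up, and ./all.sh is the NSS test driver mentioned above):

```shell
#!/bin/sh
# Hypothetical helper (not part of NSS): rerun a command until it dies
# on a signal (shell exit status >= 128, e.g. 139 = 128 + SIGSEGV) or
# until max_runs attempts have been made.
run_until_crash() {
    cmd="$1"
    max_runs="$2"
    i=1
    while [ "$i" -le "$max_runs" ]; do
        eval "$cmd"
        status=$?
        if [ "$status" -ge 128 ]; then
            echo "crashed on run $i (exit status $status)"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no crash in $max_runs runs"
    return 1
}

# Example use against the NSS test suite:
# run_until_crash ./all.sh 20
```

With ulimit -c unlimited set beforehand, the core file from the crashing run is left under the corresponding <host>.<n> results directory.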
__errno_location should fail only if %gs is mucked with.

a) Can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in the environment?
b) Does the program ever do anything with the %gs register, or call the modify_ldt syscall?
c) When it crashes, what value do you see in %gs?
Jakub wrote:
> a) can you reproduce it with LD_ASSUME_KERNEL=2.2.5 set in environment?

I'll give this a try and let you know.

> b) does that program ever do anything with %gs register, or call modify_ldt
> syscall?

No.

> c) when it crashes, what value do you see in %gs?

0x2b. I have not tried to run it under the debugger, so I don't know whether it would crash that way too. Here is the output of the 'info reg' command on two of the core files:

(gdb) info reg
eax            0xffffff9f       -97
ecx            0xbffff370       -1073745040
edx            0x61     97
ebx            0x401a16d8       1075451608
esp            0xbffff354       0xbffff354
ebp            0xbffff35c       0xbffff35c
esi            0x40016b64       1073834852
edi            0xbffff4dc       -1073744676
eip            0x40196e16       0x40196e16
eflags         0x10206  66054
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x2b     43
gs             0x2b     43
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0

(gdb) info reg
eax            0xffffff9f       -97
ecx            0xbffff500       -1073744640
edx            0x61     97
ebx            0x4007e6d8       1074259672
esp            0xbffff4e4       0xbffff4e4
ebp            0xbffff4ec       0xbffff4ec
esi            0x40016b64       1073834852
edi            0xbffff6ac       -1073744212
eip            0x40073e16       0x40073e16
eflags         0x10206  66054
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x2b     43
gs             0x2b     43
fctrl          0x0      0
fstat          0x0      0
ftag           0x0      0
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
Then the question is where %gs got that value. If you start a program linked against -lpthread, %gs should contain 0x7 (the value for the initial thread); later on it can contain other values for other threads, but never 0x2b.
It seems that the 0x2b value in %gs in the core files cannot be trusted. For example, I have a simple test program that saves the value of %gs in __gs and then dereferences a null pointer:

% cat foo.c
int main()
{
    int *p = 0;
    unsigned int __gs;
    asm ("mov %%gs, %0" : "=r" (__gs));
    *p = 1; /* crash */
    return 0;
}

I build it with -pthread and confirm that it is linked with -lpthread:

% gcc -g -pthread foo.c -o foo
% ldd foo
        libpthread.so.0 => /lib/i686/libpthread.so.0 (0x40021000)
        libc.so.6 => /lib/i686/libc.so.6 (0x40036000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
% ./foo
Segmentation fault (core dumped)

Invoking the debugger on the core file, I see 0x7 in __gs but 0x2b in %gs:

% gdb foo core
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `./foo'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/i686/libpthread.so.0...done.
warning: Unable to set global thread event mask: generic error
[New Thread 1024 (LWP 31416)]
Error while reading shared library symbols:
Cannot enable thread event reporting for Thread 1024 (LWP 31416): generic error
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x08048457 in main () at foo.c:7
7           *p = 1; /* crash */
(gdb) p/x __gs
$1 = 0x7
(gdb) p/x $gs
$2 = 0x2b
(gdb)

However, if I run the test program inside the debugger, I see 0x7 in both __gs and %gs:

% gdb foo
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) run
Starting program: /u/wtc/nss-3.3/box-build-all/foo
[New Thread 1024 (LWP 31421)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 31421)]
0x08048457 in main () at foo.c:7
7           *p = 1; /* crash */
(gdb) p/x __gs
$1 = 0x7
(gdb) p/x $gs
$2 = 0x7
(gdb)
I noticed that the top of the stack trace in the core files doesn't look right. The five function calls at the top are:

#0  __errno_location () at errno.c:25
#1  0x4016a484 in __socket () from /lib/i686/libc.so.6
#2  0x4002f5de in _pr_init_ipv6 () at pripv6.c:330
#3  0x4003e11f in _PR_InitStuff () at prinit.c:244
#4  0x4003e137 in _PR_ImplicitInitialization () at prinit.c:252

Between __socket () and _pr_init_ipv6 (), we should see socket () and _pr_test_ipv6_socket (). I reproduce the relevant source files here:

--- pripv6.c ---
329
330         _pr_ipv6_is_present = _pr_test_ipv6_socket();
331         if (PR_TRUE == _pr_ipv6_is_present)
332             return PR_SUCCESS;

--- ptio.c ---
3103    PR_IMPLEMENT(PRBool) _pr_test_ipv6_socket()
3104    {
3105        PRInt32 osfd;
3106
3107        osfd = socket(AF_INET6, SOCK_STREAM, 0);
3108        if (osfd != -1) {
3109            close(osfd);
3110            return PR_TRUE;
3111        }
3112        return PR_FALSE;
3113    }

We are crashing intermittently inside the socket () call in _pr_test_ipv6_socket () at ptio.c:3107, but socket () and _pr_test_ipv6_socket () do not appear in the stack trace of the core files. This is weird.

As you suggested, I added the following assertion before and after that socket () call at ptio.c:3107:

    {
        unsigned int __gs;
        __asm ("mov %%gs, %0" : "=r" (__gs));
        assert(__gs == 0x7);
    }

Our tests still crash intermittently inside that socket () call at ptio.c:3107, but the assertions never fail. This means %gs is always 0x7 before we enter that socket () call. Therefore, if __errno_location () can only crash because of a bad value in %gs, it must be socket () or some other function in libc.so.6 called by socket () that mucks with %gs. (The tests that crash this way do not create any threads, even though they are linked with -lpthread.)
Setting LD_ASSUME_KERNEL to 2.2.5 works: I ran our tests over a hundred times without any core dumps. I will look into trying the latest kernel. Is there anything else I can do to help investigate this bug?
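For anyone else hitting this, the workaround can be applied to a whole test run like so (a sketch; 2.2.5 is the value Jakub suggested, which makes the dynamic linker behave as on a 2.2.5 kernel and thereby avoid the /lib/i686 optimized libpthread/libc variants):

```shell
# Workaround sketch: export LD_ASSUME_KERNEL before starting the tests.
LD_ASSUME_KERNEL=2.2.5
export LD_ASSUME_KERNEL
echo "LD_ASSUME_KERNEL=$LD_ASSUME_KERNEL"
# ./all.sh    # the NSS test driver would now run with the workaround active
```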
Can you try the 2.4.7-2 kernels from rawhide? I believe they should fix the bug.
Jakub, I still haven't been able to try the 2.4.7-2 kernel that you suggested. Red Hat Linux 7.2 is now released. I learned from Red Hat's web site that Red Hat Linux 7.2 uses the 2.4.7 kernel. What is the relation between the 2.4.7 and 2.4.7-2 kernels? In particular, does the kernel in Red Hat Linux 7.2 fix this bug?
Jakub, I finally have a dual-CPU PC at my disposal. I've put stock Red Hat Linux 7.1 on it and reproduced the core dump. Now I'm ready to try a new kernel to see if it fixes this bug. Could you tell me which kernel I should try (on a 7.1 machine) and where I can download the RPM from? Thanks.
Try the ones in http://www.redhat.com/support/errata/RHSA-2001-142.html for 7.1.
I have good news to report. After installing kernel-smp-2.4.9-12 on my Red Hat Linux 7.1 box (2-processor i686), I haven't been able to reproduce this crash.

Jakub, what is the kernel bug that causes these crashes? In http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.8, I found this item:

pre1:
...
 - James Washer: LDT loading SMP bug fix

Is this the one? Is this bug fixed in the 2.4.7-2 kernels (which you originally told me to try)?