Bug 155278 - Debugger killed by kernel when looking at the lowest addressed vmalloc page
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: ia64 Linux
Priority: medium
Severity: high
Assigned To: Jason Baron
QA Contact: Brian Brock
Blocks: 156322
 
Reported: 2005-04-18 15:01 EDT by Chris Gottbrath
Modified: 2013-03-06 00:58 EST

Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Last Closed: 2005-10-05 09:01:22 EDT

Attachments
1 of 2 trace files (235.89 KB, text/plain)
2005-04-18 15:03 EDT, Chris Gottbrath
strace file 2 of 2 (781.66 KB, text/plain)
2005-04-18 15:04 EDT, Chris Gottbrath
restrict in_gate_area() (1.32 KB, patch)
2005-04-22 15:39 EDT, Jason Baron
test program (936 bytes, text/plain)
2005-04-26 22:34 EDT, Jason Baron
map holes ot 0 (1.12 KB, patch)
2005-06-08 11:50 EDT, Jason Baron

Description Chris Gottbrath 2005-04-18 15:01:50 EDT
Description of problem:

On an ia64 machine running a kernel based on 2.6.9-5.0.3, the
TotalView debugger crashes when you try to debug a hello world program.

We have had two separate user reports of this problem.

One was running version 2.6.9-5.0.3.EL the other was running 2.6.9-5.0.3.101.EC.
Both produce strace files exactly like this:

ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000000000, NULL) = 282584257676671
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000000000, NULL) = 282584257676671
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000001000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000002000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000003000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000004000,  <unfinished ...>
+++ killed by SIGSEGV +++

We saw the following in /var/log/messages on kernel version 2.6.9-5.0.3.101.EC:

Apr 15 13:59:04 shannon kernel:  kernel BUG at mm/memory.c:816!
Apr 15 13:59:04 shannon kernel: tv6main[3672]: bugcheck! 0 [34]
Apr 15 13:59:04 shannon kernel: Modules linked in: ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables nfs nfsd exportfs lockd md5 ipv6 parport_pc lp parport autofs4 gm(U) sunrpc ds yenta_socket pcmcia_core vfat fat dm_mod button joydev ohci_hcd ehci_hcd e100 mii tg3 ext3 jbd mptscsih mptbase sd_mod scsi_mod
Apr 15 13:59:04 shannon kernel:
Apr 15 13:59:04 shannon kernel: Pid: 3672, CPU 0, comm: tv6main
Apr 15 13:59:04 shannon kernel: psr : 0000101008126030 ifs : 80000000000010a8 ip  : [<a0000001000e89c0>]    Tainted: PF
Apr 15 13:59:04 shannon kernel: ip is at get_user_pages+0xb40/0xbe0
Apr 15 13:59:04 shannon kernel: unat: 0000000000000000 pfs : 00000000000010a8 rsc : 0000000000000003
Apr 15 13:59:04 shannon kernel: rnat: e0000040f8cb8000 bsps: e0000040f8cbfb60 pr  : 000000000569a969
Apr 15 13:59:04 shannon kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
Apr 15 13:59:04 shannon kernel: csd : 0000000000000000 ssd : 0000000000000000
Apr 15 13:59:04 shannon kernel: b0  : a0000001000e89c0 b6  : a000000100015c20 b7  : a000000100237a80
Apr 15 13:59:04 shannon kernel: f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
Apr 15 13:59:04 shannon kernel: f8  : 1003e00000000000023dc f9  : 1003e000000000e580000
Apr 15 13:59:04 shannon kernel: f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
Apr 15 13:59:04 shannon kernel: r1  : a00000010096ae80 r2  : 0000000000001000 r3  : 0000000000001000
Apr 15 13:59:04 shannon kernel: r8  : 000000000000001f r9  : 00000000000000fd r10 : a00000010077cb00
Apr 15 13:59:04 shannon kernel: r11 : 0000000000000100 r12 : e00000405190fb60 r13 : e000004051908000
Apr 15 13:59:04 shannon kernel: r14 : 0000000000004000 r15 : a0000001007028c0 r16 : a0000001007028c8
Apr 15 13:59:04 shannon kernel: r17 : e00000003e0efde8 r18 : a000000100797c50 r19 : a000000100797c50
Apr 15 13:59:04 shannon kernel: r20 : 0000000000000004 r21 : 0000000000000000 r22 : 0000000000000000
Apr 15 13:59:04 shannon kernel: r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000004
Apr 15 13:59:04 shannon kernel: r26 : e000000001448dd0 r27 : 0000000000000000 r28 : e000004051908dd4
Apr 15 13:59:04 shannon kernel: r29 : e000000001448dd4 r30 : e00000003e0e802c r31 : e0000000010145f0
Apr 15 13:59:04 shannon kernel:
Apr 15 13:59:04 shannon kernel: Call Trace:
Apr 15 13:59:04 shannon kernel:  [<a000000100016a40>] show_stack+0x80/0xa0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f710 bsp=e000004051909378
Apr 15 13:59:04 shannon kernel:  [<a000000100017350>] show_regs+0x890/0x8c0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f8e0 bsp=e000004051909330
Apr 15 13:59:04 shannon kernel:  [<a00000010003c970>] die+0x150/0x240
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f900 bsp=e0000040519092f0
Apr 15 13:59:04 shannon kernel:  [<a00000010003caa0>] die_if_kernel+0x40/0x60
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f900 bsp=e0000040519092c0
Apr 15 13:59:04 shannon kernel:  [<a00000010003cef0>] ia64_bad_break+0x430/0x4c0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f900 bsp=e000004051909298
Apr 15 13:59:04 shannon kernel:  [<a00000010000f480>] ia64_leave_kernel+0x0/0x260
Apr 15 13:59:04 shannon kernel:      sp=e00000405190f990 bsp=e000004051909298
Apr 15 13:59:04 shannon kernel:  [<a0000001000e89c0>] get_user_pages+0xb40/0xbe0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190fb60 bsp=e000004051909150
Apr 15 13:59:04 shannon kernel:  [<a000000100084990>] access_process_vm+0x130/0x420
Apr 15 13:59:04 shannon kernel:      sp=e00000405190fb90 bsp=e0000040519090b0
Apr 15 13:59:04 shannon kernel:  [<a0000001000300c0>] ia64_peek+0x80/0x4a0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190fbb0 bsp=e000004051909070
Apr 15 13:59:04 shannon kernel:  [<a000000100033960>] sys_ptrace+0x780/0x16a0
Apr 15 13:59:04 shannon kernel:      sp=e00000405190fbc0 bsp=e000004051908fb0
Apr 15 13:59:04 shannon kernel:  [<a00000010000f320>] ia64_ret_from_syscall+0x0/0x20
Apr 15 13:59:04 shannon kernel:      sp=e00000405190fe30 bsp=e000004051908fb0
Apr 15 13:59:04 shannon kernel:  [<a000000000010640>] 0xa000000000010640
Apr 15 13:59:04 shannon kernel:      sp=e000004051910000 bsp=e000004051908fb0

Version-Release number of selected component (if applicable):

kernel-2.6.9-5.0.3.EL


How reproducible:

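1. Compile this simple test code.

/* this is b.c file */
#include <stdio.h>
#include <stdlib.h>    /* declares malloc() */

int main(void)
{
       char *p;

       /* A single heap allocation is enough to reproduce the problem. */
       p = malloc(12000);
       (void)p;        /* the pointer is intentionally unused */
       return 0;
}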


% gcc -g -O0 -L/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/lib \
      -ltvheap -Wl,-rpath,/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/lib \
      -o b b.c


2. run totalview on b

# ./totalviewcli b
Linux IA64 TotalView 6.7.0-2
Copyright 1999-2005 by Etnus, LLC. ALL RIGHTS RESERVED.
Copyright 1999 by Etnus, Inc.
Copyright 1996-1998 by Dolphin Interconnect Solutions, Inc.
Copyright 1989-1996 by BBN Inc.
Reading symbols for process 1, executing "b"
Library /usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b, with 2 asects, was linked at 0x4000000000000000, and initially loaded at 0xff00000040000000
Mapping 593 bytes of ELF string data from '/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b'...done
Skimming 335 bytes of DWARF '.debug_info' symbols from '/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b'...done
Segmentation fault

Actual results:

The debugger is killed with a segfault

Expected results:

The debugger should not have been killed with a segfault

Additional info:
Comment 1 Chris Gottbrath 2005-04-18 15:03:08 EDT
Created attachment 113337
1 of 2 trace files

We have two customer generated strace files for this issue. This is 1 of 2.
Comment 2 Chris Gottbrath 2005-04-18 15:04:13 EDT
Created attachment 113338
strace file 2 of 2

We have two customer-generated strace files that document this problem. This is
2 of 2.
Comment 3 Dave Jones 2005-04-18 15:13:33 EDT
What is the 'gm' module? Does it still happen without that module loaded?
Comment 4 Chris Gottbrath 2005-04-18 15:24:34 EDT
The gm module is the mpich-gm module. It is a special purpose communication
module for the myrinet high speed interconnect.

However -- the test case is basically a hello world (with a malloc) -- so I
doubt that the GM module is needed to reproduce this. 



Comment 5 Chris Gottbrath 2005-04-20 14:19:41 EDT
Dave, 

The gm module is not required to reproduce this. We built a vanilla (newly
installed -- not updated at all) RHEL 4 system and were able to reproduce 
this right away. 

Thanks, 
Chris
Comment 6 Chris Gottbrath 2005-04-20 14:35:57 EDT
Furthermore -- we appear to be able to reproduce this in GDB with the 
following command

print *(long *)0xa000000000004000

Comment 7 Jason Baron 2005-04-20 18:10:42 EDT
ok. I have a good idea what the problem is. I'll post a fix as soon as i can.
thanks.
Comment 8 Chris Gottbrath 2005-04-20 18:13:10 EDT
Jason, 

Excellent! Thanks for keeping us apprised of your progress. 

Looking forward to testing the fix.

Cheers,
Chris
Comment 9 Jason Baron 2005-04-22 15:39:05 EDT
Created attachment 113571
restrict in_gate_area()

This patch resolved this issue for me in limited testing. I've posted it to the
linux-ia64 list for further feedback. I won't have a chance to build a test
kernel for distribution with this patch until next week, but feel free to test
it. Thanks.
Comment 10 Chris Gottbrath 2005-04-25 16:29:43 EDT
Jason, 

your patch seems to indicate that it was generated against a 2.6.9 kernel. We
downloaded a vanilla 2.6.9 kernel and applied your patch with only offsets.

We built both a vanilla (unpatched 2.6.9) and patched kernel. However I can't 
reproduce the problem with the vanilla kernel. 

Were you working from a vanilla kernel? If not were you working from some
sort of known baseline that we can replicate (such as a SRPM kernel)?

Did you verify the existence of the bug before applying your patch and verify
that it was gone in your patched kernel?

It might be useful to know that we have seen it in the 2.6.9 EL kernel listed
above and also the vanilla 2.6.11 kernel version.

Thanks,
Chris
Comment 11 Jason Baron 2005-04-26 09:17:53 EDT
hi Chris, 

The patch was indeed against 2.6.9, but specifically the Red Hat RHEL4 kernel,
which is 2.6.9 based. I did indeed verify that the issue existed before the
patch, and was fixed by the patch. I will post a link to rhel4 kernel sources
and binaries with this patch later today.

thanks,
-Jason
Comment 12 Chris Gottbrath 2005-04-26 10:14:41 EDT
Jason, 

Ok thanks! 

Cheers,
Chris
Comment 13 Jason Baron 2005-04-26 22:32:09 EDT
hi Chris,

I've placed test kernels and an SRPM with the patch at:
http://people.redhat.com/~jbaron/2.6.9-6.39.EL.gate.1.jbaron/

Please let me know if this resolves the issue.

thanks,

-Jason
Comment 14 Jason Baron 2005-04-26 22:34:20 EDT
Created attachment 113699
test program

Here is a simple test program that I used in validating the bug fix.
Comment 15 Chris Gottbrath 2005-04-27 10:00:39 EDT
Jason, 

Thanks for the rpms -- we will try these out and get back to you 
ASAP. 

Cheers,
Chris
Comment 16 Chris Gottbrath 2005-05-03 16:36:50 EDT
Jason, 

Thanks for your patience.

We've had the updated kernel installed for a while here and as far as
I can tell the fix looks good. I'm coordinating with another engineer here
and I wanted to wait to hear what he said before getting back to you. 

I don't see anything wrong with this fix. What are the next steps for 
getting this scheduled into something that our mutual customers can use?

Cheers,
Chris
Comment 17 Chris Gottbrath 2005-05-03 17:58:19 EDT
Ok. I heard back from the other engineer here at Etnus who is watching this
issue.  He is happy with the fix but he wanted me to ask one side question:

"One thing I noted: /proc/pid/maps still shows the address range
0xa000000000000000-0xa000000000020000. Can you ask them if this is correct?
I would think that it would show 0xa000000000000000-0xa000000000004000."

Is the range listed in /proc/pid/maps correct?

Cheers,
Chris
Comment 18 Chris Gottbrath 2005-05-16 14:57:51 EDT
Jason, 

What are the next steps for getting this into a kernel revision 
that our mutual customers can use?

Have you looked at all into the address discrepancy in /proc/<pid>/maps?

Cheers,
Chris
Comment 19 Jason Baron 2005-05-16 15:09:23 EDT
Hi Chris, i was out sick last week :(

This problem should be addressed in U2. If you need a fix sooner, you can
contact Red Hat support. I'll look into what to do about the discrepancy. Thanks.
Comment 22 Jason Baron 2005-06-08 11:50:49 EDT
Created attachment 115224
map holes ot 0

New patch for this issue.
Comment 23 Jason Baron 2005-06-08 11:58:37 EDT
Also, the address range in /proc/pid/maps is correct. The GATE PAGE is mapped
twice and covers 8 pages.
Comment 24 Linda Wang 2005-06-16 14:57:38 EDT
Devel ACK
Comment 25 Chris Gottbrath 2005-06-17 18:15:50 EDT
Jason,

Thanks. 

Should we test out the attached patch? Will there be any 
kind of a pre-release of update 2 for us to look at?

Cheers,
Chris
Comment 26 Jason Baron 2005-06-20 14:37:49 EDT
hi Chris,

Feel free to test the above patch. I also hope to have it integrated into
a U2 pre-release shortly, and I'll point you at that either way.

thanks,

-Jason
Comment 30 Chris Gottbrath 2005-09-14 14:44:43 EDT
Jason, 

What is the status of this fix? We are seeing a report of a similar problem
(though the output on the system console is a little different -- it says kernel
panic) in an ia64 linux system running RHEL 4 update 1 kernel version  2.6.9-11.EL.


Did this get into the RHEL 4 update 2 release stream? Is it available via RHN?

Cheers,
Chris
Comment 31 Jason Baron 2005-09-14 14:50:46 EDT
hi Chris,

Yes, this is in the rhel4 u2 beta. It is available via the rhn beta channel, or you
can just grab the kernel from: http://people.redhat.com/~jbaron/rhel4/ The
release should be official in a couple of weeks.

thanks.
Comment 32 Chris Gottbrath 2005-09-14 15:04:33 EDT
Thanks
Comment 33 John Poelstra 2005-09-16 16:21:29 EDT
Chris,

Please let us know how you make out. Thanks.
Comment 35 Red Hat Bugzilla 2005-10-05 09:01:22 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html
