Description of problem:
On an ia64 machine running a kernel based on 2.6.9-5.0.3, the TotalView debugger crashes when you try to debug a hello world program. We have had two separate user reports of this problem. One was running version 2.6.9-5.0.3.EL; the other was running 2.6.9-5.0.3.101.EC. Both produce strace files exactly like this:

ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000000000, NULL) = 282584257676671
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000000000, NULL) = 282584257676671
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000001000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000002000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000003000, NULL) = 0
ptrace(PTRACE_PEEKDATA, 8845, 0xa000000000004000, <unfinished ...>
+++ killed by SIGSEGV +++

We saw the following in /var/log/messages on kernel version 2.6.9-5.0.3.101.EC:

Apr 15 13:59:04 shannon kernel: kernel BUG at mm/memory.c:816!
Apr 15 13:59:04 shannon kernel: tv6main[3672]: bugcheck! 0 [34]
Apr 15 13:59:04 shannon kernel: Modules linked in: ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables nfs nfsd exportfs lockd md5 ipv6 parport_pc lp parport autofs4 gm(U) sunrpc ds yenta_socket pcmcia_core vfat fat dm_mod button joydev ohci_hcd ehci_hcd e100 mii tg3 ext3 jbd mptscsih mptbase sd_mod scsi_mod
Apr 15 13:59:04 shannon kernel:
Apr 15 13:59:04 shannon kernel: Pid: 3672, CPU 0, comm: tv6main
Apr 15 13:59:04 shannon kernel: psr : 0000101008126030 ifs : 80000000000010a8 ip : [<a0000001000e89c0>] Tainted: PF
Apr 15 13:59:04 shannon kernel: ip is at get_user_pages+0xb40/0xbe0
Apr 15 13:59:04 shannon kernel: unat: 0000000000000000 pfs : 00000000000010a8 rsc : 0000000000000003
Apr 15 13:59:04 shannon kernel: rnat: e0000040f8cb8000 bsps: e0000040f8cbfb60 pr : 000000000569a969
Apr 15 13:59:04 shannon kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
Apr 15 13:59:04 shannon kernel: csd : 0000000000000000 ssd : 0000000000000000
Apr 15 13:59:04 shannon kernel: b0 : a0000001000e89c0 b6 : a000000100015c20 b7 : a000000100237a80
Apr 15 13:59:04 shannon kernel: f6 : 1003e0000000000001200 f7 : 1003e8080808080808081
Apr 15 13:59:04 shannon kernel: f8 : 1003e00000000000023dc f9 : 1003e000000000e580000
Apr 15 13:59:04 shannon kernel: f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
Apr 15 13:59:04 shannon kernel: r1 : a00000010096ae80 r2 : 0000000000001000 r3 : 0000000000001000
Apr 15 13:59:04 shannon kernel: r8 : 000000000000001f r9 : 00000000000000fd r10 : a00000010077cb00
Apr 15 13:59:04 shannon kernel: r11 : 0000000000000100 r12 : e00000405190fb60 r13 : e000004051908000
Apr 15 13:59:04 shannon kernel: r14 : 0000000000004000 r15 : a0000001007028c0 r16 : a0000001007028c8
Apr 15 13:59:04 shannon kernel: r17 : e00000003e0efde8 r18 : a000000100797c50 r19 : a000000100797c50
Apr 15 13:59:04 shannon kernel: r20 : 0000000000000004 r21 : 0000000000000000 r22 : 0000000000000000
Apr 15 13:59:04 shannon kernel: r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000004
Apr 15 13:59:04 shannon kernel: r26 : e000000001448dd0 r27 : 0000000000000000 r28 : e000004051908dd4
Apr 15 13:59:04 shannon kernel: r29 : e000000001448dd4 r30 : e00000003e0e802c r31 : e0000000010145f0
Apr 15 13:59:04 shannon kernel:
Apr 15 13:59:04 shannon kernel: Call Trace:
Apr 15 13:59:04 shannon kernel: [<a000000100016a40>] show_stack +0x80/0xa0
Apr 15 13:59:04 shannon kernel: sp=e00000405190f710 bsp=e000004051909378
Apr 15 13:59:04 shannon kernel: [<a000000100017350>] show_regs +0x890/0x8c0
Apr 15 13:59:04 shannon kernel: sp=e00000405190f8e0 bsp=e000004051909330
Apr 15 13:59:04 shannon kernel: [<a00000010003c970>] die+0x150/0x240
Apr 15 13:59:04 shannon kernel: sp=e00000405190f900 bsp=e0000040519092f0
Apr 15 13:59:04 shannon kernel: [<a00000010003caa0>] die_if_kernel +0x40/0x60
Apr 15 13:59:04 shannon kernel: sp=e00000405190f900 bsp=e0000040519092c0
Apr 15 13:59:04 shannon kernel: [<a00000010003cef0>] ia64_bad_break +0x430/0x4c0
Apr 15 13:59:04 shannon kernel: sp=e00000405190f900 bsp=e000004051909298
Apr 15 13:59:04 shannon kernel: [<a00000010000f480>] ia64_leave_kernel +0x0/0x260
Apr 15 13:59:04 shannon kernel: sp=e00000405190f990 bsp=e000004051909298
Apr 15 13:59:04 shannon kernel: [<a0000001000e89c0>] get_user_pages +0xb40/0xbe0
Apr 15 13:59:04 shannon kernel: sp=e00000405190fb60 bsp=e000004051909150
Apr 15 13:59:04 shannon kernel: [<a000000100084990>] access_process_vm +0x130/0x420
Apr 15 13:59:04 shannon kernel: sp=e00000405190fb90 bsp=e0000040519090b0
Apr 15 13:59:04 shannon kernel: [<a0000001000300c0>] ia64_peek +0x80/0x4a0
Apr 15 13:59:04 shannon kernel: sp=e00000405190fbb0 bsp=e000004051909070
Apr 15 13:59:04 shannon kernel: [<a000000100033960>] sys_ptrace +0x780/0x16a0
Apr 15 13:59:04 shannon kernel: sp=e00000405190fbc0 bsp=e000004051908fb0
Apr 15 13:59:04 shannon kernel: [<a00000010000f320>] ia64_ret_from_syscall+0x0/0x20
Apr 15 13:59:04 shannon kernel: sp=e00000405190fe30 bsp=e000004051908fb0
Apr 15 13:59:04 shannon kernel: [<a000000000010640>] 0xa000000000010640
Apr 15 13:59:04 shannon kernel: sp=e000004051910000 bsp=e000004051908fb0

Version-Release number of selected component (if applicable):
kernel-2.6.9-5.0.3.EL

How reproducible:
1. Compile this simple test code.

/* this is b.c file */
#include <stdio.h>
#include <stdlib.h>
int main()
{
    char *p;
    p = malloc(12000);
    return 0;
}

% gcc -g -O0 -L/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/lib -ltvheap -Wl,-rpath,/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/lib -o b b.c

2. Run totalview on b.

# ./totalviewcli b
Linux IA64 TotalView 6.7.0-2
Copyright 1999-2005 by Etnus, LLC. ALL RIGHTS RESERVED.
Copyright 1999 by Etnus, Inc.
Copyright 1996-1998 by Dolphin Interconnect Solutions, Inc.
Copyright 1989-1996 by BBN Inc.
Reading symbols for process 1, executing "b"
Library /usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b, with 2 asects, was linked at 0x4000000000000000, and initially loaded at 0xff00000040000000
Mapping 593 bytes of ELF string data from '/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b'...done
Skimming 335 bytes of DWARF '.debug_info' symbols from '/usr/local/toolworks/totalview.6.7.0-2/linux-ia64/bin/b'...done
Segmentation fault

Actual results:
The debugger is killed with a segfault.

Expected results:
The debugger should not have been killed with a segfault.

Additional info:
Created attachment 113337 [details] 1 of 2 trace files We have two customer generated strace files for this issue. This is 1 of 2.
Created attachment 113338 [details] strace file 2 of 2 We have two customer-generated strace files that document this problem. This is 2 of 2.
What is the 'gm' module? Does the problem still happen without that module loaded?
The gm module is the mpich-gm module. It is a special-purpose communication module for the Myrinet high-speed interconnect. However -- the test case is basically a hello world (with a malloc) -- so I doubt that the GM module is needed to reproduce this.
Dave, the gm module is not required to reproduce this. We built a vanilla (newly installed -- not updated at all) RHEL 4 system and were able to reproduce this right away. Thanks, Chris
Furthermore -- we appear to be able to reproduce this in GDB with the following command: print *(long *)0xa000000000004000
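For anyone who wants to reproduce this without a debugger, here is a minimal sketch of the kind of read that the strace output and the GDB command both boil down to: one PTRACE_PEEKDATA on a stopped tracee. It assumes Linux with glibc. The names peek_word and peek_in_child are hypothetical helpers written for this comment and are not taken from TotalView, GDB, or the attached test program. Going by the strace above, calling peek_in_child(0xa000000000004000UL, &v) on an affected ia64 kernel would be expected to hit the kernel BUG, so only try that on a test machine.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read one word from a stopped tracee.
 * PTRACE_PEEKDATA returns the word itself, so a return value of -1 is
 * ambiguous: errno has to be cleared before the call and checked after.
 * Returns 0 on success, -1 on failure (errno is left as set by ptrace). */
static int peek_word(pid_t pid, unsigned long addr, long *out)
{
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    if (word == -1 && errno != 0)
        return -1;
    *out = word;
    return 0;
}

/* Hypothetical harness: fork a child that asks to be traced and stops
 * itself, peek at addr in that child, then kill and reap it.
 * The child is a copy of the caller, so both share the same layout. */
static int peek_in_child(unsigned long addr, long *out)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: bail out if tracing is refused so the parent never hangs. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return -1; /* child exited instead of stopping */

    int rc = peek_word(pid, addr, out);
    int saved_errno = errno;

    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);

    errno = saved_errno;
    return rc;
}
```

On an unaffected system, peeking at an unmapped address should simply fail with an errno value instead of killing the caller.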
OK. I have a good idea what the problem is. I'll post a fix as soon as I can. Thanks.
Jason, Excellent! Thanks for keeping us apprised of your progress. Looking forward to testing the fix. Cheers, Chris
Created attachment 113571 [details] restrict in_gate_area() This patch resolved this issue for me in limited testing. I've posted it to the linux-ia64 list for further feedback. I won't have a chance to build a test kernel for distribution with this patch until next week, but feel free to test it. Thanks.
Jason, your patch seems to indicate that it was generated against a 2.6.9 kernel. We downloaded a vanilla 2.6.9 kernel and applied your patch with only offsets. We built both a vanilla (unpatched 2.6.9) and a patched kernel. However, I can't reproduce the problem with the vanilla kernel. Were you working from a vanilla kernel? If not, were you working from some sort of known baseline that we can replicate (such as an SRPM kernel)? Did you verify the existence of the bug before applying your patch and verify that it was gone in your patched kernel? It might be useful to know that we have seen it in the 2.6.9 EL kernel listed above and also in the vanilla 2.6.11 kernel. Thanks, Chris
Hi Chris, The patch was indeed against 2.6.9, but specifically the Red Hat RHEL4 kernel, which is 2.6.9 based. I did indeed verify that the issue existed before the patch and was fixed by the patch. I will post a link to RHEL4 kernel sources and binaries with this patch later today. Thanks, -Jason
Jason, Ok thanks! Cheers, Chris
hi Chris, I've placed test kernels and an SRPM with the patch at: http://people.redhat.com/~jbaron/2.6.9-6.39.EL.gate.1.jbaron/ Please let me know if this resolves the issue. thanks, -Jason
Created attachment 113699 [details] test program Here is a simple test program that I used in validating the bug fix.
Jason, Thanks for the rpms -- we will try these out and get back to you ASAP. Cheers, Chris
Jason, Thanks for your patience. We've had the updated kernel installed for a while here and as far as I can tell the fix looks good. I'm coordinating with another engineer here and I wanted to wait to hear what he said before getting back to you. I don't see anything wrong with this fix. What are the next steps for getting this scheduled into something that our mutual customers can use? Cheers, Chris
OK. I heard back from the other engineer here at Etnus who is watching this issue. He is happy with the fix but he wanted me to ask one side question: "One thing I noted: /proc/pid/maps still shows the address range 0xa000000000000000-0xa000000000020000. Can you ask them if this is correct? I would think that it would show 0xa000000000000000-0xa000000000004000." Is the range listed in /proc/pid/maps correct? Cheers, Chris
Jason, What are the next steps for getting this into a kernel revision that our mutual customers can use? Have you looked at all into the address discrepancy in /proc/<pid>/maps? Cheers, Chris
Hi Chris, I was out sick last week :( This problem should be addressed in U2. If you need a fix sooner, you can contact Red Hat support. I'll look into what to do about the discrepancy. Thanks.
Created attachment 115224 [details] map holes to 0 New patch for this issue.
Also, the address range in /proc/pid/maps is correct. The GATE PAGE is mapped twice and covers 8 pages.
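As a quick check of the arithmetic, the sketch below turns a /proc/<pid>/maps range into a page count. It rests on one assumption: a 16 KB page size, which is what 8 pages over the 0x20000-byte range works out to (0x20000 / 0x4000 = 8). The function name pages_in_maps_line is hypothetical, and the permission and offset fields in the sample line are illustrative rather than copied from a real maps file.

```c
#include <stdio.h>

/* Parse the leading "start-end" hex range of a /proc/<pid>/maps line and
 * return how many pages of page_size bytes it covers.
 * Returns -1 if the line does not start with a valid range. */
static long pages_in_maps_line(const char *line, unsigned long long page_size)
{
    unsigned long long start, end;

    if (page_size == 0)
        return -1;
    if (sscanf(line, "%llx-%llx", &start, &end) != 2)
        return -1;
    if (end < start)
        return -1;

    return (long)((end - start) / page_size);
}
```

With the assumed 16 KB page size, the range reported in /proc/pid/maps (0xa000000000000000-0xa000000000020000) comes out to 8 pages, and the range the Etnus engineer expected (ending at 0xa000000000004000) would be a single page.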
Devel ACK
Jason, Thanks. Should we test out the attached patch? Will there be any kind of a pre-release of update 2 for us to look at? Cheers, Chris
Hi Chris, Feel free to test the above patch. I also hope to have it integrated into a U2 pre-release shortly. Either way, I'll point you at that. Thanks, -Jason
Jason, What is the status of this fix? We are seeing a report of a similar problem (though the output on the system console is a little different -- it says kernel panic) in an ia64 linux system running RHEL 4 update 1 kernel version 2.6.9-11.EL. Did this get into the RHEL 4 update 2 release stream? Is it available via RHN? Cheers, Chris
Hi Chris, Yes, this is in the RHEL4 U2 beta. It is available via the RHN beta channel, or you can just grab the kernel from: http://people.redhat.com/~jbaron/rhel4/ The release should be official in a couple of weeks. Thanks.
Thanks
Chris, Please let us know how you make out. Thanks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html