Bug 103254

Summary: Kernel crashes on Itanium after few minutes with message indicating compilation errors
Product: Red Hat Enterprise Linux 3 Reporter: Albert Fluegel <tdsc.af>
Component: kernel Assignee: Jason Baron <jbaron>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: knoel
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-05-13 22:29:31 UTC Type: ---
Bug Blocks: 101028    
Attachments:
the binary that core dumps followed by a kernel crash (flags: none)
/var/log/messages snippet when doing I/O testing on external disks (flags: none)

Description Albert Fluegel 2003-08-28 07:15:19 UTC
Description of problem:
Kernel 2.4.21-1.1931.2.411 crashes on Itanium after several
seconds to a few minutes with the following message on the console:
sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
kernel BUG at
/usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Unable to handle kernel NULL pointer dereferencemelim[3570]: Oops 8804682956800
Pid: 3570, comm:                melim
EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.411.ent)
psr : 0000101008026018 ifs : 8000000000000e24 ip  : [<e00000000446f6e0>]    Not
tainted
unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
rnat: e000000004b6ef90 bsps: e000000004b6ef90 pr  : 8002924155aaaa65
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
b0  : e00000000446f6d0 b6  : e0000000047fa1a0 b7  : e000000004646f40
f6  : 0fffbccccccccc8c00000 f7  : 0ffdcb640000000000000
f8  : 100029000000000000000 f9  : 10002a000000000000000
r1  : e000000004c9fd00 r2  : e0000040e8d5003c r3  : e0000040fe19003c
r8  : 0000000000000066 r9  : e000000004a72990 r10 : 0000000000001300
r11 : 0000000000000001 r12 : e0000040e8d56f50 r13 : e0000040e8d50000
r14 : 0000000000000074 r15 : 0000000000000000 r16 : 0000000000000000
r17 : 0000000000004000 r18 : 0000000000004000 r19 : 0000000000001300
r20 : e000000004a71690 r21 : 0000000000000013 r22 : 0000000000000009
r23 : 0000000000004000 r24 : e000000004a70400 r25 : e000000004b60ad0
r26 : 0000000000000001 r27 : 0000000000000013 r28 : e000000004a72030
r29 : 0000000000000073 r30 : e0000040fe190028 r31 : 0000000000000001

Call Trace: [<e0000000044155c0>] sp=0xe0000040e8d56b60 bsp=0xe0000040e8d51468
show_stack [kernel] 0x80
[<e000000004430410>] sp=0xe0000040e8d56d20 bsp=0xe0000040e8d51438 die [kernel] 0x1b0
[<e000000004451e30>] sp=0xe0000040e8d56d20 bsp=0xe0000040e8d513d8
ia64_do_page_fault [kernel] 0x310
[<e00000000440e680>] sp=0xe0000040e8d56db0 bsp=0xe0000040e8d513d8
ia64_leave_kernel [kernel] 0x0
[<e00000000446f6e0>] sp=0xe0000040e8d56f50 bsp=0xe0000040e8d512b8 elf_core_dump
[kernel] 0x640
[<e00000000452e280>] sp=0xe0000040e8d57d80 bsp=0xe0000040e8d51260 do_coredump
[kernel] 0x500
[<e0000000044a7fb0>] sp=0xe0000040e8d57dd0 bsp=0xe0000040e8d511e8
get_signal_to_deliver [kernel] 0x630
[<e00000000442eab0>] sp=0xe0000040e8d57dd0 bsp=0xe0000040e8d51180 ia64_do_signal
[kernel] 0xd0
[<e00000000440eac0>] sp=0xe0000040e8d57e50 bsp=0xe0000040e8d51130
handle_signal_delivery [kernel] 0x40
[<e00000000440e6f0>] sp=0xe0000040e8d57e60 bsp=0xe0000040e8d51130
ia64_leave_kernel [kernel] 0x70
Kernel panic: Fatal exception
Aug 27 17:41:50 ltuih001 kernel: sizeof(elf_gregset_t) (1024) != sizeof(struct
pt_regs) (400)
Aug 27 17:41:50 ltuih001 kernel: kernel BUG at
/usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Aug 27

It seems to me that there is a problem with type declarations. I observed
another problem concerning the header files (sigstack.h type
definitions); probably there are too many changes going on.
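
The first console line reads like a runtime comparison of the sizes of two
register layouts. Below is a minimal user-space sketch of such a check. It is
not taken from the kernel source: demo_gregset_t, demo_pt_regs and
demo_sizes_match are hypothetical stand-ins, sized only to match the two
numbers in the message (1024 and 400).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins, sized to match the numbers in the console message. */
typedef uint64_t demo_gregset_t[128];             /* 128 * 8 = 1024 bytes */
struct demo_pt_regs { unsigned char raw[400]; };  /* 400 bytes */

/* Returns 1 if both layouts have the same size, 0 otherwise.
 * On mismatch, prints a line shaped like the console message. */
int demo_sizes_match(void)
{
    if (sizeof(demo_gregset_t) != sizeof(struct demo_pt_regs)) {
        fprintf(stderr,
                "sizeof(demo_gregset_t) (%zu) != sizeof(struct demo_pt_regs) (%zu)\n",
                sizeof(demo_gregset_t), sizeof(struct demo_pt_regs));
        return 0;
    }
    return 1;
}
```

With these stand-in sizes the check always fails, which mirrors the situation
reported here: the two sizes are fixed at compile time, so such a mismatch
would show up on every core dump attempt rather than intermittently.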

Version-Release number of selected component (if applicable):
2.4.21-1.1931.2.411

(I had 2.4.21-1.1931.2.399 installed before, which had another problem
but did not crash that way.)

How reproducible:
Install kernel-2.4.21-1.1931.2.411, boot, and wait a few minutes

Steps to Reproduce:
1. Install kernel-2.4.21-1.1931.2.411
2. Boot
3. Wait a few minutes
    
Actual results:
Kernel crash (the network interface is still alive and pingable, but no
process runs any more; the machine can still be rebooted using sysrq, i.e.
<BREAK>b on the console, so the kernel is not completely dead)

Expected results:
OS and thus processes continue to run

Additional info:

Comment 1 Albert Fluegel 2003-08-28 07:15:50 UTC
Same kernel version works perfectly on Opteron


Comment 2 Bill Nottingham 2003-08-28 16:01:38 UTC
Did you have an app that segfaulted that caused the core dumping code to execute?

Comment 3 Albert Fluegel 2003-08-28 16:20:14 UTC
Created attachment 94039 [details]
the binary that core dumps followed by a kernel crash

This is the melim binary from Platform Computing Inc., shipped
with the LSF software, version 5.1, see:
http://www.platform.com/products/LSF/

Comment 4 Albert Fluegel 2003-08-29 12:51:36 UTC
Additional findings: the kernel crash occurs exactly when this program
gets a SIGTERM. I attached strace to the process and the last thing I
see is:
strace -f -p 3135
Process 3135 attached - interrupt to quit
select(0x1, 0xbffffa58, 0, 0xbffff9d8, 0xbffff9cc) = -514
--- SIGTERM (Terminated) @ 400165ce (5009) ---
sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
kernel BUG at
/usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Unable to handle kernel NULL pointer dereferencemelim[3135]: Oops 8804682956800

Pid: 3135, comm:                melim

and the rest is like already reported.


Comment 5 Albert Fluegel 2003-08-29 15:03:00 UTC
Here's what happens on 2.4.21-1.1931.2.393. The main difference is
that the machine does not stop working. Output on the console when the
melim program gets SIGTERM:
IA32 syscall #252 issued, maybe we should implement it
Aug 29 16:49:47 ltuii002 kernel: IA32 syscall #252 issued, maybe we should
implement it
sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
kernel BUG at
/usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Unable to handle kernel NULL pointer dereferencemelim[3468]: Oops 8804682956800

Pid: 3468, comm:                melim
EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.393.ent)
psr : 0000101008026018 ifs : 8000000000000e24 ip  : [<e00000000446f260>]
Not tainted
unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
rnat: 00000000000000bf bsps: 0000000000000fff pr  : 8002924155aa9967
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
b0  : e00000000446f250 b6  : e0000000047f80e0 b7  : e0000000047f4fa0
f6  : 0fffbccccccccc8c00000 f7  : 0ffdcb640000000000000
f8  : 100029000000000000000 f9  : 10002a000000000000000
r1  : e000000004c9bd00 r2  : e0000000018a7e60 r3  : 000000000000416a
r8  : 0000000000000066 r9  : 0000000000000000 r10 : 0000000000000000
r11 : e0000000018a0000 r12 : e00000001b05ef50 r13 : e00000001b058000
r14 : 0000000000000001 r15 : 0000000000000000 r16 : e0000000018a7e48
r17 : 0000000000004000 r18 : 0000000000004000 r19 : e000000004b68580
r20 : e000000004abb8e8 r21 : e0000000047f4d60 r22 : 0000000000020000
r23 : e000000004b66d70 r24 : 0000000000000060 r25 : 0000000000000000
r26 : 0000000000000000 r27 : 00000000100000c0 r28 : 0000000000800000
r29 : 0000000000000001 r30 : e000000000025a00 r31 : e000000004b66d70

Call Trace: [<e0000000044155c0>] sp=0xe00000001b05eb60 bsp=0xe00000001b059460
show_stack [kernel] 0x80
[<e000000004430150>] sp=0xe00000001b05ed20 bsp=0xe00000001b059438 die [kernel]
0x1b0
[<e000000004451a70>] sp=0xe00000001b05ed20 bsp=0xe00000001b0593d8
ia64_do_page_fault [kernel] 0x310
[<e00000000440e680>] sp=0xe00000001b05edb0 bsp=0xe00000001b0593d8
ia64_leave_kernel [kernel] 0x0
[<e00000000446f260>] sp=0xe00000001b05ef50 bsp=0xe00000001b0592b8
elf_core_dump [kernel] 0x640
[<e00000000452cae0>] sp=0xe00000001b05fd80 bsp=0xe00000001b059260 do_coredump
[kernel] 0x500
[<e0000000044a7810>] sp=0xe00000001b05fdd0 bsp=0xe00000001b0591e8
get_signal_to_deliver [kernel] 0x630
[<e00000000442e7f0>] sp=0xe00000001b05fdd0 bsp=0xe00000001b059180
ia64_do_signal [kernel] 0xd0
[<e00000000440eac0>] sp=0xe00000001b05fe50 bsp=0xe00000001b059130
handle_signal_delivery [kernel] 0x40
[<e00000000440e6f0>] sp=0xe00000001b05fe60 bsp=0xe00000001b059130
ia64_leave_kernel [kernel] 0x70
Aug 29 16:49:57 ltuii002 kernel: sizeof(elf_gregset_t) (1024) !=
sizeof(struct pt_regs) (400)
Aug 29 16:49:57 ltuii002 kernel: kernel BUG at
/usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Aug 29 16:49:57 ltuii002 kernel: Unable to handle kernel NULL pointer
dereferencemelim[3468]: Oops 8804682956800
Aug 29 16:49:57 ltuii002 kernel:
Aug 29 16:49:57 ltuii002 kernel: Pid: 3468, comm:                melim
Aug 29 16:49:57 ltuii002 kernel: EIP is at elf_core_dump [kernel] 0x640
(2.4.21-1.1931.2.393.ent)
Aug 29 16:49:57 ltuii002 kernel: psr : 0000101008026018 ifs : 8000000000000e24
ip  : [<e00000000446f260>]    Not tainted
Aug 29 16:49:57 ltuii002 kernel: unat: 0000000000000000 pfs : 0000000000000e24
rsc : 0000000000000003
Aug 29 16:49:57 ltuii002 kernel: rnat: 00000000000000bf bsps: 0000000000000fff
pr  : 8002924155aa9967
Aug 29 16:49:57 ltuii002 kernel: ldrs: 0000000000000000 ccv : 0000000000000000
fpsr: 0009804c8a70033f
Aug 29 16:49:57 ltuii002 kernel: b0  : e00000000446f250 b6  : e0000000047f80e0
b7  : e0000000047f4fa0
Aug 29 16:49:57 ltuii002 kernel: f6  : 0fffbccccccccc8c00000 f7  :
0ffdcb640000000000000
Aug 29 16:49:57 ltuii002 kernel: f8  : 100029000000000000000 f9  :
10002a000000000000000
Aug 29 16:49:57 ltuii002 kernel: r1  : e000000004c9bd00 r2  : e0000000018a7e60
r3  : 000000000000416a
Aug 29 16:49:57 ltuii002 kernel: r8  : 0000000000000066 r9  : 0000000000000000
r10 : 0000000000000000
Aug 29 16:49:57 ltuii002 kernel: r11 : e0000000018a0000 r12 : e00000001b05ef50
r13 : e00000001b058000
Aug 29 16:49:57 ltuii002 kernel: r14 : 0000000000000001 r15 : 0000000000000000
r16 : e0000000018a7e48
Aug 29 16:49:58 ltuii002 kernel: r17 : 0000000000004000 r18 : 0000000000004000
r19 : e000000004b68580
Aug 29 16:49:58 ltuii002 kernel: r20 : e000000004abb8e8 r21 : e0000000047f4d60
r22 : 0000000000020000
Aug 29 16:49:58 ltuii002 kernel: r23 : e000000004b66d70 r24 : 0000000000000060
r25 : 0000000000000000
Aug 29 16:49:58 ltuii002 kernel: r26 : 0000000000000000 r27 : 00000000100000c0
r28 : 0000000000800000
Aug 29 16:49:58 ltuii002 kernel: r29 : 0000000000000001 r30 : e000000000025a00
r31 : e000000004b66d70
Aug 29 16:49:58 ltuii002 kernel:
Aug 29 16:49:58 ltuii002 kernel: Call Trace: [<e0000000044155c0>]
sp=0xe00000001b05eb60 bsp=0xe00000001b059460 show_stack [kernel] 0x80
Aug 29 16:49:58 ltuii002 kernel: [<e000000004430150>] sp=0xe00000001b05ed20
bsp=0xe00000001b059438 die [kernel] 0x1b0
Aug 29 16:49:58 ltuii002 kernel: [<e000000004451a70>] sp=0xe00000001b05ed20
bsp=0xe00000001b0593d8 ia64_do_page_fault [kernel] 0x310
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440e680>] sp=0xe00000001b05edb0
bsp=0xe00000001b0593d8 ia64_leave_kernel [kernel] 0x0
Aug 29 16:49:58 ltuii002 kernel: [<e00000000446f260>] sp=0xe00000001b05ef50
bsp=0xe00000001b0592b8 elf_core_dump [kernel] 0x640
Aug 29 16:49:58 ltuii002 kernel: [<e00000000452cae0>] sp=0xe00000001b05fd80
bsp=0xe00000001b059260 do_coredump [kernel] 0x500
Aug 29 16:49:58 ltuii002 kernel: [<e0000000044a7810>] sp=0xe00000001b05fdd0
bsp=0xe00000001b0591e8 get_signal_to_deliver [kernel] 0x630
Aug 29 16:49:58 ltuii002 kernel: [<e00000000442e7f0>] sp=0xe00000001b05fdd0
bsp=0xe00000001b059180 ia64_do_signal [kernel] 0xd0
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440eac0>] sp=0xe00000001b05fe50
bsp=0xe00000001b059130 handle_signal_delivery [kernel] 0x40
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440e6f0>] sp=0xe00000001b05fe60
bsp=0xe00000001b059130 ia64_leave_kernel [kernel] 0x70
<4>IA32 syscall #252 issued, maybe we should implement it
Aug 29 16:50:10 ltuii002 kernel:  <4>IA32 syscall #252 issued, maybe we should
implement it

Could it have something to do with the 32-bit nanosleep implementation?
I saw that call once in gdb just before the machine went down
with the .411 kernel:

Program received signal SIGTERM, Terminated.
0x400165ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) s
Single stepping until exit from function _dl_sysinfo_int80, 
which has no line number information.
(now kill <pid>)

Program received signal SIGSEGV, Segmentation fault.
0x400eda8e in nanosleep () from /lib/tls/libc.so.6
(gdb) s
Single stepping until exit from function nanosleep, 
which has no line number information.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) s

BTW, it is not possible to strace the process termination
on the .393 kernel. The only thing I get is:

select(0x1, 0xbffff828, 0, 0xbffff7a8, 0xbffff79c) = -514
--- SIGTERM (Terminated) @ 400165ce (bfa) ---
Process 3450 detached


Comment 6 Albert Fluegel 2003-09-01 11:07:23 UTC
Here's how to reproduce the problem (the kernel messages are as
reported, but without the kernel crash; in my opinion this
should be sufficient to locate the issue):

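Write a trivial program that immediately dumps core, e.g.:

int main(void)
{
  /* Store to an unmapped low address to force a SIGSEGV and a core dump. */
  *((volatile char *) 2) = 5;
  return 0;
}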

Compile it on an x86 machine (e.g. Xeon) as an i386
executable, then start it on an Itanium. It is important that
the coredumpsize resource is not set to 0, so first set it
to unlimited (e.g. for csh: limit coredumpsize unlimited, or
for sh: ulimit -c unlimited). The following messages
immediately appear in the syslog:

Sep  1 12:58:55 ltuii002 kernel: sizeof(elf_gregset_t) (1024) != sizeof(struct
pt_regs) (400)
Sep  1 12:58:55 ltuii002 kernel: kernel BUG at
/usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Sep  1 12:58:55 ltuii002 kernel: Unable to handle kernel NULL pointer
dereferences[22128]: Oops 8804682956800
Sep  1 12:58:55 ltuii002 kernel: 
Sep  1 12:58:55 ltuii002 kernel: Pid: 22128, comm:                    s
Sep  1 12:58:55 ltuii002 kernel: EIP is at elf_core_dump [kernel] 0x640
(2.4.21-1.1931.2.393.ent)
Sep  1 12:58:55 ltuii002 kernel: psr : 0000101008026038 ifs : 8000000000000e24
ip  : [<e00000000446f260>]    Not tainted
Sep  1 12:58:55 ltuii002 kernel: unat: 0000000000000000 pfs : 0000000000000e24
rsc : 0000000000000003
Sep  1 12:58:55 ltuii002 kernel: rnat: 00000000000000bf bsps: 0000000000000fff
pr  : 8002924155aa9967
Sep  1 12:58:55 ltuii002 kernel: ldrs: 0000000000000000 ccv : 0000000000000000
fpsr: 0009804c8a70033f
Sep  1 12:58:55 ltuii002 kernel: b0  : e00000000446f250 b6  : e0000000044bc760
b7  : e0000000047f4fa0
Sep  1 12:58:55 ltuii002 kernel: f6  : 0fffbccccccccc8c00000 f7  :
0ffdcb640000000000000
Sep  1 12:58:55 ltuii002 kernel: f8  : 100029000000000000000 f9  :
10002a000000000000000
Sep  1 12:58:55 ltuii002 kernel: r1  : e000000004c9bd00 r2  : e00000003ed57e60
r3  : 000000000001f584
Sep  1 12:58:55 ltuii002 kernel: r8  : 0000000000000066 r9  : 0000000000000000
r10 : 0000000000000000
Sep  1 12:58:55 ltuii002 kernel: r11 : e00000003ed50000 r12 : e00000000d5fef50
r13 : e00000000d5f8000
Sep  1 12:58:55 ltuii002 kernel: r14 : 0000000000000001 r15 : 0000000000000000
r16 : e00000003ed57e48
Sep  1 12:58:55 ltuii002 kernel: r17 : 0000000000004000 r18 : 0000000000004000
r19 : e000000004b68580
Sep  1 12:58:55 ltuii002 kernel: r20 : e000000004abb8e8 r21 : e0000000047f4d60
r22 : 0000000000020000
Sep  1 12:58:55 ltuii002 kernel: r23 : e000000004b66d70 r24 : 0000000000000060
r25 : 0000000000000000
Sep  1 12:58:55 ltuii002 kernel: r26 : 0000000000000000 r27 : 00000000100000c0
r28 : 0000000000800000
Sep  1 12:58:55 ltuii002 kernel: r29 : 0000000000000001 r30 : e000000000025a00
r31 : e000000004b66d70
Sep  1 12:58:55 ltuii002 kernel: 
Sep  1 12:58:55 ltuii002 kernel: Call Trace: [<e0000000044155c0>]
sp=0xe00000000d5feb60 bsp=0xe00000000d5f9460 show_stack [kernel] 0x80
Sep  1 12:58:55 ltuii002 kernel: [<e000000004430150>] sp=0xe00000000d5fed20
bsp=0xe00000000d5f9438 die [kernel] 0x1b0
Sep  1 12:58:55 ltuii002 kernel: [<e000000004451a70>] sp=0xe00000000d5fed20
bsp=0xe00000000d5f93d8 ia64_do_page_fault [kernel] 0x310
Sep  1 12:58:55 ltuii002 kernel: [<e00000000440e680>] sp=0xe00000000d5fedb0
bsp=0xe00000000d5f93d8 ia64_leave_kernel [kernel] 0x0
Sep  1 12:58:55 ltuii002 kernel: [<e00000000446f260>] sp=0xe00000000d5fef50
bsp=0xe00000000d5f92b8 elf_core_dump [kernel] 0x640
Sep  1 12:58:55 ltuii002 kernel: [<e00000000452cae0>] sp=0xe00000000d5ffd80
bsp=0xe00000000d5f9260 do_coredump [kernel] 0x500
Sep  1 12:58:55 ltuii002 kernel: [<e0000000044a7810>] sp=0xe00000000d5ffdd0
bsp=0xe00000000d5f91e8 get_signal_to_deliver [kernel] 0x630
Sep  1 12:58:55 ltuii002 kernel: [<e00000000442e7f0>] sp=0xe00000000d5ffdd0
bsp=0xe00000000d5f9180 ia64_do_signal [kernel] 0xd0
Sep  1 12:58:55 ltuii002 kernel: [<e00000000440eac0>] sp=0xe00000000d5ffe50
bsp=0xe00000000d5f9130 handle_signal_delivery [kernel] 0x40
Sep  1 12:58:55 ltuii002 kernel: [<e00000000440e6f0>] sp=0xe00000000d5ffe60
bsp=0xe00000000d5f9130 ia64_leave_kernel [kernel] 0x70
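
The reproduction steps above can also be driven from a single program. The
following is only a sketch under assumptions: run_crashing_child is a
hypothetical helper name, the child raises its own core size limit instead of
relying on the shell's limit/ulimit setting, and the faulting store is the
same one as in the trivial program. Building it as an i386 executable and
running it on an Itanium remain manual steps.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that lifts its core size limit and then faults.
 * Returns the child's wait status, or -1 if fork or waitpid failed. */
int run_crashing_child(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit lim;

        /* Rough equivalent of "ulimit -c unlimited", capped at the hard limit. */
        if (getrlimit(RLIMIT_CORE, &lim) == 0) {
            lim.rlim_cur = lim.rlim_max;
            setrlimit(RLIMIT_CORE, &lim);
        }
        /* Same faulting store as the reproducer: address 2 is not mapped. */
        *((volatile char *) 2) = 5;
        _exit(0);   /* only reached if the store did not fault */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

A caller can inspect the result with WIFSIGNALED(status) and WTERMSIG(status)
to confirm that the child died from SIGSEGV, which is the point at which the
kernel's core dump path runs.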


Comment 7 Albert Fluegel 2003-09-01 14:09:34 UTC
Maybe I'm wrong, but as far as I can see from the code, ia32 core dumps
are not really supported under Itanium Linux. So it would probably be
better to hardcode coredumpsize = 0 for now?
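
Until a fixed kernel is installed, a process can apply the same idea to itself
from user space. This is a minimal sketch, not a kernel change:
disable_core_dumps is a hypothetical helper name, and it lowers only the soft
limit so the setting stays reversible for the process.

```c
#include <sys/resource.h>

/* Set the soft core size limit to 0 so no core dump is attempted.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int disable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = 0;   /* the hard limit (rlim_max) is left unchanged */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

The limit is inherited across fork and exec, so calling this in a wrapper
before starting an ia32 binary has the same effect as running
ulimit -c 0 in the launching shell.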


Comment 8 Lucio DiGiovanni 2004-01-27 18:49:56 UTC
Created attachment 97280 [details]
/var/log/messages snippet when doing I/O testing on external disks

The system seems to work OK, but I am perplexed by these messages clogging up
the system logfile.

Comment 9 Jason Baron 2004-05-13 22:29:31 UTC
This has long since been fixed. Please update the kernel. Closing.