Description of problem:
Kernel 2.4.21-1.1931.2.411 crashes on Itanium after several seconds to a few minutes with the following message on console:

sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
kernel BUG at /usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Unable to handle kernel NULL pointer dereferencemelim[3570]: Oops 8804682956800

Pid: 3570, comm: melim
EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.411.ent)
psr : 0000101008026018 ifs : 8000000000000e24 ip : [<e00000000446f6e0>] Not tainted
unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
rnat: e000000004b6ef90 bsps: e000000004b6ef90 pr : 8002924155aaaa65
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
b0 : e00000000446f6d0 b6 : e0000000047fa1a0 b7 : e000000004646f40
f6 : 0fffbccccccccc8c00000 f7 : 0ffdcb640000000000000
f8 : 100029000000000000000 f9 : 10002a000000000000000
r1 : e000000004c9fd00 r2 : e0000040e8d5003c r3 : e0000040fe19003c
r8 : 0000000000000066 r9 : e000000004a72990 r10 : 0000000000001300
r11 : 0000000000000001 r12 : e0000040e8d56f50 r13 : e0000040e8d50000
r14 : 0000000000000074 r15 : 0000000000000000 r16 : 0000000000000000
r17 : 0000000000004000 r18 : 0000000000004000 r19 : 0000000000001300
r20 : e000000004a71690 r21 : 0000000000000013 r22 : 0000000000000009
r23 : 0000000000004000 r24 : e000000004a70400 r25 : e000000004b60ad0
r26 : 0000000000000001 r27 : 0000000000000013 r28 : e000000004a72030
r29 : 0000000000000073 r30 : e0000040fe190028 r31 : 0000000000000001

Call Trace: [<e0000000044155c0>] sp=0xe0000040e8d56b60 bsp=0xe0000040e8d51468 show_stack [kernel] 0x80
[<e000000004430410>] sp=0xe0000040e8d56d20 bsp=0xe0000040e8d51438 die [kernel] 0x1b0
[<e000000004451e30>] sp=0xe0000040e8d56d20 bsp=0xe0000040e8d513d8 ia64_do_page_fault [kernel] 0x310
[<e00000000440e680>] sp=0xe0000040e8d56db0 bsp=0xe0000040e8d513d8 ia64_leave_kernel [kernel] 0x0
[<e00000000446f6e0>] sp=0xe0000040e8d56f50 bsp=0xe0000040e8d512b8 elf_core_dump [kernel] 0x640
[<e00000000452e280>] sp=0xe0000040e8d57d80 bsp=0xe0000040e8d51260 do_coredump [kernel] 0x500
[<e0000000044a7fb0>] sp=0xe0000040e8d57dd0 bsp=0xe0000040e8d511e8 get_signal_to_deliver [kernel] 0x630
[<e00000000442eab0>] sp=0xe0000040e8d57dd0 bsp=0xe0000040e8d51180 ia64_do_signal [kernel] 0xd0
[<e00000000440eac0>] sp=0xe0000040e8d57e50 bsp=0xe0000040e8d51130 handle_signal_delivery [kernel] 0x40
[<e00000000440e6f0>] sp=0xe0000040e8d57e60 bsp=0xe0000040e8d51130 ia64_leave_kernel [kernel] 0x70
Kernel panic: Fatal exception
Aug 27 17:41:50 ltuih001 kernel: sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
Aug 27 17:41:50 ltuih001 kernel: kernel BUG at /usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Aug 27

It seems to me that there is a problem with type declarations. I observed another problem concerning the header files (sigstack.h type definitions); probably there are too many changes going on.

Version-Release number of selected component (if applicable):
2.4.21-1.1931.2.411 (had 2.4.21-1.1931.2.399 installed before, which had another problem but did not crash that way)

How reproducible:
Install kernel-2.4.21-1.1931.2.411, boot, and wait a few minutes.

Steps to Reproduce:
1. Install kernel-2.4.21-1.1931.2.411
2. Boot
3. Wait a few minutes

Actual results:
Kernel crash (the network interface is still alive and pingable, but no process runs any more; the machine can still be rebooted using sysrq, i.e. <BREAK>b on console, so the kernel is not completely dead).

Expected results:
OS and thus processes continue to run.

Additional info:
Same kernel version works perfectly on Opteron
Did you have an app that segfaulted that caused the core dumping code to execute?
Created attachment 94039 [details] the binary that core dumps followed by a kernel crash This is the melim binary from Platform computing Inc. coming with the LSF software, version 5.1, see: http://www.platform.com/products/LSF/
Additional findings: the kernel crash occurs exactly when this program gets a SIGTERM. I attached strace to the process, and the last thing I see is the following (the strace output is interleaved with the kernel console messages):

strace -f -p 3135
Process 3135 attached - interrupt to quit
select(0x1, 0xbffffa58, 0, 0xbffff9d8, 0xbffff9cc) = -514
--- SIGTERM (Tersizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
minated) @ 40016kernel BUG at /usr/src/build/297471-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
5ce (5009) ---
Unable to handle kernel NULL pointer dereferencemelim[3135]: Oops 8804682956800

Pid: 3135, comm: melim

The rest is as already reported.
Here is what happens on 2.4.21-1.1931.2.393; the main difference is that the machine does not stop working. Output on console when the melim program gets SIGTERM:

IA32 syscall #252 issued, maybe we should implement it
Aug 29 16:49:47 ltuii002 kernel: IA32 syscall #252 issued, maybe we should implement it
sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
kernel BUG at /usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Unable to handle kernel NULL pointer dereferencemelim[3468]: Oops 8804682956800

Pid: 3468, comm: melim
EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.393.ent)
psr : 0000101008026018 ifs : 8000000000000e24 ip : [<e00000000446f260>] Not tainted
unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
rnat: 00000000000000bf bsps: 0000000000000fff pr : 8002924155aa9967
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
b0 : e00000000446f250 b6 : e0000000047f80e0 b7 : e0000000047f4fa0
f6 : 0fffbccccccccc8c00000 f7 : 0ffdcb640000000000000
f8 : 100029000000000000000 f9 : 10002a000000000000000
r1 : e000000004c9bd00 r2 : e0000000018a7e60 r3 : 000000000000416a
r8 : 0000000000000066 r9 : 0000000000000000 r10 : 0000000000000000
r11 : e0000000018a0000 r12 : e00000001b05ef50 r13 : e00000001b058000
r14 : 0000000000000001 r15 : 0000000000000000 r16 : e0000000018a7e48
r17 : 0000000000004000 r18 : 0000000000004000 r19 : e000000004b68580
r20 : e000000004abb8e8 r21 : e0000000047f4d60 r22 : 0000000000020000
r23 : e000000004b66d70 r24 : 0000000000000060 r25 : 0000000000000000
r26 : 0000000000000000 r27 : 00000000100000c0 r28 : 0000000000800000
r29 : 0000000000000001 r30 : e000000000025a00 r31 : e000000004b66d70

Call Trace: [<e0000000044155c0>] sp=0xe00000001b05eb60 bsp=0xe00000001b059460 show_stack [kernel] 0x80
[<e000000004430150>] sp=0xe00000001b05ed20 bsp=0xe00000001b059438 die [kernel] 0x1b0
[<e000000004451a70>] sp=0xe00000001b05ed20 bsp=0xe00000001b0593d8 ia64_do_page_fault [kernel] 0x310
[<e00000000440e680>] sp=0xe00000001b05edb0 bsp=0xe00000001b0593d8 ia64_leave_kernel [kernel] 0x0
[<e00000000446f260>] sp=0xe00000001b05ef50 bsp=0xe00000001b0592b8 elf_core_dump [kernel] 0x640
[<e00000000452cae0>] sp=0xe00000001b05fd80 bsp=0xe00000001b059260 do_coredump [kernel] 0x500
[<e0000000044a7810>] sp=0xe00000001b05fdd0 bsp=0xe00000001b0591e8 get_signal_to_deliver [kernel] 0x630
[<e00000000442e7f0>] sp=0xe00000001b05fdd0 bsp=0xe00000001b059180 ia64_do_signal [kernel] 0xd0
[<e00000000440eac0>] sp=0xe00000001b05fe50 bsp=0xe00000001b059130 handle_signal_delivery [kernel] 0x40
[<e00000000440e6f0>] sp=0xe00000001b05fe60 bsp=0xe00000001b059130 ia64_leave_kernel [kernel] 0x70

Aug 29 16:49:57 ltuii002 kernel: sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
Aug 29 16:49:57 ltuii002 kernel: kernel BUG at /usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Aug 29 16:49:57 ltuii002 kernel: Unable to handle kernel NULL pointer dereferencemelim[3468]: Oops 8804682956800
Aug 29 16:49:57 ltuii002 kernel:
Aug 29 16:49:57 ltuii002 kernel: Pid: 3468, comm: melim
Aug 29 16:49:57 ltuii002 kernel: EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.393.ent)
Aug 29 16:49:57 ltuii002 kernel: psr : 0000101008026018 ifs : 8000000000000e24 ip : [<e00000000446f260>] Not tainted
Aug 29 16:49:57 ltuii002 kernel: unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
Aug 29 16:49:57 ltuii002 kernel: rnat: 00000000000000bf bsps: 0000000000000fff pr : 8002924155aa9967
Aug 29 16:49:57 ltuii002 kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
Aug 29 16:49:57 ltuii002 kernel: b0 : e00000000446f250 b6 : e0000000047f80e0 b7 : e0000000047f4fa0
Aug 29 16:49:57 ltuii002 kernel: f6 : 0fffbccccccccc8c00000 f7 : 0ffdcb640000000000000
Aug 29 16:49:57 ltuii002 kernel: f8 : 100029000000000000000 f9 : 10002a000000000000000
Aug 29 16:49:57 ltuii002 kernel: r1 : e000000004c9bd00 r2 : e0000000018a7e60 r3 : 000000000000416a
Aug 29 16:49:57 ltuii002 kernel: r8 : 0000000000000066 r9 : 0000000000000000 r10 : 0000000000000000
Aug 29 16:49:57 ltuii002 kernel: r11 : e0000000018a0000 r12 : e00000001b05ef50 r13 : e00000001b058000
Aug 29 16:49:57 ltuii002 kernel: r14 : 0000000000000001 r15 : 0000000000000000 r16 : e0000000018a7e48
Aug 29 16:49:58 ltuii002 kernel: r17 : 0000000000004000 r18 : 0000000000004000 r19 : e000000004b68580
Aug 29 16:49:58 ltuii002 kernel: r20 : e000000004abb8e8 r21 : e0000000047f4d60 r22 : 0000000000020000
Aug 29 16:49:58 ltuii002 kernel: r23 : e000000004b66d70 r24 : 0000000000000060 r25 : 0000000000000000
Aug 29 16:49:58 ltuii002 kernel: r26 : 0000000000000000 r27 : 00000000100000c0 r28 : 0000000000800000
Aug 29 16:49:58 ltuii002 kernel: r29 : 0000000000000001 r30 : e000000000025a00 r31 : e000000004b66d70
Aug 29 16:49:58 ltuii002 kernel:
Aug 29 16:49:58 ltuii002 kernel: Call Trace: [<e0000000044155c0>] sp=0xe00000001b05eb60 bsp=0xe00000001b059460 show_stack [kernel] 0x80
Aug 29 16:49:58 ltuii002 kernel: [<e000000004430150>] sp=0xe00000001b05ed20 bsp=0xe00000001b059438 die [kernel] 0x1b0
Aug 29 16:49:58 ltuii002 kernel: [<e000000004451a70>] sp=0xe00000001b05ed20 bsp=0xe00000001b0593d8 ia64_do_page_fault [kernel] 0x310
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440e680>] sp=0xe00000001b05edb0 bsp=0xe00000001b0593d8 ia64_leave_kernel [kernel] 0x0
Aug 29 16:49:58 ltuii002 kernel: [<e00000000446f260>] sp=0xe00000001b05ef50 bsp=0xe00000001b0592b8 elf_core_dump [kernel] 0x640
Aug 29 16:49:58 ltuii002 kernel: [<e00000000452cae0>] sp=0xe00000001b05fd80 bsp=0xe00000001b059260 do_coredump [kernel] 0x500
Aug 29 16:49:58 ltuii002 kernel: [<e0000000044a7810>] sp=0xe00000001b05fdd0 bsp=0xe00000001b0591e8 get_signal_to_deliver [kernel] 0x630
Aug 29 16:49:58 ltuii002 kernel: [<e00000000442e7f0>] sp=0xe00000001b05fdd0 bsp=0xe00000001b059180 ia64_do_signal [kernel] 0xd0
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440eac0>] sp=0xe00000001b05fe50 bsp=0xe00000001b059130 handle_signal_delivery [kernel] 0x40
Aug 29 16:49:58 ltuii002 kernel: [<e00000000440e6f0>] sp=0xe00000001b05fe60 bsp=0xe00000001b059130 ia64_leave_kernel [kernel] 0x70
<4>IA32 syscall #252 issued, maybe we should implement it
Aug 29 16:50:10 ltuii002 kernel: <4>IA32 syscall #252 issued, maybe we should implement it

Could it have something to do with the 32-bit nanosleep implementation? I saw that call once in gdb just before the machine went down with the .411 kernel:

Program received signal SIGTERM, Terminated.
0x400165ce in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) s
Single stepping until exit from function _dl_sysinfo_int80, which has no line number information.
(now kill <pid>)
Program received signal SIGSEGV, Segmentation fault.
0x400eda8e in nanosleep () from /lib/tls/libc.so.6
(gdb) s
Single stepping until exit from function nanosleep, which has no line number information.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) s

BTW, it is not possible to strace the process termination on the .393 kernel. The only thing I get is:

select(0x1, 0xbffff828, 0, 0xbffff7a8, 0xbffff79c) = -514
--- SIGTERM (Terminated) @ 400165ce (bfa) ---
Process 3450 detached
Here is how to reproduce the problem (kernel messages as reported, but without a kernel crash; in my opinion this should be sufficient to locate the issue):

Write a trivial program that immediately dumps core, e.g.:

main() { *((char *) 2) = 5; }

Compile it on an x86 machine (e.g. Xeon) as an i386 executable, then start it on an Itanium. It is important that the coredumpsize resource is not set to 0, so first set it to unlimited (e.g. for csh: limit coredumpsize unlimited; for sh: ulimit -c unlimited). The following messages immediately appear in the syslog:

Sep 1 12:58:55 ltuii002 kernel: sizeof(elf_gregset_t) (1024) != sizeof(struct pt_regs) (400)
Sep 1 12:58:55 ltuii002 kernel: kernel BUG at /usr/src/build/293850-ia64/BUILD/kernel-2.4.21/linux-2.4.21/include/linux/elfcore.h:94!
Sep 1 12:58:55 ltuii002 kernel: Unable to handle kernel NULL pointer dereferences[22128]: Oops 8804682956800
Sep 1 12:58:55 ltuii002 kernel:
Sep 1 12:58:55 ltuii002 kernel: Pid: 22128, comm: s
Sep 1 12:58:55 ltuii002 kernel: EIP is at elf_core_dump [kernel] 0x640 (2.4.21-1.1931.2.393.ent)
Sep 1 12:58:55 ltuii002 kernel: psr : 0000101008026038 ifs : 8000000000000e24 ip : [<e00000000446f260>] Not tainted
Sep 1 12:58:55 ltuii002 kernel: unat: 0000000000000000 pfs : 0000000000000e24 rsc : 0000000000000003
Sep 1 12:58:55 ltuii002 kernel: rnat: 00000000000000bf bsps: 0000000000000fff pr : 8002924155aa9967
Sep 1 12:58:55 ltuii002 kernel: ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
Sep 1 12:58:55 ltuii002 kernel: b0 : e00000000446f250 b6 : e0000000044bc760 b7 : e0000000047f4fa0
Sep 1 12:58:55 ltuii002 kernel: f6 : 0fffbccccccccc8c00000 f7 : 0ffdcb640000000000000
Sep 1 12:58:55 ltuii002 kernel: f8 : 100029000000000000000 f9 : 10002a000000000000000
Sep 1 12:58:55 ltuii002 kernel: r1 : e000000004c9bd00 r2 : e00000003ed57e60 r3 : 000000000001f584
Sep 1 12:58:55 ltuii002 kernel: r8 : 0000000000000066 r9 : 0000000000000000 r10 : 0000000000000000
Sep 1 12:58:55 ltuii002 kernel: r11 : e00000003ed50000 r12 : e00000000d5fef50 r13 : e00000000d5f8000
Sep 1 12:58:55 ltuii002 kernel: r14 : 0000000000000001 r15 : 0000000000000000 r16 : e00000003ed57e48
Sep 1 12:58:55 ltuii002 kernel: r17 : 0000000000004000 r18 : 0000000000004000 r19 : e000000004b68580
Sep 1 12:58:55 ltuii002 kernel: r20 : e000000004abb8e8 r21 : e0000000047f4d60 r22 : 0000000000020000
Sep 1 12:58:55 ltuii002 kernel: r23 : e000000004b66d70 r24 : 0000000000000060 r25 : 0000000000000000
Sep 1 12:58:55 ltuii002 kernel: r26 : 0000000000000000 r27 : 00000000100000c0 r28 : 0000000000800000
Sep 1 12:58:55 ltuii002 kernel: r29 : 0000000000000001 r30 : e000000000025a00 r31 : e000000004b66d70
Sep 1 12:58:55 ltuii002 kernel:
Sep 1 12:58:55 ltuii002 kernel: Call Trace: [<e0000000044155c0>] sp=0xe00000000d5feb60 bsp=0xe00000000d5f9460 show_stack [kernel] 0x80
Sep 1 12:58:55 ltuii002 kernel: [<e000000004430150>] sp=0xe00000000d5fed20 bsp=0xe00000000d5f9438 die [kernel] 0x1b0
Sep 1 12:58:55 ltuii002 kernel: [<e000000004451a70>] sp=0xe00000000d5fed20 bsp=0xe00000000d5f93d8 ia64_do_page_fault [kernel] 0x310
Sep 1 12:58:55 ltuii002 kernel: [<e00000000440e680>] sp=0xe00000000d5fedb0 bsp=0xe00000000d5f93d8 ia64_leave_kernel [kernel] 0x0
Sep 1 12:58:55 ltuii002 kernel: [<e00000000446f260>] sp=0xe00000000d5fef50 bsp=0xe00000000d5f92b8 elf_core_dump [kernel] 0x640
Sep 1 12:58:55 ltuii002 kernel: [<e00000000452cae0>] sp=0xe00000000d5ffd80 bsp=0xe00000000d5f9260 do_coredump [kernel] 0x500
Sep 1 12:58:55 ltuii002 kernel: [<e0000000044a7810>] sp=0xe00000000d5ffdd0 bsp=0xe00000000d5f91e8 get_signal_to_deliver [kernel] 0x630
Sep 1 12:58:55 ltuii002 kernel: [<e00000000442e7f0>] sp=0xe00000000d5ffdd0 bsp=0xe00000000d5f9180 ia64_do_signal [kernel] 0xd0
Sep 1 12:58:55 ltuii002 kernel: [<e00000000440eac0>] sp=0xe00000000d5ffe50 bsp=0xe00000000d5f9130 handle_signal_delivery [kernel] 0x40
Sep 1 12:58:55 ltuii002 kernel: [<e00000000440e6f0>] sp=0xe00000000d5ffe60 bsp=0xe00000000d5f9130 ia64_leave_kernel [kernel] 0x70
Maybe I'm wrong, but as far as I can see from the code, ia32 core dumps are not really supported under Itanium Linux. So it would probably be better to hardcode coredumpsize = 0 for now?
Created attachment 97280 [details] /var/log/messages snippet when doing I/O testing on external disks

The system seems to work OK, but I am perplexed by these messages clogging up the system logfile.
This has long since been fixed. Please update the kernel. Closing.