Description of problem:
When booting the install kernel on my HP Integrity servers, I get the following panic. It looks like something in create_gate_table. I will try installing an older rev and upgrading to this kernel to see if I get the same thing.

IP route cache hash table entries: 1048576 (order: 9, 8388608 bytes)
TCP established hash table entries: 4194304 (order: 13, 134217728 bytes)
TCP bind hash table entries: 65536 (order: 7, 2097152 bytes)
TCP: Hash tables configured (established 4194304 bind 65536)
TCP reno registered
perfmon: version 2.0 IRQ 238
perfmon: Itanium 2 PMU detected, 16 PMCs, 18 PMDs, 4 counters (47 bits)
kernel unaligned access to 0xa000000000000634, ip=0xa000000100039eb0
Unable to handle kernel paging request at virtual address a010000600002682
swapper[1]: Oops 8813272891392 [1]
Modules linked in:

Pid: 1, CPU 0, comm: swapper
psr : 00001010085a6010 ifs : 8000000000000590 ip : [<a0000001007082f0>] Not tainted
ip is at create_gate_table+0x150/0x380
unat: 0000000000000000 pfs : 0000000000000590 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000009541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001007082b0 b6 : a0000001007081a0 b7 : a00000010021a5c0
f6 : 1003e0000000000000040 f7 : 0ffdd8000000000000000
f8 : 10004ffffd27000000000 f9 : 10005b000000000000000
f10 : 1003e0000000000000038 f11 : 1003e0000000000000118
r1 : a000000100b6e160 r2 : 0000000000000000 r3 : e0000040fe569004
r8 : e0000040fd5ed900 r9 : a0000001009725e8 r10 : 0000000000000001
r11 : a000000100972610 r12 : e0000040fe56fd30 r13 : e0000040fe568000
r14 : 0000000000004000 r15 : a000000000000638 r16 : 00000000ffffffff
r17 : a000000000000000 r18 : 00000000000000b0 r19 : 0000000000000090
r20 : 0000000000000012 r21 : 0001000000000012 r22 : a010000600002682
r23 : 0010000600002682 r24 : a000000000000610 r25 : 000000000000007c
r26 : e0000040fd5ed910 r27 : e0000040fd5ed908 r28 : a00000010093ddf0
r29 : 0000000000000000 r30 : a00000010093ddf8 r31 : e0000040fe569004

Call Trace:
[<a000000100013da0>] show_stack+0x40/0xa0 sp=e0000040fe56f8c0 bsp=e0000040fe569220
[<a0000001000146a0>] show_regs+0x840/0x880 sp=e0000040fe56fa90 bsp=e0000040fe5691c0
[<a0000001000335c0>] die+0x1c0/0x2c0 sp=e0000040fe56fa90 bsp=e0000040fe569178
[<a0000001005edf20>] ia64_do_page_fault+0x8e0/0xa20 sp=e0000040fe56fab0 bsp=e0000040fe569128
[<a00000010000c6e0>] ia64_leave_kernel+0x0/0x280 sp=e0000040fe56fb60 bsp=e0000040fe569128
[<a0000001007082f0>] create_gate_table+0x150/0x380 sp=e0000040fe56fd30 bsp=e0000040fe5690a8
[<a000000100009ab0>] init+0x4f0/0x900 sp=e0000040fe56fd30 bsp=e0000040fe569078
[<a000000100012310>] kernel_thread_helper+0x30/0x60 sp=e0000040fe56fe30 bsp=e0000040fe569050
[<a0000001000090c0>] start_kernel_thread+0x20/0x40 sp=e0000040fe56fe30 bsp=e0000040fe569050
<0>Kernel panic - not syncing: Attempted to kill init!

Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2396.fc6
rawhide-20060714

How reproducible:
Only tried 1 time so far, will try other hosts.

Steps to Reproduce:
1. boot the install kernel on ia64
2.
3.

Actual results:

Expected results:

Additional info:
Same results on kernel-2.6.17-1.2391.fc6
Same on 2.6.17-1.2405.fc6 from the rawhide-20060717 tree as well.
I discovered that if I build the kernel from source on a RHEL4.4 system, I can boot it there. If I install the binary kernel rpm on that same RHEL4.4 system, I see the same panic as above. So it appears to be related to the build environment somehow.
This appears to be a problem introduced with gcc-4.1.1.7. If I boot a kernel that was built with 4.1.1.6 (last kernel built was 2372) I do not see a panic. If I boot a kernel built with 4.1.1.7 (first kernel built was 2391) I get an oops.
I installed FC5 "unofficial" ia64 and built a kernel using gcc 4.1.1. I then used yum to update all the packages on the system to the latest rawhide and installed the kernel built with 4.1.1, to get a box running the latest rawhide. I then compiled the kernel using 4.1.1.8 (which is the latest gcc) and that kernel panics as above. Doug is looking closely at the panic, while I'm searching through gcc to see if we can narrow down the problem. P.
I have some more details from looking at this from the kernel side. The reason for the panic is at arch/ia64/kernel/unwind.c:2179

    end = (struct unw_table_entry *) ((char *) start + punw->p_memsz);
                                                       ^^^^^^^^^^^^^

The value of punw->p_memsz is wrong. With the recent compilers it is 0x7c, while with either an older FC6 or a RHEL4 compiler it is always 0x48. Note that the address where this lives is based on some constants, and I have verified that punw as well as &punw->p_memsz is the same regardless of the compiler version, so it appears we are looking in the right location. So now I need to determine where the value for punw->p_memsz gets initialized. It appears that either it is being initialized wrong or something is overwriting it.
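As a quick sanity check on the two sizes quoted above, here is a minimal sketch of the arithmetic. It assumes that one struct unw_table_entry is three 64-bit fields (24 bytes), as in the ia64 unwind headers; the helper name unwind_entry_count is made up for illustration and is not kernel code.

```python
# Assumption: one ia64 struct unw_table_entry holds three 64-bit fields
# (start_offset, end_offset, info_offset), i.e. 24 bytes.
UNW_TABLE_ENTRY_SIZE = 3 * 8


def unwind_entry_count(p_memsz):
    """Hypothetical helper: split a segment size into whole entries and leftover bytes."""
    return divmod(p_memsz, UNW_TABLE_ENTRY_SIZE)


if __name__ == "__main__":
    for label, size in (("older compiler", 0x48), ("recent compiler", 0x7c)):
        entries, leftover = unwind_entry_count(size)
        print(f"{label}: p_memsz={size:#x} -> {entries} entries, {leftover} bytes left over")
```

Under that assumption, 0x48 is exactly three entries while 0x7c is not a whole number of entries. That is consistent with the size being wrong, though it says nothing about where the bad value comes from.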
Another bit of useful info: I get the same panic if I compile 2.6.17 without any of the Red Hat patches. We should discuss this with ia64-list.
Er ... when you're compiling you're using the RH gcc? What happens if you compile 2.6.17 + no RH patches + "trunk" gcc? Still not settled that it is a kernel issue ;) P.
punw->p_memsz comes from the unwind info for the ELF header. This gets plugged into the kernel via a linker script, arch/ia64/kernel/gate.lds.S; the linker calls this "structure" .IA_64.unwind_info. I assume this is generated by the compiler, but it might be the assembler or even the linker itself. If we find the code that generates this then I bet we have our culprit. By using kdb I was able to determine that the rest of the structure pointed to by punw (which is of type Elf64_Phdr) looks good except for p_memsz and p_filesz. Both have the same (incorrect) value.
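To compare those two fields between a good and a bad build without booting, something like the following could be run against the linked gate image. This is only a sketch: it assumes a little-endian ELF64 file and the standard Elf64_Ehdr/Elf64_Phdr layouts, with PT_IA_64_UNWIND taken as PT_LOPROC + 1; the function name unwind_phdrs is made up for illustration.

```python
import struct

PT_IA_64_UNWIND = 0x70000001  # PT_LOPROC + 1 in the ia64 ELF ABI
PHDR_FMT = "<IIQQQQQQ"        # Elf64_Phdr, little-endian, 56 bytes
PHDR_FIELDS = ("p_type", "p_flags", "p_offset", "p_vaddr",
               "p_paddr", "p_filesz", "p_memsz", "p_align")


def unwind_phdrs(image):
    """Hypothetical helper: return the PT_IA_64_UNWIND program headers of an ELF64 image."""
    # e_ident: magic, then EI_CLASS == 2 (64-bit) and EI_DATA == 1 (little-endian).
    if image[:4] != b"\x7fELF" or image[4] != 2 or image[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    (e_phoff,) = struct.unpack_from("<Q", image, 32)             # Elf64_Ehdr.e_phoff
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)  # e_phentsize, e_phnum
    found = []
    for index in range(e_phnum):
        values = struct.unpack_from(PHDR_FMT, image, e_phoff + index * e_phentsize)
        phdr = dict(zip(PHDR_FIELDS, values))
        if phdr["p_type"] == PT_IA_64_UNWIND:
            found.append(phdr)
    return found
```

Reading the file with open(path, "rb").read() and printing p_filesz and p_memsz for each returned header should show 0x48 for a good build if the analysis above holds. readelf -l on the same file reports the same program header fields.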
I've left both the linker and the assembler as constants during the tests. So I'm leaning toward gcc for now ... Still testing ... P.
I ran a few tests:

I built a kernel with the 20060711 upstream RH version of gcc and the kernel boots without any issues.

I built a kernel with the 20060711 RH RPM version of gcc and the kernel does not boot.

I can flip between gccs on my system by setting an alias for one or the other. By switching between gccs I can generate kernels that do boot and kernels that do not.

I also tried building gcc from the RPM sources using the configure options reported by

[root@altix3 ~]# /usr/bin/gcc -v
Using built-in specs.
Target: ia64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --host=ia64-redhat-linux
Thread model: posix
gcc version 4.1.1 20060711 (Red Hat 4.1.1-7)

I used this self-built RPM version to build a kernel, and the resulting kernel booted without any issues.

It clearly looks like gcc is the culprit, or at least some mismatch of gcc and libraries.

Jakub, any ideas on what else to try?

P.
This problem has since been resolved and verified.