Bug 199138

Summary: kernel panic during install bootup from create_gate_table
Product: [Fedora] Fedora Reporter: Doug Chapman <dchapman>
Component: gcc Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE QA Contact: Brian Brock <bbrock>
Severity: urgent Docs Contact:
Priority: urgent    
Version: rawhide CC: jakub, prarit, wtogami
Target Milestone: --- Keywords: Regression
Target Release: ---   
Hardware: ia64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-08-31 18:57:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 163350, 199595, 199634    

Description Doug Chapman 2006-07-17 14:31:58 UTC
Description of problem:
Booting the kernel for installation on my HP Integrity servers I get the
following panic.  Looks like something in create_gate_table.  I will try
installing an older rev and upgrade to this kernel to see if I get the same thing.


IP route cache hash table entries: 1048576 (order: 9, 8388608 bytes)
TCP established hash table entries: 4194304 (order: 13, 134217728 bytes)
TCP bind hash table entries: 65536 (order: 7, 2097152 bytes)
TCP: Hash tables configured (established 4194304 bind 65536)
TCP reno registered
perfmon: version 2.0 IRQ 238
perfmon: Itanium 2 PMU detected, 16 PMCs, 18 PMDs, 4 counters (47 bits)
kernel unaligned access to 0xa000000000000634, ip=0xa000000100039eb0
Unable to handle kernel paging request at virtual address a010000600002682
swapper[1]: Oops 8813272891392 [1]
Modules linked in:

Pid: 1, CPU 0, comm:              swapper
psr : 00001010085a6010 ifs : 8000000000000590 ip  : [<a0000001007082f0>]    Not tainted
ip is at create_gate_table+0x150/0x380
unat: 0000000000000000 pfs : 0000000000000590 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000009541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001007082b0 b6  : a0000001007081a0 b7  : a00000010021a5c0
f6  : 1003e0000000000000040 f7  : 0ffdd8000000000000000
f8  : 10004ffffd27000000000 f9  : 10005b000000000000000
f10 : 1003e0000000000000038 f11 : 1003e0000000000000118
r1  : a000000100b6e160 r2  : 0000000000000000 r3  : e0000040fe569004
r8  : e0000040fd5ed900 r9  : a0000001009725e8 r10 : 0000000000000001
r11 : a000000100972610 r12 : e0000040fe56fd30 r13 : e0000040fe568000
r14 : 0000000000004000 r15 : a000000000000638 r16 : 00000000ffffffff
r17 : a000000000000000 r18 : 00000000000000b0 r19 : 0000000000000090
r20 : 0000000000000012 r21 : 0001000000000012 r22 : a010000600002682
r23 : 0010000600002682 r24 : a000000000000610 r25 : 000000000000007c
r26 : e0000040fd5ed910 r27 : e0000040fd5ed908 r28 : a00000010093ddf0
r29 : 0000000000000000 r30 : a00000010093ddf8 r31 : e0000040fe569004

Call Trace:
 [<a000000100013da0>] show_stack+0x40/0xa0
                                sp=e0000040fe56f8c0 bsp=e0000040fe569220
 [<a0000001000146a0>] show_regs+0x840/0x880
                                sp=e0000040fe56fa90 bsp=e0000040fe5691c0
 [<a0000001000335c0>] die+0x1c0/0x2c0
                                sp=e0000040fe56fa90 bsp=e0000040fe569178
 [<a0000001005edf20>] ia64_do_page_fault+0x8e0/0xa20
                                sp=e0000040fe56fab0 bsp=e0000040fe569128
 [<a00000010000c6e0>] ia64_leave_kernel+0x0/0x280
                                sp=e0000040fe56fb60 bsp=e0000040fe569128
 [<a0000001007082f0>] create_gate_table+0x150/0x380
                                sp=e0000040fe56fd30 bsp=e0000040fe5690a8
 [<a000000100009ab0>] init+0x4f0/0x900
                                sp=e0000040fe56fd30 bsp=e0000040fe569078
 [<a000000100012310>] kernel_thread_helper+0x30/0x60
                                sp=e0000040fe56fe30 bsp=e0000040fe569050
 [<a0000001000090c0>] start_kernel_thread+0x20/0x40
                                sp=e0000040fe56fe30 bsp=e0000040fe569050
 <0>Kernel panic - not syncing: Attempted to kill init!


Version-Release number of selected component (if applicable):
kernel-2.6.17-1.2396.fc6
rawhide-20060714


How reproducible:
Only tried 1 time so far, will try other hosts.


Steps to Reproduce:
1. boot the install kernel on ia64
  
Actual results:


Expected results:


Additional info:

Comment 1 Doug Chapman 2006-07-17 14:36:30 UTC
Same results on kernel-2.6.17-1.2391.fc6


Comment 2 Doug Chapman 2006-07-17 16:05:24 UTC
Same on 2.6.17-1.2405.fc6 from the rawhide-20060717 tree as well.


Comment 3 Doug Chapman 2006-07-17 19:54:21 UTC
I discovered that if I build the kernel from source on a RHEL4.4 system I can
boot it there.  If I install the kernel rpm binary on that RHEL4.4 system I see
the same panic as above.  So it appears to be related to the build environment
somehow.


Comment 4 Prarit Bhargava 2006-07-20 13:58:54 UTC
This appears to be a problem introduced with gcc-4.1.1.7.  If I boot a kernel
that was built with 4.1.1.6 (last kernel built was 2372) I do not see a panic. 
If I boot a kernel built with 4.1.1.7 (first kernel built was 2391) I get an oops.



Comment 5 Prarit Bhargava 2006-07-20 17:32:00 UTC
I installed FC5 "unofficial" ia64 and built a kernel using gcc 4.1.1.
I then yum updated all the packages on the system to rawhide latest and installed
the 4.1.1-built kernel to get a rawhide latest box.

I compiled the kernel using 4.1.1.8 (which is the latest gcc) and the kernel
panics as above.

Doug is looking closely at the panic, while I'm searching through gcc to see
if we can narrow down the problem.

P.

Comment 6 Doug Chapman 2006-07-20 19:39:44 UTC
I have some more details from looking at this from the kernel side.  The reason
for the panic is -

at arch/ia64/kernel/unwind.c:2179

end = (struct unw_table_entry *) ((char *) start + punw->p_memsz);
                                                   ^^^^^^^^^^^^^

The value of punw->p_memsz is wrong.  With the recent compilers it is 0x7c,
while with either an older FC6 or a RHEL4 compiler it is always 0x48.  Note
that the address where this lives is based on some constants, and I have verified
that punw as well as &punw->p_memsz are the same regardless of the compiler
version, so it appears we are looking in the right location.

So now I need to determine where the value for punw->p_memsz gets initialized.
It appears that either it is being initialized wrong or something is overwriting it.
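
A quick way to see why 0x7c is suspicious is to check it against the size of
one unwind table entry.  The sketch below assumes the ia64 layout of three
64-bit offsets per entry (24 bytes); the struct is restated here for
illustration and the helper names unw_entry_count and unw_trailing_bytes are
hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one ia64 unwind table entry: three 64-bit offsets. */
struct unw_table_entry {
    uint64_t start_offset;
    uint64_t end_offset;
    uint64_t info_offset;
};

static_assert(sizeof(struct unw_table_entry) == 24,
              "entry is expected to be three 64-bit words");

/* Hypothetical helper: whole entries covered by a segment of memsz bytes. */
static size_t unw_entry_count(uint64_t memsz)
{
    return (size_t)(memsz / sizeof(struct unw_table_entry));
}

/* Hypothetical helper: bytes left over after the last whole entry.
 * A nonzero result means memsz cannot describe a table of whole entries. */
static size_t unw_trailing_bytes(uint64_t memsz)
{
    return (size_t)(memsz % sizeof(struct unw_table_entry));
}
```

Under that assumption, 0x48 (72 bytes) is exactly three entries, while 0x7c
(124 bytes) is five entries plus four stray bytes, so an end pointer computed
as start + p_memsz would not land on an entry boundary.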



Comment 7 Doug Chapman 2006-07-20 20:40:27 UTC
Another bit of useful info.  I get the same panic if I compile 2.6.17 without
any of the redhat patches.  We should discuss this with ia64-list.


Comment 8 Prarit Bhargava 2006-07-20 21:04:39 UTC
Er ... when you're compiling you're using the RH gcc?  What happens if you
compile 2.6.17 + no RH patches + "trunk" gcc?

Still not settled that it is a kernel issue ;) 

P.

Comment 9 Doug Chapman 2006-07-20 22:39:51 UTC
punw->p_memsz comes from the unwind info for the ELF header, this gets plugged
into the kernel via a linker script: arch/ia64/kernel/gate.lds.S

the linker calls this "structure" .IA_64.unwind_info.  I assume this is
generated by the compiler but it might be the assembler or even the linker
itself.  If we find the code that generates this then I bet we have our culprit.

By using kdb I was able to determine that the rest of the structure pointed to
by punw (which is of type Elf64_Phdr) looks good except for p_memsz and
p_filesz.  Both are the same (incorrect) value.
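
For comparing the two sizes outside of kdb, a minimal sketch of locating the
unwind program header in an in-memory ELF64 image might look like the
following.  It assumes a Linux <elf.h>; the function name find_unwind_phdr is
hypothetical, and the fallback definition of PT_IA_64_UNWIND as PT_LOPROC + 1
is only used if the header does not already provide it.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

#ifndef PT_IA_64_UNWIND
#define PT_IA_64_UNWIND (PT_LOPROC + 1)   /* fallback, assumed value */
#endif

/* Hypothetical helper: return the first PT_IA_64_UNWIND program header of
 * an in-memory, suitably aligned ELF64 image, or NULL if there is none. */
static const Elf64_Phdr *find_unwind_phdr(const void *image)
{
    const Elf64_Ehdr *ehdr = image;
    const char *base = image;

    assert(image != NULL);

    /* Reject anything that does not start with the ELF magic bytes. */
    if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
        return NULL;

    /* Walk the program header table using the sizes the header declares. */
    for (size_t i = 0; i < ehdr->e_phnum; i++) {
        const Elf64_Phdr *ph = (const Elf64_Phdr *)
            (base + ehdr->e_phoff + i * ehdr->e_phentsize);
        if (ph->p_type == PT_IA_64_UNWIND)
            return ph;
    }
    return NULL;
}
```

With the returned header in hand, p_filesz and p_memsz can be printed for a
gate DSO image built by each compiler and compared directly.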



Comment 10 Prarit Bhargava 2006-07-20 22:46:05 UTC
I've left both the linker and the assembler as constants during the tests.  So
I'm leaning toward gcc for now ...

Still testing ...

P.

Comment 11 Prarit Bhargava 2006-07-21 12:35:41 UTC
I ran a few tests:

I built a kernel with the 20060711 upstream RH version of gcc and the kernel
boots without any issues.

I built a kernel with the 20060711 RH RPM version of gcc and the kernel does not
boot.

I can flip between gcc's on my system by setting an alias for one or another. 
By switching between gcc's I can generate kernels that do boot and kernels that
do not.

I also tried building gcc from the RPM sources using the configure options
reported by gcc -v:

[root@altix3 ~]# /usr/bin/gcc -v
Using built-in specs.
Target: ia64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
--disable-dssi --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
--host=ia64-redhat-linux
Thread model: posix
gcc version 4.1.1 20060711 (Red Hat 4.1.1-7)

I used this self-built RPM version to build a kernel and the resultant kernel
booted without any issues.

It clearly looks like gcc is the culprit, or at least some mismatch of gcc and
libraries.  Jakub, any ideas on what else to try?

P.

Comment 12 Doug Chapman 2006-08-31 18:57:34 UTC
This problem has since been resolved and verified.