Bug 176612 - xw6400 System panic while installing RHEL4-U3
Summary: xw6400 System panic while installing RHEL4-U3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jim Paradis
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 172741 181409 185624
 
Reported: 2005-12-27 16:11 UTC by Jeff Burke
Modified: 2013-08-06 01:17 UTC

Fixed In Version: RHSA-2006-0575
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:48:03 UTC
Target Upstream Version:
Embargoed:


Attachments
Console output of bootup just before the kernel panic (7.50 KB, text/plain)
2005-12-27 16:13 UTC, Jeff Burke
xw6400 BIOS update (1.07 MB, application/octet-stream)
2006-03-10 20:30 UTC, Chris Williams
xw8400 BIOS update (1.07 MB, application/octet-stream)
2006-03-10 20:31 UTC, Chris Williams
HP xw8400 system BIOS update package, version 0.28 (1.07 MB, application/octet-stream)
2006-03-23 23:18 UTC, Jeff Burrell


Links
System: Red Hat Product Errata
ID: RHSA-2006:0575
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 4
Last Updated: 2006-08-10 04:00:00 UTC

Description Jeff Burke 2005-12-27 16:11:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
While trying to do a PXE install of RHEL4-U3 re1221.0, the xw6400 system panics.

Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
 <0>Kernel panic - not syncing: Oops
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8014ff5e>{__queue_work+14}


Version-Release number of selected component (if applicable):
kernel-2.6.9-27.EL

How reproducible:
Always

Steps to Reproduce:
1. On the xw6400 hardware, try to install RHEL4-U3-re1221.0 AS x86_64

  

Actual Results:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at apic:305
invalid operand: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8054596d>] <ffffffff8054596d>{setup_local_APIC+27}
RSP: 0000:0000010002173ef8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000050014 RCX: ffffffff8042d670
RDX: 0000000000000000 RSI: ffffffff8042d670 RDI: ffffffff8036bdee
RBP: 0000000000000014 R08: 000000013ffcd9c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: 0000000000000004 0000000004040800 0000000000000000 ffffffff80545eb2
       0000000000000001 0000000000000000 000000013ffcd9c0 ffffffff8010c4aa
       ffffffff8010c3d4 0000000004040800
Call Trace:<ffffffff80545eb2>{APIC_init_uniprocessor+133} <ffffffff8010c4aa>{init+214}
       <ffffffff8010c3d4>{init+0} <ffffffff801114bf>{child_rip+8}
       <ffffffff8010c3d4>{init+0} <ffffffff801114b7>{child_rip+0}


Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
 <0>Kernel panic - not syncing: Oops
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8014ff5e>{__queue_work+14}
PML4 0
Oops: 0000 [2]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8014ff5e>] <ffffffff8014ff5e>{__queue_work+14}
RSP: 0000:ffffffff804a4358  EFLAGS: 00010046
RAX: ffffffff8044c188 RBX: 0000000000000000 RCX: ffffffff804f1be0
RDX: 0000000000000000 RSI: ffffffff8044c180 RDI: 0000000000000000
RBP: ffffffff804a4398 R08: ffffffff804a4398 R09: 0000000000000246
R10: 0000000000000246 R11: 0000010002173bc8 R12: 0000000000000062
R13: 0000010002173bc8 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: ffffffff804a4398 0000000000000246 ffffffff8042f560 ffffffff8015009d
       0000000000000246 ffffffff801446a1 0000000000000000 0000000000000246
       ffffffff804a4398 ffffffff804a4398
Call Trace:<IRQ> <ffffffff8015009d>{queue_work+41} <ffffffff801446a1>{run_timer_softirq+591}
       <ffffffff8013faec>{__do_softirq+76} <ffffffff8013fb73>{do_softirq+49}
       <ffffffff80113c57>{do_IRQ+664} <ffffffff80110fdb>{ret_from_intr+0}
        <EOI> <ffffffff8021a58a>{__delay+9} <ffffffff8013954a>{panic+535}
       <ffffffff8011214e>{oops_end+159} <ffffffff80112169>{oops_end+186}
       <ffffffff8011225a>{die+54} <ffffffff801125e2>{do_invalid_op+145}
       <ffffffff8054596d>{setup_local_APIC+27} <ffffffff80111309>{error_exit+0}
       <ffffffff8054596d>{setup_local_APIC+27} <ffffffff80545969>{setup_local_APIC+23}
       <ffffffff80545eb2>{APIC_init_uniprocessor+133} <ffffffff8010c4aa>{init+214}
       <ffffffff8010c3d4>{init+0} <ffffffff801114bf>{child_rip+8}
       <ffffffff8010c3d4>{init+0} <ffffffff801114b7>{child_rip+0}


Code: 48 81 3f 3c 4b 24 1d 48 89 f9 ba 52 00 00 00 0f 85 a2 00 00
RIP <ffffffff8014ff5e>{__queue_work+14} RSP <ffffffff804a4358>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Oops

IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23


Expected Results:  System should install without a kernel panic.


Additional info:

We have two xw6400 prototype workstations: one is in RH-Westford (Jeff Burke) and one is in RH-Raleigh (Chris Williams). These workstations are a smaller form factor version of the xw8400 (but are not the same). They are based on the Intel "Greencreek" MCH, ESB-2 southbridge and "Dempsey" processor.

Partner: Hewlett-Packard Co.
RH Partner manager: Ron Pacheco
System/component description: xw6400 prototype workstation - based on Intel "Greencreek" MCH and ESB2 southbridge

Component list:
- xw6400 base platform (motherboard, chassis, cables, power supply)
- (2) dual core Intel Dempsey processors
- (2) 512 MB FB-DIMM, for a total of 1 GB
- (1) 80 GB SATA-II disk
- (1) DVD-ROM drive
- (1) floppy drive
- (1) Nvidia NVS285 PCI-E graphics card 

FWIW: RHEL3-U7 installs and runs.

Comment 1 Jeff Burke 2005-12-27 16:13:25 UTC
Created attachment 122597
Console output of bootup just before the kernel panic

Comment 4 John W. Linville 2006-01-04 14:39:25 UTC
I smell a BIOS issue...does the box in question have the latest BIOS 
available? 
 
What previous RHEL or FC versions are known to install on this box? 

Comment 5 Jeff Burke 2006-01-04 15:00:02 UTC
According to HP it is at the latest and greatest.  As of right now I know that
RHEL3-U7 installs and runs.

I just received this system the Thursday before the holiday break. I was told it
is "New" hardware. I believe we only have two of these systems. One is with
Chris W in RDU and I have the other.

Not sure if this data is helpful right now, but here is the lspci and lspci -n
output from RHEL3-U7 on that system.
00:00.0 Host bridge: Intel Corporation Workstation Memory Controller Hub (rev 11)
00:02.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 2 (rev 11)
00:03.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 3 (rev 11)
00:04.0 PCI bridge: Intel Corporation Server PCI Express x16 Port 4-7 (rev 11)
00:05.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 5 (rev 11)
00:06.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 6 (rev 11)
00:07.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 7 (rev 11)
00:10.0 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:10.1 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:10.2 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:11.0 Host bridge: Intel Corporation Reserved Registers (rev 11)
00:13.0 Host bridge: Intel Corporation Reserved Registers (rev 11)
00:15.0 Host bridge: Intel Corporation Server FBD Registers (rev 11)
00:16.0 Host bridge: Intel Corporation Server FBD Registers (rev 11)
00:1b.0 Audio device: Intel Corporation Enterprise Southbridge High Definition Audio (rev 09)
00:1c.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #3 (rev 09)
00:1d.3 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #4 (rev 09)
00:1d.7 USB Controller: Intel Corporation Enterprise Southbridge EHCI USB (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation Enterprise Southbridge LPC (rev 09)
00:1f.1 IDE interface: Intel Corporation Enterprise Southbridge PATA (rev 09)
00:1f.2 IDE interface: Intel Corporation Enterprise Southbridge SATA cc=IDE (rev 09)
10:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Upstream Port (rev 01)
10:00.3 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express to PCI-X Bridge (rev 01)
1e:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E1 (rev 01)
1e:01.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E2 (rev 01)
1f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit Ethernet PCI Express (rev 01)
40:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)

00:00.0 Class 0600: 8086:25c0 (rev 11)
00:02.0 Class 0604: 8086:25e2 (rev 11)
00:03.0 Class 0604: 8086:25e3 (rev 11)
00:04.0 Class 0604: 8086:25fa (rev 11)
00:05.0 Class 0604: 8086:25e5 (rev 11)
00:06.0 Class 0604: 8086:25e6 (rev 11)
00:07.0 Class 0604: 8086:25e7 (rev 11)
00:10.0 Class 0600: 8086:25f0 (rev 11)
00:10.1 Class 0600: 8086:25f0 (rev 11)
00:10.2 Class 0600: 8086:25f0 (rev 11)
00:11.0 Class 0600: 8086:25f1 (rev 11)
00:13.0 Class 0600: 8086:25f3 (rev 11)
00:15.0 Class 0600: 8086:25f5 (rev 11)
00:16.0 Class 0600: 8086:25f6 (rev 11)
00:1b.0 Class 0403: 8086:269a (rev 09)
00:1c.0 Class 0604: 8086:2690 (rev 09)
00:1d.0 Class 0c03: 8086:2688 (rev 09)
00:1d.1 Class 0c03: 8086:2689 (rev 09)
00:1d.2 Class 0c03: 8086:268a (rev 09)
00:1d.3 Class 0c03: 8086:268b (rev 09)
00:1d.7 Class 0c03: 8086:268c (rev 09)
00:1e.0 Class 0604: 8086:244e (rev d9)
00:1f.0 Class 0601: 8086:2670 (rev 09)
00:1f.1 Class 0101: 8086:269e (rev 09)
00:1f.2 Class 0101: 8086:2680 (rev 09)
10:00.0 Class 0604: 8086:3500 (rev 01)
10:00.3 Class 0604: 8086:350c (rev 01)
1e:00.0 Class 0604: 8086:3510 (rev 01)
1e:01.0 Class 0604: 8086:3514 (rev 01)
1f:00.0 Class 0200: 14e4:1600 (rev 01)
40:00.0 Class 0300: 10de:0165 (rev a1)



Comment 6 John W. Linville 2006-01-04 15:30:29 UTC
Chris Williams reports problems w/ RHEL3 on the xw6400 box as well, unless you 
use "noapic".  That would seem to match the crash info from comment 1. 
 
Can you install RHEL4 if you use "noapic" on the kernel command line? 

Comment 7 Jeff Burke 2006-01-04 15:42:28 UTC
Strange that I can use RHEL3-U7 with no issues. What update is he using?

Here are the results of trying RHEL4 with the noapic option during boot.

Bootdata ok (command line is
initrd=redhat/rhel4-x86_64-AS/images/pxeboot/initrd.img
ks=http://qafiler.boston.redhat.com/Netboot/kickstarts/rhel4-x86_64-AS.ks
ramdisk_size=10000 BOOT_IMAGE=redhat/rhel4-x86_64-AS/images/pxeboot/vmlinuz
console=ttyS0,115200
console=tty0 noapic)
Linux version 2.6.9-27.EL (bhcompile.redhat.com) (gcc version
3.4.5 20051201 (Red Hat 3.4.5-2)) #1 Tue Dec 20 19:11:47 EST 2005
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ffde800 (usable)
 BIOS-e820: 000000003ffde800 - 0000000040000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000e004cc96 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
No mptable found.
DMI 2.4 present.
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)
Processor #4 15:6 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] enabled)
Processor #0 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Processor #2 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
Processor #6 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x01] enabled)
Processor #1 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x03] enabled)
Processor #3 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x05] enabled)
Processor #5 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled)
Processor #7 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
Setting APIC routing to flat
ACPI: Skipping IOAPIC probe due to 'noapic' option.
Using ACPI for processor (LAPIC) configuration information
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: HP       <6>Product ID: Workstation  <6>APIC at: 0xFEE00000
I/O APIC #1 Version 17 at 0xFEC00000.
I/O APIC #2 Version 17 at 0xFEC10000.
Processors: 1
Checking aperture...
Built 1 zonelists
Kernel command line: initrd=redhat/rhel4-x86_64-AS/images/pxeboot/initrd.img
ks=http://qafiler.boston.redhat.com/Netboot/kickstarts/rhel4-x86_64-AS.ks
ramdisk_size=10000 BOOT_IMAGE=redhat/rhel4-x86_64-AS/images/pxeboot/vmlinuz
console=ttyS0,115200 console=tty0 noapic
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 3200.270 MHz processor.
time.c: Using PIT/TSC based timekeeping.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 1021616k/1048440k available (2406k kernel code, 26152k reserved, 1306k
data, 164k init)
Calibrating delay using timer specific routine.. 6406.89 BogoMIPS (lpj=3203448)
Security Scaffold v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU:                    Genuine Intel(R) CPU 3.20GHz stepping 02
ACPI: IRQ9 SCI: Edge set to Level Trigger.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at apic:305
invalid operand: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8054596d>] <ffffffff8054596d>{setup_local_APIC+27}
RSP: 0000:0000010002173ef8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000050014 RCX: ffffffff8042d670
RDX: 0000000000000000 RSI: ffffffff8042d670 RDI: ffffffff8036bdee
RBP: 0000000000000014 R08: 0000000100000246 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: 0000000000000004 0000000004040800 0000000000000000 ffffffff80545eb2
       0000000000000001 0000000000000000 0000000100000246 ffffffff8010c4aa
       ffffffff8010c3d4 0000000004040800
Call Trace:<ffffffff80545eb2>{APIC_init_uniprocessor+133}
<ffffffff8010c4aa>{init+214}
       <ffffffff8010c3d4>{init+0} <ffffffff801114bf>{child_rip+8}
       <ffffffff8010c3d4>{init+0} <ffffffff801114b7>{child_rip+0}


Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
 <0>Kernel panic - not syncing: Oops


Comment 8 Chris Williams 2006-01-04 16:12:39 UTC
I just tried to install U3 x86_64 RHEL4-U3-re1221.0 on the 6400 I have and it
panicked, with the exact same message as comment #7.

It panics with and without noapic.

...
Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
 <0>Kernel panic - not syncing: Oops



Comment 9 John W. Linville 2006-01-04 18:29:55 UTC
Chris also reports that using an i686 distro works, so the problem would seem 
to relate to the x86_64 arch support. 

Comment 11 John W. Linville 2006-01-04 20:57:16 UTC
cww ok, I tested RHEL 3 x86_64 and it works fine so here is the breakdown
cww RHEL4 x86 = works ok
cww RHEL4 x86-64 = panics always
cww RHEL3 x86 = SMP needs noapic, UP is fine
cww RHEL3 x86_64 = works ok 

Comment 12 Jim Paradis 2006-01-12 23:49:05 UTC
For the x86_64 version: This crash happens when a CPU tries to set up an APIC in
flat mode whose ID does not match the APIC ID of any CPU on the system.  I
notice that the first CPU seen when we come up has APIC ID 4, not 0, and it's
the only CPU in use.  I'm wondering if this is another CPU-enumeration issue. 
Continuing to investigate.


Comment 15 Kim Jensen 2006-01-24 18:04:39 UTC
The RHEL4U3 x86_64 panic described in comment #7 occurs on xw6400 AND xw8400 
(with the latest 0.22 BIOS) when two processor packages are present.  If you 
remove one processor package the panic goes away.  

Any idea why this panic is occurring?

Comment 16 Kim Jensen 2006-01-24 21:25:04 UTC
In reading through the smpboot.c code, I believe we are getting to
     if (!smp_found_config && !acpi_lapic)
      ....
      APIC_init_uniprocessor

In order for !acpi_lapic to be true, it looks as though the parse of the MADT
LAPIC entries or the parse of the MADT IO-APIC entries must fail.

Who sets up the Multiple APIC Description Table?  BIOS or kernel?

I'm looking to narrow down the scope of this issue so any data you can provide 
would be very much appreciated.

Comment 17 Jim Paradis 2006-01-25 17:18:20 UTC
The only problem with that scenario is that in get_smp_config() we see:

        if (acpi_lapic && acpi_ioapic) {
                printk(KERN_INFO "Using ACPI (MADT) for SMP configuration information\n");
                return;
        }
        else if (acpi_lapic)
                printk(KERN_INFO "Using ACPI for processor (LAPIC) configuration information\n");

The latter message appears in the console log of the crash, so acpi_lapic must
be set already.  I notice that acpi_ioapic is *not* set, so we may have
something else going on here... continuing to investigate
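
The inference from the log message can be sketched as follows. This is a minimal model rather than kernel code: `smp_config_source` is a hypothetical helper that reproduces only the two branches quoted above and returns the message text instead of calling printk. The real function continues past these branches, which the model represents by returning NULL.

```c
#include <stddef.h>

/* Hypothetical model of the two branches quoted from get_smp_config().
 * Returns the message that would be logged, or NULL when neither quoted
 * branch applies (the real function carries on beyond this point). */
const char *smp_config_source(int acpi_lapic, int acpi_ioapic)
{
    if (acpi_lapic && acpi_ioapic)
        return "Using ACPI (MADT) for SMP configuration information";
    else if (acpi_lapic)
        return "Using ACPI for processor (LAPIC) configuration information";
    return NULL;
}
```

Only the combination acpi_lapic = 1, acpi_ioapic = 0 yields the "processor (LAPIC)" message seen in the crash log, which is why the message alone is enough to conclude that the LAPIC entries were parsed and the IO-APIC flag was left clear.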


Comment 18 Kim Jensen 2006-01-25 19:01:54 UTC
If acpi_lapic is set - does this imply that the kernel was NOT able to parse 
the MADT LAPIC entries?

If acpi_ioapic is not set - does this imply that the kernel was able to parse 
the MADT IO-APIC entries?

From what you have found - Can we (from the BIOS perspective) focus on the MADT
LAPIC entries as the place where there could be an issue?  Or am I reading too
much into this :-)

Thanks for the update!

Comment 19 Jim Paradis 2006-01-26 23:26:20 UTC
No, if acpi_lapic is set it implies that the kernel *was* able to parse the MADT
LAPIC entries.  Same for acpi_ioapic; it is set if we successfully parsed the
MADT IO-APIC entries.

        count = acpi_table_parse(ACPI_APIC, acpi_parse_madt);
        if (count == 1) {

                /*
                 * Parse MADT LAPIC entries
                 */
                error = acpi_parse_madt_lapic_entries();
                if (!error) {
                        acpi_lapic = 1;
                        clustered_apic_check();

                        /*
                         * Parse MADT IO-APIC entries
                         */
                        if (!acpi_disabled) {
                                error = acpi_parse_madt_ioapic_entries();
                                if (!error) {
                                        acpi_irq_model = ACPI_IRQ_MODEL_IOAPIC;
                                        acpi_irq_balance_set(NULL);
                                        acpi_ioapic = 1;

                                        smp_found_config = 1;
                                }
                        }


Comment 20 Jim Paradis 2006-01-27 01:31:39 UTC
Okay, I think I've found the one-line fix for this.

First of all, everything from Comment 16 to Comment 19 is a red herring.

For whatever bizarre reason, the first CPU ID reported in the MADT table is 4,
not 0.  This may or may not be what the BIOS writers intended, but it's
something we should be able to handle.  The BUG that we're tripping over is in
setup_local_APIC():

        if (!apic_id_registered())
                BUG();

The reason this goes off is that the APIC ID of the running CPU is not found in
phys_cpu_present_map.

The call to APIC_init_uniprocessor() is entirely correct; since this is a
uniprocessor kernel (even though it finds multiple CPUs in the MADT), this is
what should be called.

Leading up to the call of setup_local_APIC() is this:

        phys_cpu_present_map = physid_mask_of_physid(0);
        apic_write_around(APIC_ID, boot_cpu_id);

        setup_local_APIC();

The first line in this sequence is incorrect; it should read:

        phys_cpu_present_map = physid_mask_of_physid(boot_cpu_id);

As long as the APIC ID of the boot CPU is 0, the old code just happens to work.
 As soon as it isn't, though, we fall over.
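
The failure mode can be reproduced in a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `physid_mask_t` is reduced to a 64-bit bitmask, `apic_id_registered` takes the map and the ID as explicit arguments (the kernel version takes none), and `old_setup_ok` / `fixed_setup_ok` are hypothetical names for the two sequences discussed above.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's physid mask: bit n set means
 * "a CPU with APIC ID n is present". Valid here for IDs 0..63 only. */
typedef uint64_t physid_mask_t;

physid_mask_t physid_mask_of_physid(unsigned int apic_id)
{
    return (physid_mask_t)1 << apic_id;
}

/* Models the check behind apic_id_registered(): is the running CPU's
 * APIC ID present in the map? */
int apic_id_registered(physid_mask_t present_map, unsigned int apic_id)
{
    return (int)((present_map >> apic_id) & 1);
}

/* Old sequence: the map is always built from APIC ID 0. */
int old_setup_ok(unsigned int boot_cpu_id)
{
    physid_mask_t map = physid_mask_of_physid(0);
    return apic_id_registered(map, boot_cpu_id);
}

/* Proposed sequence: the map is built from the boot CPU's own APIC ID. */
int fixed_setup_ok(unsigned int boot_cpu_id)
{
    physid_mask_t map = physid_mask_of_physid(boot_cpu_id);
    return apic_id_registered(map, boot_cpu_id);
}
```

With a boot CPU ID of 0 both variants return 1. With a boot CPU ID of 4, as on this machine, `old_setup_ok` returns 0 (the condition that triggers BUG()) while `fixed_setup_ok` returns 1.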

This sequence occurs in a few other places in the kernel as well.

I'm testing out a fix now.

Comment 21 Kim Jensen 2006-01-27 17:29:48 UTC
The CPU enumeration is 0, 4, 2, 6, 1, 5, 3, 7

Where the 
        1st core in the CPU0 package = 0
        1st core in the CPU1 package = 4
        2nd core in the CPU0 package = 2
        2nd core in the CPU1 package = 6
        1st core w/ HT in the CPU0 package = 1
        1st core w/ HT in the CPU1 package = 5
        2nd core w/ HT in the CPU0 package = 3
        2nd core w/ HT in the CPU1 package = 7

Does that line up with what you are seeing?  From what I read, you are somehow 
starting with the 1st core in CPU1 instead of the 1st core in CPU0.
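
The list above is consistent with a three-bit layout of the APIC ID: bit 2 selects the package, bit 1 the core, and bit 0 the Hyper-Threading sibling. That layout is an inference from the eight values listed, not something stated by HP or taken from a datasheet, and `decode_apic_id` and `struct apic_topology` are hypothetical names used only for illustration.

```c
/* Fields of an APIC ID under the layout inferred from the list above:
 * bit 2 = package, bit 1 = core, bit 0 = Hyper-Threading sibling. */
struct apic_topology {
    unsigned int package;
    unsigned int core;
    unsigned int thread;
};

struct apic_topology decode_apic_id(unsigned int apic_id)
{
    struct apic_topology t;
    t.package = (apic_id >> 2) & 1;
    t.core    = (apic_id >> 1) & 1;
    t.thread  = apic_id & 1;
    return t;
}
```

Under this reading, APIC ID 4 decodes to package 1, core 0, thread 0, which matches the observation that the kernel starts on the 1st core of the CPU1 package.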




Comment 22 Jim Paradis 2006-01-27 17:34:06 UTC
The kernel is seeing a different order in the MADT.  Snipped from the console
output:

Processor #4 15:6 APIC version 16
Processor #0 15:6 APIC version 16
Processor #2 15:6 APIC version 16
Processor #6 15:6 APIC version 16
Processor #1 15:6 APIC version 16
Processor #3 15:6 APIC version 16
Processor #5 15:6 APIC version 16
Processor #7 15:6 APIC version 16

I confirmed this on the 6400 we have in house


Comment 23 Kim Jensen 2006-01-27 21:11:36 UTC
Three things -


1- Can you tell me where the kernel is picking up the processor #?  So, for
instance, is the #4 value found in the first entry of the MADT table and the #0
value found in the second entry of the MADT table?

2- We would like to have a two-way between HP and RH on Monday (1/30) to 
discuss this as we still do not understand how the kernel is coming up with the 
above enumeration when two processor packages are present.

3- I am able to use the testboot.iso on an xw6400 to startup anaconda with 2 
processor packages.



Comment 24 Jim Paradis 2006-01-27 21:50:41 UTC
The kernel picks up the processor numbers from the LAPIC entries in the MADT. 
It processes the entries in the order they appear in the table and uses the APIC
ID fields to assign APIC IDs to processors.  So yes, the #4 value is found in
the first entry and the #0 value is found in the second entry.

After some more spec-reading and code-reading, I think there may indeed be a
BIOS bug in either the way APIC IDs are assigned to CPUs, or the way that CPUs
are ordered in the table.  I didn't pay attention before to the "acpi_id" field
in the MADT LAPIC entries.  Note this excerpt from the boot log:

ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x03] enabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x05] enabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled)

Note that the ACPI IDs show up in order from 1 to 8, but the APIC IDs are not in
the expected order (4, 0, 2, 6... rather than 0, 2, 4, 6).

Either the table is being built wrong, or the wrong core is coming up as the
boot processor.

The kernel fix I posted is still needed in the general case, but what I describe
above suggests you may want to revisit the BIOS and see if it's really doing
what you intend it to do.
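
For anyone checking a BIOS revision against this, the table order can be read straight from the boot log. The following is a rough user-space sketch, assuming the log line format shown in the excerpt above; `parse_lapic_line` and `first_lapic_id` are hypothetical helpers, not kernel functions.

```c
#include <stdio.h>

/* Parses one boot-log line of the form
 *   "ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)"
 * Returns 1 and fills the outputs on success, 0 otherwise. */
int parse_lapic_line(const char *line, unsigned int *acpi_id,
                     unsigned int *lapic_id)
{
    return sscanf(line, "ACPI: LAPIC (acpi_id[%x] lapic_id[%x]",
                  acpi_id, lapic_id) == 2;
}

/* Returns the APIC ID from the first line that parses as a LAPIC entry,
 * or -1 if no line parses. Lines are expected in log order, which per
 * the explanation above is also MADT table order. */
int first_lapic_id(const char *const lines[], int count)
{
    unsigned int acpi_id, lapic_id;

    for (int i = 0; i < count; i++) {
        if (parse_lapic_line(lines[i], &acpi_id, &lapic_id))
            return (int)lapic_id;
    }
    return -1;
}
```

Fed the eight LAPIC lines from the excerpt, `first_lapic_id` returns 4. A result other than 0 marks a machine on which the unpatched uniprocessor path would trip the BUG().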


Comment 25 Jeff Burke 2006-01-30 15:21:34 UTC
FWIW: I have just updated my xw8400 with a newer motherboard and a newer BIOS, 0.22.
I now see this exact issue on that system as well. Prior to upgrading I was able
to install and use the system.

Here is some additional information that is captured on the serial output.

===========================================

Hello, world!

00:05 Timestamp: 00000000_BB3DBBF0 (982 ms)
  CPUID = F62
  Microcode update signature = 5
PCIExpInit()
  PCIExpNorthInit()
    Enable Parity Error on all Root Ports
    Clear PCI_IRQ_PIN on rootports 0,5,6,7
    Set high BUS numbers for the unused PCIEX rootports 5,6,7
    Init rootports 2,3,4 to INTA
    Set MPS, MRRS, URREN, FERE, NFERE and MCH_CERE
    BSU 0.52 - #5: Disable APIC EOI
    BSU 0.52 - #4: BPRI Hang when running CPU memoryLocks
    MCH Erratum 26 - FSB enters infinite retry when running specific test
    b0:d16:f0[.]Proc Enable Register: ISS suggestion & selftest
  PCIExpHardcodeDownstreamBus
  PCIExpSouthInit()
  PCIExpSetPowerLimits()
Hardware State:
Power Management Registers:
-ESB2 PM Registers:
  PM1_CNT=0000  SLP_TYP=0  S0
  PM1_STS= 0900  PRBTNOR PWRBTN
  PM1_EN=  0000
  STS&EN|R=0900  PRBTNOR PWRBTN
  GPE0_STS=FFFF0000  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
  GPE0_EN= 00000000
  STS&EN|R=00000000
-SIO PM Registers:
  PME_STS= 00 PME_EN= 01 STS&EN=00  PME=0
  PME_STS1=18 PME_EN1=00 STS&EN=00
  PME_STS2=44 PME_EN2=19 STS&EN=00
  PME_STS3=21 PME_EN3=00 STS&EN=00
  PME_STS4=0B PME_EN4=30 STS&EN=00
  PME_STS5=C0 PME_EN5=00 STS&EN=00
  PME_STS6=60 PME_EN6=00 STS&EN=00
  PME_STS7=09 PME_EN7=40 STS&EN=00
  PME_STS8=31 PME_EN8=40 STS&EN=00
  PME_STS9=00 PME_EN9=00 STS&EN=00
Time:  04:40:48 01/27/06
Alarm: 00:00:00
ESB2 GPIO Data:
  LPC_GPIOUSE     = FF0C39C3
  LPC_GPIOIOSEL   = F400FFFF
  LPC_GPIOLVL     = F60F0000
  LPC_GPIINV      = 00002040
  LPC_GPIOUSE_2   = 00000107
  LPC_GPIOIOSEL_2 = 00000300
  LPC_GPIOLVL_2   = 00010300
Initializing memory
00:09 Timestamp: 00000003_A9430D0C (4918 ms)
Setting up stack in memory
Setting up cache
00:09 Timestamp: 00000003_B23BCFA8 (4965 ms)
Returning to real mode
00:36 Timestamp: 00000003_E83D75E8 (5249 ms)
00:36 Timestamp: 00000004_208565B4 (5544 ms)
00:39 Timestamp: 00000007_46FC49B4 (9776 ms)
00:99 Timestamp: 00000007_F63E35E8 (10696 ms)
00:25 Timestamp: 00000008_F63A63B4 (12040 ms)
00:AF Timestamp: 00000009_9710AF48 (12884 ms)
00:AF Timestamp: 00000009_9BE161FC (12909 ms)
00:AF Timestamp: 00000009_A04661FC (12932 ms)
00:26 Timestamp: 00000009_B0699C7C (13017 ms)
00:26 Timestamp: 00000009_B6A2D8E8 (13049 ms)
00:26 Timestamp: 00000009_BC42B9C8 (13079 ms)
00:26 Timestamp: 00000009_C0A776A8 (13102 ms)
00:26 Timestamp: 00000009_C59698C8 (13128 ms)
00:26 Timestamp: 00000009_CA1EA0E8 (13152 ms)
00:26 Timestamp: 00000009_EB3CBCFC (13325 ms)
00:26 Timestamp: 00000009_EFA381B4 (13349 ms)
00:26 Timestamp: 00000009_F40A3CE8 (13372 ms)
00:26 Timestamp: 00000009_F86D3FFC (13395 ms)
00:26 Timestamp: 0000000A_041774B4 (13456 ms)
00:26 Timestamp: 0000000A_087C97E8 (13479 ms)
00:2B Timestamp: 0000000A_134A36C8 (13536 ms)
00:2B Timestamp: 00000021_E8BA677C (45556 ms)


Hello, world!

00:05 Timestamp: 00000000_BE0B1614 (997 ms)
  CPUID = F62
  Microcode update signature = 5
PCIExpInit()
  PCIExpNorthInit()
    Enable Parity Error on all Root Ports
    Clear PCI_IRQ_PIN on rootports 0,5,6,7
    Set high BUS numbers for the unused PCIEX rootports 5,6,7
    Init rootports 2,3,4 to INTA
    Set MPS, MRRS, URREN, FERE, NFERE and MCH_CERE
    BSU 0.52 - #5: Disable APIC EOI
    BSU 0.52 - #4: BPRI Hang when running CPU memoryLocks
    MCH Erratum 26 - FSB enters infinite retry when running specific test
    b0:d16:f0[.]Proc Enable Register: ISS suggestion & selftest
  PCIExpHardcodeDownstreamBus
  PCIExpSouthInit()
  PCIExpSetPowerLimits()
Hardware State:
Power Management Registers:
-ESB2 PM Registers:
  PM1_CNT=0000  SLP_TYP=0  S0
  PM1_STS= 0000
  PM1_EN=  0000
  STS&EN|R=0000
  GPE0_STS=F7FF0000  31 30 29 28 26 25 24 23 22 21 20 19 18 17 16
  GPE0_EN= 00000000
  STS&EN|R=00000000
-SIO PM Registers:
  PME_STS= 00 PME_EN= 00 STS&EN=00  PME=0
  PME_STS1=18 PME_EN1=00 STS&EN=00
  PME_STS2=00 PME_EN2=19 STS&EN=00
  PME_STS3=01 PME_EN3=00 STS&EN=00
  PME_STS4=03 PME_EN4=30 STS&EN=00
  PME_STS5=00 PME_EN5=00 STS&EN=00
  PME_STS6=60 PME_EN6=00 STS&EN=00
  PME_STS7=08 PME_EN7=40 STS&EN=00
  PME_STS8=28 PME_EN8=40 STS&EN=00
  PME_STS9=00 PME_EN9=00 STS&EN=00
Time:  04:42:38 01/27/06
Alarm: 00:00:00
ESB2 GPIO Data:
  LPC_GPIOUSE     = FF0C39C3
  LPC_GPIOIOSEL   = F400FFFF
  LPC_GPIOLVL     = F60F0000
  LPC_GPIINV      = 00002040
  LPC_GPIOUSE_2   = 00000107
  LPC_GPIOIOSEL_2 = 00000300
  LPC_GPIOLVL_2   = 00010300
Initializing memory
00:09 Timestamp: 00000003_AB30C954 (4928 ms)
Setting up stack in memory
Setting up cache
00:09 Timestamp: 00000003_B4277FC8 (4975 ms)
Returning to real mode
00:36 Timestamp: 00000003_EA1E1E30 (5259 ms)
00:36 Timestamp: 00000004_225C0530 (5554 ms)
00:39 Timestamp: 00000007_44E3A990 (9765 ms)
00:99 Timestamp: 00000007_F3882FB0 (10682 ms)
00:25 Timestamp: 00000008_F28F3210 (12020 ms)
00:AF Timestamp: 00000009_8DDE05D0 (12835 ms)
00:AF Timestamp: 00000009_92B5BAD8 (12861 ms)
00:AF Timestamp: 00000009_971E6FB0 (12884 ms)
00:26 Timestamp: 00000009_A73A28B0 (12968 ms)
00:26 Timestamp: 00000009_AD6F6844 (13001 ms)
00:26 Timestamp: 00000009_B30F2350 (13031 ms)
00:26 Timestamp: 00000009_B7700158 (13054 ms)
00:26 Timestamp: 00000009_BC163264 (13078 ms)
00:26 Timestamp: 00000009_C09E2944 (13102 ms)
00:26 Timestamp: 00000009_D4132310 (13204 ms)
00:26 Timestamp: 00000009_D8721B90 (13227 ms)
00:26 Timestamp: 00000009_DCD6FDE4 (13250 ms)
00:26 Timestamp: 00000009_E13F6CE4 (13273 ms)
00:26 Timestamp: 00000009_ED71E944 (13337 ms)
00:26 Timestamp: 00000009_F1D8F764 (13360 ms)
00:2B Timestamp: 00000009_FC68FAE4 (13416 ms)
00:2B Timestamp: 0000001C_E2EA7264 (38808 ms)
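
A rough way to see where the time goes in the checkpoint log above is to compute the gap between consecutive entries. The sketch below assumes the first field is a checkpoint code and the value in parentheses is elapsed milliseconds; neither is confirmed in this report. Only a few of the lines are copied in, and the helper name largest_gap is made up for illustration.

```python
import re

# A handful of lines copied from the log above (not the full dump).
LOG = """\
00:39 Timestamp: 00000007_44E3A990 (9765 ms)
00:99 Timestamp: 00000007_F3882FB0 (10682 ms)
00:25 Timestamp: 00000008_F28F3210 (12020 ms)
00:26 Timestamp: 00000009_F1D8F764 (13360 ms)
00:2B Timestamp: 00000009_FC68FAE4 (13416 ms)
00:2B Timestamp: 0000001C_E2EA7264 (38808 ms)
"""

# Capture the leading code and the millisecond value in parentheses.
LINE_RE = re.compile(r"^(\S+) Timestamp: \S+ \((\d+) ms\)$", re.MULTILINE)

def largest_gap(text):
    """Return (gap_ms, code_before, code_after) for the longest stall."""
    points = [(code, int(ms)) for code, ms in LINE_RE.findall(text)]
    gaps = [
        (b_ms - a_ms, a_code, b_code)
        for (a_code, a_ms), (b_code, b_ms) in zip(points, points[1:])
    ]
    return max(gaps)

gap_ms, before, after = largest_gap(LOG)
print(f"largest gap: {gap_ms} ms between {before} and {after}")
```

On these lines the largest gap is the roughly 25 seconds between the two 00:2B entries.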


Comment 27 Chris Williams 2006-02-10 20:16:45 UTC
HP's latest findings:

1- Install RHEL4U3-x86_64 on an xw8400 with an FX540 or FX1400 and 16GB of RAM.

After the install completes, boot the RHEL4U3-x86_64 UP kernel with "mem=16G".
The system boots in the usual amount of time.  Now boot the UP kernel without
the mem kernel option and the system is extremely slow to boot (appears hung).

What's the difference (to the kernel) between 16G of physical memory and mem=16G?

2- Try the same thing at install time, e.g. "boot: linux mem=16G", and the system
panics:

...
RAMDISK: Couldn't find a valid RAM disk image starting at 0
EXT2-fs: Unable to read superblock
isofs_fill_super bread failed, dev=md1, iso_blknum=16, block=32
kernel panic - not syncing: VFS: Unable to mount root fs on unknown block(9,1)

Comment 28 Chris Williams 2006-02-10 20:17:41 UTC
From the HP BIOS engineer:
The mapping looks correct to me. You have ranges 0-3.25G and 4-16.75G allocated
to RAM, for a total of 3.25+12.75=16G. It looks like remapping is off on the xw9300.
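
One possible reading of the question in Comment 27, as a rough sketch: assume that mem=16G caps the highest physical address the kernel will use, rather than the amount of RAM. That is an assumption about this kernel, not something confirmed in this report. Under it, the two ranges quoted above lose whatever lies above the 16G mark. The range values come from the comment; the helper name usable_below is made up for illustration.

```python
# RAM ranges in GiB as quoted by the HP BIOS engineer: (start, end).
RANGES_GIB = [(0.0, 3.25), (4.0, 16.75)]

def usable_below(ranges, cap):
    """Sum the parts of each (start, end) range that lie below `cap`."""
    return sum(max(0.0, min(end, cap) - start) for start, end in ranges)

# No cap: all RAM the BIOS reports.
total = usable_below(RANGES_GIB, float("inf"))

# Assumed effect of mem=16G: nothing above the 16 GiB address is used.
capped = usable_below(RANGES_GIB, 16.0)

print(f"uncapped: {total} GiB, capped at 16G: {capped} GiB, "
      f"dropped: {total - capped} GiB")
```

If the assumption holds, the kernel booted with mem=16G would see 15.25G and never touch the 0.75G that sits above the 16G address.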

Comment 29 Chris Williams 2006-02-10 20:19:18 UTC
From HP:

As stated in the initial summary - if I switch from the FX540 card to the FX1300
card, the xw8400 w/ 16G comes up quickly and normally.  The e820 map when the
FX1300 is present looks a bit different:
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bffde800 (usable)
BIOS-e820: 00000000bffde800 - 00000000c0000000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000440000000 (usable) 
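
For comparison with the figures in Comment 28, here is a minimal sketch that totals the usable ranges in the map above and reports the highest usable address. The map text is copied from this comment; the regular expression and the helper name parse_e820 are illustrative and not taken from any kernel tool.

```python
import re

E820_MAP = """\
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000bffde800 (usable)
BIOS-e820: 00000000bffde800 - 00000000c0000000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000440000000 (usable)
"""

# start address, end address, and region type, all as printed by the kernel.
LINE_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((\w+)\)")

def parse_e820(text):
    """Return a list of (start, end, kind) tuples with integer addresses."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3))
        for m in LINE_RE.finditer(text)
    ]

entries = parse_e820(E820_MAP)
usable = [(start, end) for start, end, kind in entries if kind == "usable"]
usable_bytes = sum(end - start for start, end in usable)
top_of_ram = max(end for _, end in usable)

print(f"usable: {usable_bytes / 2**30:.3f} GiB")
print(f"highest usable address: {top_of_ram / 2**30:.2f} GiB")
```

With the FX1300 the usable RAM still adds up to just under 16 GiB, but the last range ends at the 17 GiB address (0x440000000), so about 1 GiB of it sits above the 16G mark.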

Comment 30 Jeff Burke 2006-02-10 22:44:34 UTC
I have just installed the 0.25 BIOS that I received from Jeff Burrell. Both RHEL3
U7 x86_64 and RHEL4 x86_64 installed successfully.

I believe that using the latest BIOS, 0.25, is a valid workaround.

Note that the xw8400 and the xw6400 require different BIOS packages. They are
no longer the same image.

Also, I believe that we should open a new bug for Comment #29. We have a system
here (Jeff Needle) with the same issue. Booting a machine w/16 gig, the system
crawls. If you add mem=16G the system operates normally.

Comment 35 Chris Williams 2006-03-10 20:30:14 UTC
Created attachment 125960 [details]
xw6400 BIOS update

Comment 36 Chris Williams 2006-03-10 20:31:45 UTC
Created attachment 125961 [details]
xw8400 BIOS update

Comment 38 Jim Paradis 2006-03-16 22:41:40 UTC
I applied the BIOS update from Comment 36 to the xw8400 we have here in
Westford, and promptly bricked the system.  Any idea why this would have
happened?  Could we obtain an actual flash image to program into the ROM?  We
might have access to a ROM burner here...


Comment 39 Jim Paradis 2006-03-23 20:05:57 UTC
The fix for the initial issue was posted to rhkernel-list and is pending for
inclusion in a future update, so I am changing the bug status to POST.

As for the issue from Comment 29 onward, that has been filed as a separate
issue, Bug 181349.  This issue has been determined to be an HP BIOS bug and has
been closed as NOTABUG.


Comment 40 Jeff Burrell 2006-03-23 23:18:33 UTC
Created attachment 126584 [details]
HP xw8400 system BIOS update package, version 0.28

To use this package, you should copy the file to a DOS-bootable media (e.g.
floppy, USB key).  Reboot the system and allow the system to boot to DOS from
that media. Once in DOS, simply execute this file to update the system BIOS
image to version 0.28.

Comment 41 Bob Johnson 2006-04-11 16:24:35 UTC
This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 4.4 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 4.4 release.

Comment 43 Jeff Burke 2006-05-02 18:25:37 UTC
Upgrading the BIOS from version .27 to .35 caused my xw6400 system to stop
booting RHEL4 x86_64. Below is the last thing I see printed to the console.

CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU0: Physical Processor ID: 0
CPU0: Processor Core ID: 0
CPU0: Initial APIC ID: 0

Going back to BIOS .27 allows the system to boot again. Perhaps someone at HP
could let me know what version of the BIOS we should be running?

-Jeff

Comment 44 Jeff Burke 2006-05-08 16:56:43 UTC
FYI:
Jeff Burrell (HP) sent me version 0.32 for the xw6400. Loading this worked.

Comment 45 Jason Baron 2006-05-09 15:01:11 UTC
committed in stream U4 build 35.4. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 50 Red Hat Bugzilla 2006-08-10 21:48:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html


