From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
While trying to do a PXE install of RHEL4-U3 re1221.0, the xw6400 system panics.

Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
<0>Kernel panic - not syncing: Oops
<1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8014ff5e>{__queue_work+14}

Version-Release number of selected component (if applicable):
kernel-2.6.9-27.EL

How reproducible:
Always

Steps to Reproduce:
1. Using the xw6400 hardware, try to install RHEL4-U3-re1221.0 AS x86_64.

Actual Results:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at apic:305
invalid operand: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8054596d>] <ffffffff8054596d>{setup_local_APIC+27}
RSP: 0000:0000010002173ef8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000050014 RCX: ffffffff8042d670
RDX: 0000000000000000 RSI: ffffffff8042d670 RDI: ffffffff8036bdee
RBP: 0000000000000014 R08: 000000013ffcd9c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: 0000000000000004 0000000004040800 0000000000000000 ffffffff80545eb2
       0000000000000001 0000000000000000 000000013ffcd9c0 ffffffff8010c4aa
       ffffffff8010c3d4 0000000004040800
Call Trace:
<ffffffff80545eb2>{APIC_init_uniprocessor+133}
<ffffffff8010c4aa>{init+214}
<ffffffff8010c3d4>{init+0}
<ffffffff801114bf>{child_rip+8}
<ffffffff8010c3d4>{init+0}
<ffffffff801114b7>{child_rip+0}

Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
<0>Kernel panic - not syncing: Oops
<1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff8014ff5e>{__queue_work+14}
PML4 0
Oops: 0000 [2]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8014ff5e>] <ffffffff8014ff5e>{__queue_work+14}
RSP: 0000:ffffffff804a4358 EFLAGS: 00010046
RAX: ffffffff8044c188 RBX: 0000000000000000 RCX: ffffffff804f1be0
RDX: 0000000000000000 RSI: ffffffff8044c180 RDI: 0000000000000000
RBP: ffffffff804a4398 R08: ffffffff804a4398 R09: 0000000000000246
R10: 0000000000000246 R11: 0000010002173bc8 R12: 0000000000000062
R13: 0000010002173bc8 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: ffffffff804a4398 0000000000000246 ffffffff8042f560 ffffffff8015009d
       0000000000000246 ffffffff801446a1 0000000000000000 0000000000000246
       ffffffff804a4398 ffffffff804a4398
Call Trace:<IRQ>
<ffffffff8015009d>{queue_work+41}
<ffffffff801446a1>{run_timer_softirq+591}
<ffffffff8013faec>{__do_softirq+76}
<ffffffff8013fb73>{do_softirq+49}
<ffffffff80113c57>{do_IRQ+664}
<ffffffff80110fdb>{ret_from_intr+0}
<EOI>
<ffffffff8021a58a>{__delay+9}
<ffffffff8013954a>{panic+535}
<ffffffff8011214e>{oops_end+159}
<ffffffff80112169>{oops_end+186}
<ffffffff8011225a>{die+54}
<ffffffff801125e2>{do_invalid_op+145}
<ffffffff8054596d>{setup_local_APIC+27}
<ffffffff80111309>{error_exit+0}
<ffffffff8054596d>{setup_local_APIC+27}
<ffffffff80545969>{setup_local_APIC+23}
<ffffffff80545eb2>{APIC_init_uniprocessor+133}
<ffffffff8010c4aa>{init+214}
<ffffffff8010c3d4>{init+0}
<ffffffff801114bf>{child_rip+8}
<ffffffff8010c3d4>{init+0}
<ffffffff801114b7>{child_rip+0}

Code: 48 81 3f 3c 4b 24 1d 48 89 f9 ba 52 00 00 00 0f 85 a2 00 00
RIP <ffffffff8014ff5e>{__queue_work+14} RSP <ffffffff804a4358>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Oops
IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23

Expected Results:
System should install without a kernel panic.

Additional info:
We have two xw6400 prototype workstations: one is in RH-Westford (Jeff Burke) and one is in RH-Raleigh (Chris Williams). These workstations are a smaller form factor version of the xw8400, but are not the same. They are based on the Intel "Greencreek" MCH, ESB-2 southbridge and "Dempsey" processor.

Partner: Hewlett-Packard Co.
RH Partner manager: Ron Pacheco

System/component description:
xw6400 prototype workstation - based on Intel "Greencreek" MCH and ESB2 southbridge

Component list:
- xw6400 base platform (motherboard, chassis, cables, power supply)
- (2) dual core Intel Dempsey processors
- (2) 512 MB FB-DIMM, for a total of 1 GB
- (1) 80 GB SATA-II disk
- (1) DVD-ROM drive
- (1) floppy drive
- (1) Nvidia NVS285 PCI-E graphics card

FWIW: RHEL3-U7 installs and runs.
Created attachment 122597 [details] Console output of bootup just before the kernel panic
I smell a BIOS issue...does the box in question have the latest BIOS available? What previous RHEL or FC versions are known to install on this box?
According to HP it is at the latest and greatest. As of right now I know that RHEL3-U7 installs and runs. I just received this system the Thursday before the holiday break. I was told it is "New" hardware. I believe we only have two of these systems. One is with Chris W in RDU and I have the other. Not sure if this data is helpful right now, but here is the lspci and lspci -n output from RHEL3-U7 on that system.

00:00.0 Host bridge: Intel Corporation Workstation Memory Controller Hub (rev 11)
00:02.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 2 (rev 11)
00:03.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 3 (rev 11)
00:04.0 PCI bridge: Intel Corporation Server PCI Express x16 Port 4-7 (rev 11)
00:05.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 5 (rev 11)
00:06.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 6 (rev 11)
00:07.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 7 (rev 11)
00:10.0 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:10.1 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:10.2 Host bridge: Intel Corporation Server Error Reporting Registers (rev 11)
00:11.0 Host bridge: Intel Corporation Reserved Registers (rev 11)
00:13.0 Host bridge: Intel Corporation Reserved Registers (rev 11)
00:15.0 Host bridge: Intel Corporation Server FBD Registers (rev 11)
00:16.0 Host bridge: Intel Corporation Server FBD Registers (rev 11)
00:1b.0 Audio device: Intel Corporation Enterprise Southbridge High Definition Audio (rev 09)
00:1c.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #3 (rev 09)
00:1d.3 USB Controller: Intel Corporation Enterprise Southbridge UHCI USB #4 (rev 09)
00:1d.7 USB Controller: Intel Corporation Enterprise Southbridge EHCI USB (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation Enterprise Southbridge LPC (rev 09)
00:1f.1 IDE interface: Intel Corporation Enterprise Southbridge PATA (rev 09)
00:1f.2 IDE interface: Intel Corporation Enterprise Southbridge SATA cc=IDE (rev 09)
10:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Upstream Port (rev 01)
10:00.3 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express to PCI-X Bridge (rev 01)
1e:00.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E1 (rev 01)
1e:01.0 PCI bridge: Intel Corporation Enterprise Southbridge PCI Express Downstream Port E2 (rev 01)
1f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit Ethernet PCI Express (rev 01)
40:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)

00:00.0 Class 0600: 8086:25c0 (rev 11)
00:02.0 Class 0604: 8086:25e2 (rev 11)
00:03.0 Class 0604: 8086:25e3 (rev 11)
00:04.0 Class 0604: 8086:25fa (rev 11)
00:05.0 Class 0604: 8086:25e5 (rev 11)
00:06.0 Class 0604: 8086:25e6 (rev 11)
00:07.0 Class 0604: 8086:25e7 (rev 11)
00:10.0 Class 0600: 8086:25f0 (rev 11)
00:10.1 Class 0600: 8086:25f0 (rev 11)
00:10.2 Class 0600: 8086:25f0 (rev 11)
00:11.0 Class 0600: 8086:25f1 (rev 11)
00:13.0 Class 0600: 8086:25f3 (rev 11)
00:15.0 Class 0600: 8086:25f5 (rev 11)
00:16.0 Class 0600: 8086:25f6 (rev 11)
00:1b.0 Class 0403: 8086:269a (rev 09)
00:1c.0 Class 0604: 8086:2690 (rev 09)
00:1d.0 Class 0c03: 8086:2688 (rev 09)
00:1d.1 Class 0c03: 8086:2689 (rev 09)
00:1d.2 Class 0c03: 8086:268a (rev 09)
00:1d.3 Class 0c03: 8086:268b (rev 09)
00:1d.7 Class 0c03: 8086:268c (rev 09)
00:1e.0 Class 0604: 8086:244e (rev d9)
00:1f.0 Class 0601: 8086:2670 (rev 09)
00:1f.1 Class 0101: 8086:269e (rev 09)
00:1f.2 Class 0101: 8086:2680 (rev 09)
10:00.0 Class 0604: 8086:3500 (rev 01)
10:00.3 Class 0604: 8086:350c (rev 01)
1e:00.0 Class 0604: 8086:3510 (rev 01)
1e:01.0 Class 0604: 8086:3514 (rev 01)
1f:00.0 Class 0200: 14e4:1600 (rev 01)
40:00.0 Class 0300: 10de:0165 (rev a1)
Chris Williams reports problems w/ RHEL3 on the xw6400 box as well, unless you use "noapic". That would seem to match the crash info from comment 1. Can you install RHEL4 if you use "noapic" on the kernel command line?
Strange that I can use RHEL3-U7 with no issues. What update is he using? Here are the results of trying RHEL4 with the noapic option during boot.

Bootdata ok (command line is initrd=redhat/rhel4-x86_64-AS/images/pxeboot/initrd.img ks=http://qafiler.boston.redhat.com/Netboot/kickstarts/rhel4-x86_64-AS.ks ramdisk_size=10000 BOOT_IMAGE=redhat/rhel4-x86_64-AS/images/pxeboot/vmlinuz console=ttyS0,115200 console=tty0 noapic)
Linux version 2.6.9-27.EL (bhcompile.redhat.com) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 Tue Dec 20 19:11:47 EST 2005
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ffde800 (usable)
 BIOS-e820: 000000003ffde800 - 0000000040000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000e004cc96 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
No mptable found.
DMI 2.4 present.
ACPI: PM-Timer IO Port: 0xf808
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)
Processor #4 15:6 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] enabled)
Processor #0 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Processor #2 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
Processor #6 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x01] enabled)
Processor #1 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x03] enabled)
Processor #3 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x05] enabled)
Processor #5 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled)
Processor #7 15:6 APIC version 16
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
Setting APIC routing to flat
ACPI: Skipping IOAPIC probe due to 'noapic' option.
Using ACPI for processor (LAPIC) configuration information
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: HP <6>Product ID: Workstation <6>APIC at: 0xFEE00000
I/O APIC #1 Version 17 at 0xFEC00000.
I/O APIC #2 Version 17 at 0xFEC10000.
Processors: 1
Checking aperture...
Built 1 zonelists
Kernel command line: initrd=redhat/rhel4-x86_64-AS/images/pxeboot/initrd.img ks=http://qafiler.boston.redhat.com/Netboot/kickstarts/rhel4-x86_64-AS.ks ramdisk_size=10000 BOOT_IMAGE=redhat/rhel4-x86_64-AS/images/pxeboot/vmlinuz console=ttyS0,115200 console=tty0 noapic
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 3200.270 MHz processor.
time.c: Using PIT/TSC based timekeeping.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 1021616k/1048440k available (2406k kernel code, 26152k reserved, 1306k data, 164k init)
Calibrating delay using timer specific routine.. 6406.89 BogoMIPS (lpj=3203448)
Security Scaffold v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU: Genuine Intel(R) CPU 3.20GHz stepping 02
ACPI: IRQ9 SCI: Edge set to Level Trigger.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at apic:305
invalid operand: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-27.EL
RIP: 0010:[<ffffffff8054596d>] <ffffffff8054596d>{setup_local_APIC+27}
RSP: 0000:0000010002173ef8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000050014 RCX: ffffffff8042d670
RDX: 0000000000000000 RSI: ffffffff8042d670 RDI: ffffffff8036bdee
RBP: 0000000000000014 R08: 0000000100000246 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff80538980(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 0000010002172000, task 0000010002171110)
Stack: 0000000000000004 0000000004040800 0000000000000000 ffffffff80545eb2
       0000000000000001 0000000000000000 0000000100000246 ffffffff8010c4aa
       ffffffff8010c3d4 0000000004040800
Call Trace:
<ffffffff80545eb2>{APIC_init_uniprocessor+133}
<ffffffff8010c4aa>{init+214}
<ffffffff8010c3d4>{init+0}
<ffffffff801114bf>{child_rip+8}
<ffffffff8010c3d4>{init+0}
<ffffffff801114b7>{child_rip+0}

Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
<0>Kernel panic - not syncing: Oops
I just tried to install U3 x86_64 (RHEL4-U3-re1221.0) on the 6400 I have and it panicked, with the exact same message as comment #7. It panics with and without noapic.
...
Code: 0f 0b 3c c1 36 80 ff ff ff ff 31 01 48 8b 05 18 0e ee ff ff
RIP <ffffffff8054596d>{setup_local_APIC+27} RSP <0000010002173ef8>
<0>Kernel panic - not syncing: Oops
Chris also reports that using an i686 distro works, so the problem would seem to relate to the x86_64 arch support.
cww ok, I tested RHEL 3 x86_64 and it works fine so here is the break down
cww RHEL4 x86 = works ok
cww RHEL4 x86-64 = panics always
cww RHEL3 x86 = SMP needs noapic UP is fine
cww RHEL3 x86_64 = works ok
For the x86_64 version: This crash happens when a CPU tries to set up an APIC in flat mode whose ID does not match the APIC ID of any CPU on the system. I notice that the first CPU seen when we come up has APIC ID 4, not 0, and it's the only CPU in use. I'm wondering if this is another CPU-enumeration issue. Continuing to investigate.
The RHEL4U3 x86_64 panic described in comment #7 occurs on xw6400 AND xw8400 (with the latest 0.22 BIOS) when two processor packages are present. If you remove one processor package the panic goes away. Any idea why this panic is occurring?
In reading through the smpboot.c code, I believe we are getting to

    if (!smp_found_config && !acpi_lapic) .... APIC_init_uniprocessor

In order for !acpi_lapic to be true, it looks as though the parse of the MADT LAPIC entries or the parse of the MADT IO-APIC entries must fail. Who sets up the Multiple APIC Description Table? BIOS or kernel? I'm looking to narrow down the scope of this issue, so any data you can provide would be very much appreciated.
The only problem with that scenario is that in get_smp_config() we see:

    if (acpi_lapic && acpi_ioapic) {
        printk(KERN_INFO "Using ACPI (MADT) for SMP configuration information\n");
        return;
    } else if (acpi_lapic)
        printk(KERN_INFO "Using ACPI for processor (LAPIC) configuration information\n");

The latter message appears in the console log of the crash, so acpi_lapic must be set already. I notice that acpi_ioapic is *not* set, so we may have something else going on here... continuing to investigate
If acpi_lapic is set - does this imply that the kernel was NOT able to parse the MADT LAPIC entries? If acpi_ioapic is not set - does this imply that the kernel was able to parse the MADT IO-APIC entries? From what you have found - can we (from the BIOS perspective) focus on the MADT LAPIC entries as the place where there could be an issue? Or am I reading too much into this :-) Thanks for the update!
No, if acpi_lapic is set it implies that the kernel *was* able to parse the MADT LAPIC entries. Same for acpi_ioapic; it is set if we successfully parsed the MADT IO-APIC entries.

    count = acpi_table_parse(ACPI_APIC, acpi_parse_madt);
    if (count == 1) {
        /*
         * Parse MADT LAPIC entries
         */
        error = acpi_parse_madt_lapic_entries();
        if (!error) {
            acpi_lapic = 1;
            clustered_apic_check();
            /*
             * Parse MADT IO-APIC entries
             */
            if (!acpi_disabled) {
                error = acpi_parse_madt_ioapic_entries();
                if (!error) {
                    acpi_irq_model = ACPI_IRQ_MODEL_IOAPIC;
                    acpi_irq_balance_set(NULL);
                    acpi_ioapic = 1;
                    smp_found_config = 1;
                }
            }
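To make the flag logic from the last few comments easier to follow, here is a small user-space model of it. This is only a sketch: the struct, enum and function names are made up for illustration, the parse results are passed in as plain error codes instead of coming from the real MADT parsers, and it mirrors nothing beyond the two excerpts quoted in this bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three kernel flags discussed above. */
struct acpi_flags_sketch {
    bool acpi_lapic;
    bool acpi_ioapic;
    bool smp_found_config;
};

/* Which branch of the quoted get_smp_config() excerpt would be taken. */
enum smp_source_sketch {
    SMP_SOURCE_NONE,        /* neither message is printed */
    SMP_SOURCE_LAPIC_ONLY,  /* "Using ACPI for processor (LAPIC) ..." */
    SMP_SOURCE_MADT         /* "Using ACPI (MADT) for SMP ..." */
};

/*
 * Mirrors the quoted excerpt: a flag is set only when the matching
 * parse step returned no error, and the IO-APIC step is attempted
 * only after the LAPIC step succeeded and ACPI is not disabled.
 */
static struct acpi_flags_sketch derive_flags_sketch(int lapic_error,
                                                    int ioapic_error,
                                                    bool acpi_disabled)
{
    struct acpi_flags_sketch f = { false, false, false };

    assert(lapic_error >= 0 && ioapic_error >= 0); /* 0 means success here */
    if (!lapic_error) {
        f.acpi_lapic = true;
        if (!acpi_disabled && !ioapic_error) {
            f.acpi_ioapic = true;
            f.smp_found_config = true;
        }
    }
    return f;
}

static enum smp_source_sketch smp_source_sketch(struct acpi_flags_sketch f)
{
    if (f.acpi_lapic && f.acpi_ioapic)
        return SMP_SOURCE_MADT;
    if (f.acpi_lapic)
        return SMP_SOURCE_LAPIC_ONLY;
    return SMP_SOURCE_NONE;
}
```

With a successful LAPIC parse and no IO-APIC result, the model lands on the LAPIC-only branch, which is the message seen in the console log of the crash.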
Okay, I think I've found the one-line fix for this.

First of all, everything from Comment 16 to Comment 19 is a red herring. For whatever bizarre reason, the first CPU ID reported in the MADT table is 4, not 0. This may or may not be what the BIOS writers intended, but it's something we should be able to handle.

The BUG that we're tripping over is in setup_local_APIC():

    if (!apic_id_registered())
        BUG();

The reason this goes off is that the APIC ID of the running CPU is not found in phys_cpu_present_map. The call to APIC_init_uniprocessor() is entirely correct; since this is a uniprocessor kernel (even though it finds multiple CPUs in the MADT), this is what should be called. Leading up to the call of setup_local_APIC() is this:

    phys_cpu_present_map = physid_mask_of_physid(0);
    apic_write_around(APIC_ID, boot_cpu_id);
    setup_local_APIC();

The first line in this sequence is incorrect; it should read:

    phys_cpu_present_map = physid_mask_of_physid(boot_cpu_id);

As long as the APIC ID of the boot CPU is 0, the old code just happens to work. As soon as it isn't, though, we fall over. This sequence occurs in a few other places in the kernel as well. I'm testing out a fix now.
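The failure mode described above can be reproduced in a few lines of user-space C. This is a simplified model, not the kernel code: the present map is reduced to a 64-bit bitmap, and every name ending in _sketch is a hypothetical stand-in for the kernel symbol it resembles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's physid mask: one bit per APIC ID. */
typedef uint64_t physid_mask_sketch_t;

/* Stand-in for physid_mask_of_physid(): a mask with only `id` set. */
static physid_mask_sketch_t mask_of_physid_sketch(unsigned int id)
{
    assert(id < 64); /* this model only covers APIC IDs 0..63 */
    return (physid_mask_sketch_t)1 << id;
}

/* Stand-in for apic_id_registered(): is the running CPU's ID in the map? */
static bool apic_id_registered_sketch(physid_mask_sketch_t present_map,
                                      unsigned int running_apic_id)
{
    assert(running_apic_id < 64);
    return ((present_map >> running_apic_id) & 1u) != 0;
}

/* Old sequence: the present map is always seeded with APIC ID 0. */
static bool up_init_old_sketch(unsigned int boot_cpu_id)
{
    physid_mask_sketch_t present_map = mask_of_physid_sketch(0);
    return apic_id_registered_sketch(present_map, boot_cpu_id);
}

/* Proposed sequence: the present map is seeded with the boot CPU's ID. */
static bool up_init_fixed_sketch(unsigned int boot_cpu_id)
{
    physid_mask_sketch_t present_map = mask_of_physid_sketch(boot_cpu_id);
    return apic_id_registered_sketch(present_map, boot_cpu_id);
}
```

In this model a false return corresponds to the BUG() firing. Calling up_init_old_sketch(4) returns false, matching the xw6400 case where the boot CPU has APIC ID 4, while up_init_fixed_sketch(4) returns true.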
The CPU enumeration is 0, 4, 2, 6, 1, 5, 3, 7, where:

1st core in the CPU0 package = 0
1st core in the CPU1 package = 4
2nd core in the CPU0 package = 2
2nd core in the CPU1 package = 6
1st core w/ HT in the CPU0 package = 1
1st core w/ HT in the CPU1 package = 5
2nd core w/ HT in the CPU0 package = 3
2nd core w/ HT in the CPU1 package = 7

Does that line up with what you are seeing? From what I read, you are somehow starting with the 1st core in CPU1 instead of the 1st core in CPU0.
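Reading the list above as a bit layout, the eight IDs are consistent with bit 2 selecting the package, bit 1 the core, and bit 0 the HT sibling. That layout is inferred from the eight values given for this platform only, not taken from a spec, and the names in this sketch are hypothetical.

```c
#include <assert.h>

/* Hypothetical decoded view of one APIC ID on this 2-package system. */
struct apic_topology_sketch {
    unsigned int package; /* 0 = CPU0 package, 1 = CPU1 package */
    unsigned int core;    /* 0 = 1st core, 1 = 2nd core */
    unsigned int thread;  /* 0 = primary thread, 1 = HT sibling */
};

/*
 * Split an APIC ID into package/core/thread, assuming the layout
 * inferred from the enumeration above: bit 2 = package,
 * bit 1 = core, bit 0 = HT sibling.
 */
static struct apic_topology_sketch decode_apic_id_sketch(unsigned int apic_id)
{
    struct apic_topology_sketch t;

    assert(apic_id < 8); /* only the eight IDs listed above are modelled */
    t.package = (apic_id >> 2) & 1u;
    t.core    = (apic_id >> 1) & 1u;
    t.thread  = apic_id & 1u;
    return t;
}
```

Under that assumption, APIC ID 4 decodes to the 1st core of the CPU1 package with no HT bit set, which is the CPU that comes up first in the console log.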
The kernel is seeing a different order in the MADT. Snipped from the console output:

Processor #4 15:6 APIC version 16
Processor #0 15:6 APIC version 16
Processor #2 15:6 APIC version 16
Processor #6 15:6 APIC version 16
Processor #1 15:6 APIC version 16
Processor #3 15:6 APIC version 16
Processor #5 15:6 APIC version 16
Processor #7 15:6 APIC version 16

I confirmed this on the 6400 we have in house.
Three things:

1- Can you tell me where the kernel is picking up the processor #? So, for instance, is the #4 value found in the first entry of the MADT table and the #0 value found in the second entry of the MADT table?
2- We would like to have a two-way between HP and RH on Monday (1/30) to discuss this, as we still do not understand how the kernel is coming up with the above enumeration when two processor packages are present.
3- I am able to use the testboot.iso on an xw6400 to start up anaconda with 2 processor packages.
The kernel picks up the processor numbers from the LAPIC entries in the MADT. It processes the entries in the order they appear in the table and uses the APIC ID fields to assign APIC IDs to processors. So yes, the #4 value is found in the first entry and the #0 value is found in the second entry.

After some more spec-reading and code-reading, I think there may indeed be a BIOS bug in either the way APIC IDs are assigned to CPUs, or the way that CPUs are ordered in the table. I didn't pay attention before to the "acpi_id" field in the MADT LAPIC entries. Note this excerpt from the boot log:

ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x06] enabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x03] enabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x05] enabled)
ACPI: LAPIC (acpi_id[0x08] lapic_id[0x07] enabled)

Note that the ACPI IDs show up in order from 1 to 8, but the APIC IDs are not in the expected order (4, 0, 2, 6... rather than 0, 2, 4, 6). Either the table is being built wrong, or the wrong core is coming up as the boot processor.

The kernel fix I posted is still needed in the general case, but what I describe above suggests you may want to revisit the BIOS and see if it's really doing what you intend it to do.
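For anyone who wants to pull the acpi_id/lapic_id pairs out of a saved console log to compare table orderings across BIOS versions, a sketch like the following may help. It assumes the log lines have exactly the shape shown in the excerpt above; the struct and function names are hypothetical, and this is a log-scraping helper, not how the kernel reads the MADT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one "ACPI: LAPIC (...)" boot-log line. */
struct lapic_log_entry_sketch {
    unsigned int acpi_id;
    unsigned int lapic_id;
    bool enabled;
};

/*
 * Parse one boot-log line of the form
 *   ACPI: LAPIC (acpi_id[0x01] lapic_id[0x04] enabled)
 * Returns true and fills *out on a match, false for any other line
 * (for example the LAPIC_NMI lines, which have a different layout).
 */
static bool parse_lapic_log_line_sketch(const char *line,
                                        struct lapic_log_entry_sketch *out)
{
    unsigned int acpi_id = 0, lapic_id = 0;
    char state[16] = "";

    assert(line != NULL && out != NULL);
    if (sscanf(line, "ACPI: LAPIC (acpi_id[0x%x] lapic_id[0x%x] %15[a-z])",
               &acpi_id, &lapic_id, state) != 3)
        return false;

    out->acpi_id = acpi_id;
    out->lapic_id = lapic_id;
    out->enabled = (strcmp(state, "enabled") == 0);
    return true;
}
```

Feeding the eight lines from the excerpt through this in order gives the lapic_id sequence 4, 0, 2, 6, 1, 3, 5, 7 against acpi_id 1 through 8, which makes the mismatch easy to spot.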
FWIW: I have just updated my xw8400 with a newer motherboard and a newer BIOS, 0.22. I now see this exact issue on that system as well. Prior to upgrading I was able to install and use the system. Here is some additional information that is captured on the serial output.

===========================================
Hello, world!
00:05 Timestamp: 00000000_BB3DBBF0 (982 ms)
CPUID = F62
Microcode update signature = 5
PCIExpInit()
PCIExpNorthInit()
Enable Parity Error on all Root Ports
Clear PCI_IRQ_PIN on rootports 0,5,6,7
Set high BUS numbers for the unused PCIEX rootports 5,6,7
Init rootports 2,3,4 to INTA
Set MPS, MRRS, URREN, FERE, NFERE and MCH_CERE
BSU 0.52 - #5: Disable APIC EOI
BSU 0.52 - #4: BPRI Hang when running CPU memoryLocks
MCH Erratum 26 - FSB enters infinite retry when running specific test
b0:d16:f0[.]Proc Enable Register: ISS suggestion & selftest
PCIExpHardcodeDownstreamBus
PCIExpSouthInit()
PCIExpSetPowerLimits()
Hardware State:
Power Management Registers:
-ESB2 PM Registers:
PM1_CNT=0000 SLP_TYP=0 S0
PM1_STS= 0900 PRBTNOR PWRBTN
PM1_EN= 0000
STS&EN|R=0900 PRBTNOR PWRBTN
GPE0_STS=FFFF0000 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
GPE0_EN= 00000000
STS&EN|R=00000000
-SIO PM Registers:
PME_STS= 00 PME_EN= 01 STS&EN=00 PME=0
PME_STS1=18 PME_EN1=00 STS&EN=00
PME_STS2=44 PME_EN2=19 STS&EN=00
PME_STS3=21 PME_EN3=00 STS&EN=00
PME_STS4=0B PME_EN4=30 STS&EN=00
PME_STS5=C0 PME_EN5=00 STS&EN=00
PME_STS6=60 PME_EN6=00 STS&EN=00
PME_STS7=09 PME_EN7=40 STS&EN=00
PME_STS8=31 PME_EN8=40 STS&EN=00
PME_STS9=00 PME_EN9=00 STS&EN=00
Time: 04:40:48 01/27/06 Alarm: 00:00:00
ESB2 GPIO Data:
LPC_GPIOUSE = FF0C39C3
LPC_GPIOIOSEL = F400FFFF
LPC_GPIOLVL = F60F0000
LPC_GPIINV = 00002040
LPC_GPIOUSE_2 = 00000107
LPC_GPIOIOSEL_2 = 00000300
LPC_GPIOLVL_2 = 00010300
Initializing memory
00:09 Timestamp: 00000003_A9430D0C (4918 ms)
Setting up stack in memory
Setting up cache
00:09 Timestamp: 00000003_B23BCFA8 (4965 ms)
Returning to real mode
00:36 Timestamp: 00000003_E83D75E8 (5249 ms)
00:36 Timestamp: 00000004_208565B4 (5544 ms)
00:39 Timestamp: 00000007_46FC49B4 (9776 ms)
00:99 Timestamp: 00000007_F63E35E8 (10696 ms)
00:25 Timestamp: 00000008_F63A63B4 (12040 ms)
00:AF Timestamp: 00000009_9710AF48 (12884 ms)
00:AF Timestamp: 00000009_9BE161FC (12909 ms)
00:AF Timestamp: 00000009_A04661FC (12932 ms)
00:26 Timestamp: 00000009_B0699C7C (13017 ms)
00:26 Timestamp: 00000009_B6A2D8E8 (13049 ms)
00:26 Timestamp: 00000009_BC42B9C8 (13079 ms)
00:26 Timestamp: 00000009_C0A776A8 (13102 ms)
00:26 Timestamp: 00000009_C59698C8 (13128 ms)
00:26 Timestamp: 00000009_CA1EA0E8 (13152 ms)
00:26 Timestamp: 00000009_EB3CBCFC (13325 ms)
00:26 Timestamp: 00000009_EFA381B4 (13349 ms)
00:26 Timestamp: 00000009_F40A3CE8 (13372 ms)
00:26 Timestamp: 00000009_F86D3FFC (13395 ms)
00:26 Timestamp: 0000000A_041774B4 (13456 ms)
00:26 Timestamp: 0000000A_087C97E8 (13479 ms)
00:2B Timestamp: 0000000A_134A36C8 (13536 ms)
00:2B Timestamp: 00000021_E8BA677C (45556 ms)

Hello, world!
00:05 Timestamp: 00000000_BE0B1614 (997 ms)
CPUID = F62
Microcode update signature = 5
PCIExpInit()
PCIExpNorthInit()
Enable Parity Error on all Root Ports
Clear PCI_IRQ_PIN on rootports 0,5,6,7
Set high BUS numbers for the unused PCIEX rootports 5,6,7
Init rootports 2,3,4 to INTA
Set MPS, MRRS, URREN, FERE, NFERE and MCH_CERE
BSU 0.52 - #5: Disable APIC EOI
BSU 0.52 - #4: BPRI Hang when running CPU memoryLocks
MCH Erratum 26 - FSB enters infinite retry when running specific test
b0:d16:f0[.]Proc Enable Register: ISS suggestion & selftest
PCIExpHardcodeDownstreamBus
PCIExpSouthInit()
PCIExpSetPowerLimits()
Hardware State:
Power Management Registers:
-ESB2 PM Registers:
PM1_CNT=0000 SLP_TYP=0 S0
PM1_STS= 0000
PM1_EN= 0000
STS&EN|R=0000
GPE0_STS=F7FF0000 31 30 29 28 26 25 24 23 22 21 20 19 18 17 16
GPE0_EN= 00000000
STS&EN|R=00000000
-SIO PM Registers:
PME_STS= 00 PME_EN= 00 STS&EN=00 PME=0
PME_STS1=18 PME_EN1=00 STS&EN=00
PME_STS2=00 PME_EN2=19 STS&EN=00
PME_STS3=01 PME_EN3=00 STS&EN=00
PME_STS4=03 PME_EN4=30 STS&EN=00
PME_STS5=00 PME_EN5=00 STS&EN=00
PME_STS6=60 PME_EN6=00 STS&EN=00
PME_STS7=08 PME_EN7=40 STS&EN=00
PME_STS8=28 PME_EN8=40 STS&EN=00
PME_STS9=00 PME_EN9=00 STS&EN=00
Time: 04:42:38 01/27/06 Alarm: 00:00:00
ESB2 GPIO Data:
LPC_GPIOUSE = FF0C39C3
LPC_GPIOIOSEL = F400FFFF
LPC_GPIOLVL = F60F0000
LPC_GPIINV = 00002040
LPC_GPIOUSE_2 = 00000107
LPC_GPIOIOSEL_2 = 00000300
LPC_GPIOLVL_2 = 00010300
Initializing memory
00:09 Timestamp: 00000003_AB30C954 (4928 ms)
Setting up stack in memory
Setting up cache
00:09 Timestamp: 00000003_B4277FC8 (4975 ms)
Returning to real mode
00:36 Timestamp: 00000003_EA1E1E30 (5259 ms)
00:36 Timestamp: 00000004_225C0530 (5554 ms)
00:39 Timestamp: 00000007_44E3A990 (9765 ms)
00:99 Timestamp: 00000007_F3882FB0 (10682 ms)
00:25 Timestamp: 00000008_F28F3210 (12020 ms)
00:AF Timestamp: 00000009_8DDE05D0 (12835 ms)
00:AF Timestamp: 00000009_92B5BAD8 (12861 ms)
00:AF Timestamp: 00000009_971E6FB0 (12884 ms)
00:26 Timestamp: 00000009_A73A28B0 (12968 ms)
00:26 Timestamp: 00000009_AD6F6844 (13001 ms)
00:26 Timestamp: 00000009_B30F2350 (13031 ms)
00:26 Timestamp: 00000009_B7700158 (13054 ms)
00:26 Timestamp: 00000009_BC163264 (13078 ms)
00:26 Timestamp: 00000009_C09E2944 (13102 ms)
00:26 Timestamp: 00000009_D4132310 (13204 ms)
00:26 Timestamp: 00000009_D8721B90 (13227 ms)
00:26 Timestamp: 00000009_DCD6FDE4 (13250 ms)
00:26 Timestamp: 00000009_E13F6CE4 (13273 ms)
00:26 Timestamp: 00000009_ED71E944 (13337 ms)
00:26 Timestamp: 00000009_F1D8F764 (13360 ms)
00:2B Timestamp: 00000009_FC68FAE4 (13416 ms)
00:2B Timestamp: 0000001C_E2EA7264 (38808 ms)
HP's latest findings:

1- Install RHEL4U3-x86_64 on an xw8400 with an FX540 or FX1400 and 16GB of RAM. After the install completes, boot the RHEL4U3-x86_64 UP kernel with "mem=16G". The system boots in the usual amount of time. Now boot the UP kernel without the mem kernel option and the system is extremely slow to boot (it appears hung). What is the difference (to the kernel) between 16G of physical memory and mem=16G?

2- Try the same thing at install time, e.g. "boot: linux mem=16G", and the system panics:

...
RAMDISK: Couldn't find a valid RAM disk image starting at 0
EXT2-fs: Unable to read superblock
isofs_fill_super: bread failed, dev=md1, iso_blknum=16, block=32
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown block(9,1)
From the HP BIOS engineer: The mapping looks correct to me. You have ranges 0-3.25G and 4-16.75G allocated to RAM, for a total of 3.25+12.75=16G. It looks like remapping is off on the xw9300.
From HP: As stated in the initial summary - if I switch from the FX540 card to the FX1300 card, the xw8400 w/ 16G comes up quickly and normally. The e820 map when the FX1300 is present looks a bit different:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000bffde800 (usable)
 BIOS-e820: 00000000bffde800 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000440000000 (usable)
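As a sanity check on the map above, the usable ranges can be added up to see how much RAM the BIOS is actually handing to the kernel. The sketch below hard-codes the eight entries from this comment and treats the second address on each line as an exclusive end, which is an assumption about the log format; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum e820_type_sketch { E820_SKETCH_USABLE, E820_SKETCH_RESERVED };

/* One line of the BIOS-e820 map; `end` is assumed to be exclusive. */
struct e820_range_sketch {
    uint64_t start;
    uint64_t end;
    enum e820_type_sketch type;
};

/* The map reported with the FX1300 card, copied from the comment above. */
static const struct e820_range_sketch fx1300_map_sketch[] = {
    { 0x0000000000000000ULL, 0x000000000009fc00ULL, E820_SKETCH_USABLE },
    { 0x000000000009fc00ULL, 0x00000000000a0000ULL, E820_SKETCH_RESERVED },
    { 0x00000000000e8000ULL, 0x0000000000100000ULL, E820_SKETCH_RESERVED },
    { 0x0000000000100000ULL, 0x00000000bffde800ULL, E820_SKETCH_USABLE },
    { 0x00000000bffde800ULL, 0x00000000c0000000ULL, E820_SKETCH_RESERVED },
    { 0x00000000e0000000ULL, 0x00000000f0000000ULL, E820_SKETCH_RESERVED },
    { 0x00000000fec00000ULL, 0x0000000100000000ULL, E820_SKETCH_RESERVED },
    { 0x0000000100000000ULL, 0x0000000440000000ULL, E820_SKETCH_USABLE },
};

/* Sum the sizes of all usable ranges in a map. */
static uint64_t e820_usable_bytes_sketch(const struct e820_range_sketch *map,
                                         size_t count)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        assert(map[i].end >= map[i].start); /* malformed range otherwise */
        if (map[i].type == E820_SKETCH_USABLE)
            total += map[i].end - map[i].start;
    }
    return total;
}
```

For this map the usable total comes out just under 16 GiB: roughly 3 GiB below the 0xc0000000 hole plus exactly 13 GiB remapped above 4 GiB.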
I have just installed the 0.25 BIOS that I received from Jeff Burell. Both RHEL3 U7 x86_64 and RHEL4 x86_64 installed successfully. I believe that using the latest BIOS, 0.25, is a valid workaround. Note that the xw8400 and the xw6400 require different BIOS packages; they are no longer the same image.

Also, I believe that we should open a new bug for Comment #29. We have a system here (Jeff Needle) with the same issue: booting a machine with 16 GB, the system crawls. If you add mem=16G the system operates normally.
Created attachment 125960 [details] xw6400 BIOS update
Created attachment 125961 [details] xw8400 BIOS update
I applied the BIOS update from Comment 36 to the xw8400 we have here in Westford, and promptly bricked the system. Any idea why this would have happened? Could we obtain an actual flash image to program into the ROM? We might have access to a ROM burner here...
The fix for the initial issue was posted to rhkernel-list and is pending for inclusion in a future update, so I am changing the bug status to POST. As for the issue from Comment 29 onward, that has been filed as a separate issue, Bug 181349. That issue has been determined to be an HP BIOS bug and has been closed as NOTABUG.
Created attachment 126584 [details] HP xw8400 system BIOS update package, version 0.28
To use this package, copy the file to DOS-bootable media (e.g. floppy, USB key). Reboot the system and allow it to boot to DOS from that media. Once in DOS, simply execute this file to update the system BIOS image to version 0.28.
This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 4.4 release. Engineering resources have been assigned and barring unforeseen circumstances, Red Hat intends to include this item in the 4.4 release.
Upgrading from the .27 to the .35 version of the BIOS caused my xw6400 system to stop booting RHEL4 x86_64. Below is the last thing I see printed to the console.

CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU0: Physical Processor ID: 0
CPU0: Processor Core ID: 0
CPU0: Initial APIC ID: 0

Going back to BIOS .27 allows the system to boot again. Perhaps someone at HP could let me know what version of the BIOS we should be running?
-Jeff
FYI: Jeff Burrell (HP) sent me version 0.32 for the xw6400. Loading this worked.
committed in stream U4 build 35.4. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html