Bug 362371
Summary: | x86_64 largesmp kernel panics at boot time on DL585 | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Larry Woodman <lwoodman> |
Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | | |
Version: | 4.6 | CC: | ddomingo, jbaron, jburke, jrfuller, tao |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | RHBA-2007-0791 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2007-11-15 16:33:41 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 248673, 368411 | | |
Description
Larry Woodman
2007-11-01 18:59:57 UTC
I verified that the cause of this panic is linux-2.6.9-x8664-physflat.patch. If I remove this patch the system boots and runs fine; with this patch the system panics. This patch was introduced in 2.6.9-42.18 on Wed Oct 11 2006.

Larry Woodman

This could be a BIOS version issue:

-------------------------------------------------------------------------------
This patch addresses Bug 192760. The bug as reported was that the largesmp x86_64 kernel did not support more than 8 AMD processors. The real issue is that AMD supports a "physical flat" APIC mode for >8 processors.

We had not initially included this patch because I thought that clustered APIC mode would be sufficient for >8 processor support, and tests on a 12-way iWill box confirmed this. What I did not know is that AMD has deprecated clustered APIC mode in favor of physflat, and newer BIOSes only support the latter.

AMD has verified this patch on their in-house largesmp systems, and I tested it successfully on the iWill box.
-------------------------------------------------------------------------------

The failing systems have really old BIOS versions:

    # dmidecode 2.2
    SMBIOS 2.3 present.
    152 structures occupying 3941 bytes.
    Table at 0x000EC000.
    Handle 0x0000
    DMI type 0, 20 bytes.
    BIOS Information
    Vendor: HP
    Version: A01
    Release Date: 08/26/2005
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 2048 kB

I updated the BIOS (in stages) on dl585-03 from 2006.01.20 to the latest (2007.02.14). The system continues to panic on boot with the largesmp kernel. I also checked another dl585 with 8G of RAM and it booted fine. How much RAM is in dl585-02?

J

dl585-02 has 96GB.
Larry

Johnray, the kernel that does boot is dl585-04:/usr/src/redhat/RPMS/x86_64/kernel-largesmp-2.6.9-65.test.EL.x86_64.rpm

All I did was remove the linux-2.6.9-x8664-physflat.patch

Larry

I got a full capture of the error off of DL585-03

*****

    ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
    ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
    RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
    Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
    ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
    AMD8111: IDE controller at PCI slot 0000:00:04.1
    AMD8111: chipset revision 3
    AMD8111: not 100% native mode: will probe irqs later
    AMD8111: 0000:00:04.1 (rev 03) UDMA133 controller
    ide0: BM-DMA at 0x2000-0x2007, BIOS settings: hda:pio, hdb:pio
    hda: TSSTcorpCD-ROM TS-L162C, ATAPI CD/DVD-ROM drive
    Using cfq io scheduler
    ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
    ide2: I/O resource 0x3EE-0x3EE not free.
    ide2: ports already in use, skipping probe
    hda: ATAPI 24X CD-ROM drive, 96kB Cache, UDMA(33)
    Uniform CD-ROM driver Revision: 3.20
    ide-floppy driver 0.99.newide
    usbcore: registered new driver hiddev
    usbcore: registered new driver usbhid
    drivers/usb/input/hid-core.c: v2.0:USB HID core driver
    mice: PS/2 mouse device common for all mice
    input: AT Translated Set 2 keyboard on isa0060/serio0
    input: PS/2 Generic Mouse on isa0060/serio1
    md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
    NET: Registered protocol family 2
    IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
    TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
    TCP bind hash table entries: 262144 (order: 10, 4194304 bytes)
    TCP: Hash tables configured (established 262144 bind 262144)
    Initializing IPsec netlink socket
    NET: Registered protocol family 1
    NET: Registered protocol family 17
    powernow-k8: Found 4 AMD Opteron (tm) Processor 885 processors (8 cpu cores) (version 2.00.00-rhel4)
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    powernow-k8: 0 : fid 0x12 (2600 MHz), vid 0x8
    powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0xa
    powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xc
    powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xe
    powernow-k8: 4 : fid 0xa (1800 MHz), vid 0x10
    ACPI wakeup devices:
    ACPI: (supports S0 S4 S5)
    Freeing unused kernel memory: 212k freed
    Red Hat nash version 4.2.1.13 starting
    Mounted /proc filesystem
    Mounting sysfs
    Creating /dev
    Starting udev
    Loading scsi_mod.ko module
    SCSI subsystem initialized
    Loading sd_mod.ko module
    Loading ehci-hcd.ko module
    Loading ohci-hcd.ko module
    ACPI: PCI Interrupt 0000:01:00.0[D] -> GSI 19 (level, low) -> IRQ 169
    ohci_hcd 0000:01:00.0: OHCI Host Controller
    ohci_hcd 0000:01:00.0: irq 169, pci mem ffffff000002e000
    ohci_hcd 0000:01:00.0: new USB bus registered, assigned bus number 1
    hub 1-0:1.0: USB hub found
    hub 1-0:1.0: 3 ports detected
    ACPI: PCI Interrupt 0000:01:00.1[D] -> GSI 19 (level, low) -> IRQ 169
    ohci_hcd 0000:01:00.1: OHCI Host Controller
    ohci_hcd 0000:01:00.1: irq 169, pci mem ffffff0000030000
    ohci_hcd 0000:01:00.1: new USB bus registered, assigned bus number 2
    hub 2-0:1.0: USB hub found
    hub 2-0:1.0: 3 ports detected
    Loading uhci-hcd.ko module
    USB Universal Host Controller Interface driver v2.2
    Loading usb-storage.ko module
    Initializing USB Mass Storage driver...
    usbcore: registered new driver usb-storage
    USB Mass Storage support registered.
    Waiting for driver initialization.
    Loading cciss.ko module
    HP CISS Driver (v 2.6.16.RH1)
    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 18 (level, low) -> IRQ 193
    cciss: using DAC cycles
    blocks= 71122559 block_size= 512 heads= 255, sectors= 32, cylinders= 8716
    blocks= 71122559 block_size= 512 heads= 255, sectors= 32, cylinders= 8716
    blocks= 71122559 block_size= 512 heads= 255, sectors= 32, cylinders= 8716
    cciss/c0d0: p1 p2 p3 p4 < p5 p6 >
    blocks= 71122559 block_size= 512 heads= 255, sectors= 32, cylinders= 8716
    cciss/c0d1: p1 p2 p3
    Loading qla4xxx.QLogic iSCSI HBA Driver (ffffffffa007b000) ko module
    ACPI: PCI Interrupt 0000:05:0d.1[B] -> GSI 33 (level, low) -> IRQ 225
    qla4xxx 0000:05:0d.1: Found an ISP4022, irq 0, iobase 0xffffff0000036000
    Unable to handle kernel paging request at 0000000000006a80 RIP:
    <ffffffff8015eb2d>{__alloc_pages+117}
    PML4 7e2c9d067 PGD 37d8a067 PMD 0
    Oops: 0000 [1] SMP
    CPU 1
    Modules linked in: qla4xxx cciss usb_storage uhci_hcd ohci_hcd ehci_hcd sd_mod scsi_mod
    Pid: 580, comm: insmod Not tainted 2.6.9-65.ELlargesmp
    RIP: 0010:[<ffffffff8015eb2d>] <ffffffff8015eb2d>{__alloc_pages+117}
    RSP: 0000:00000107e2e2bd98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000040 RCX: 0000000000000220
    RDX: 0000000000006a80 RSI: 0000000000000006 RDI: 0000000000000220
    RBP: ffffffffffffffff R08: 00000000000927bf R09: 00000107e2ed03c8
    R10: ffffffff8031b1a0 R11: 0000ffff804061c0 R12: 0000000000000220
    R13: 0000000000006a80 R14: 0000000000000006 R15: 00000107e3f327f0
    FS: 0000000000000000(0000) GS:ffffffff80506080(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000006a80 CR3: 00000007e3f88000 CR4: 00000000000006e0
    Process insmod (pid: 580, threadinfo 00000107e2e2a000, task 00000107e3f327f0)
    Stack: 00000100f541aac8 0000000100004022 0000000000000000 ffffff0000036000
    00000100f57d4008 0000000000000040 ffffffffffffffff 00000100f541aa30
    0000000000022000 0000000000000220
    Call Trace:<ffffffff801210b5>{dma_alloc_pages+125}
    <ffffffff801212dc>{dma_alloc_coherent+97}
    <ffffffffa007f527>{:qla4xxx:qla4xxx_probe_adapter+1227}
    <ffffffff801f6660>{pci_device_probe+110}
    <ffffffff8024c9c9>{bus_match+57}
    <ffffffff8024cac7>{driver_attach+68}
    <ffffffff8024cde3>{bus_add_driver+143}
    <ffffffff801f63d0>{pci_register_driver+119}
    <ffffffffa0098022>{:qla4xxx:qla4xxx_module_init+34}
    <ffffffff801505d2>{sys_init_module+278}
    <ffffffff8011026a>{system_call+126}
    Code: 49 8b 7d 00 31 c0 48 85 ff 0f 84 04 03 00 00 48 89 f8 48 2b
    RIP <ffffffff8015eb2d>{__alloc_pages+117} RSP <00000107e2e2bd98>
    CR2: 0000000000006a80
    <0>Kernel panic - not syncing: Oops

I tested kernel-largesmp-2.6.9-65.test.EL.x86_64.rpm on dl585-03 and it does work.

J

This is a problem with memory interleaving/numa=off. Any DL585 with interleaving enabled and/or numa=off on the bootline will panic when booting the largesmp kernel because it expects NUMA. When the system is NUMA it works OK.
Larry

This patch fixes the problem by fixing the linux-2.6.9-x8664-physflat.patch:

    --- linux-2.6.9/arch/x86_64/pci/k8-bus.c.orig
    +++ linux-2.6.9/arch/x86_64/pci/k8-bus.c
    @@ -59,6 +59,10 @@ fill_mp_bus_to_cpumask(void)
     		}
     	}
     
    +	for (i = 0; i < 256; i++)
    +		if (cpus_empty(pci_bus_to_cpumask[i]))
    +			pci_bus_to_cpumask[i] = CPU_MASK_ALL;
    +
     	return 0;
     }

Larry Woodman

So just to be clear, it seems that the BIOS is a red herring, and the only issue is the NUMA setting, and this only impacts largesmp after 4u4? And largesmp never worked on Opterons >8 before 4.5?

Committed in stream U6 build 66. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

Martin tested this kernel on all combinations of numa=on/numa=off and memory interleaving enabled/disabled with the UP, SMP and LARGESMP kernels, and reported that everything worked fine.

Larry

I have tested the unpatched and the patched kernels on another DL585 G1 dual-core system. Here are the results:

Memory Interleaving ON in BIOS:
2.6.9-65.EL largesmp = FAILED
2.6.9-66.EL largesmp = Success

numa=off:
2.6.9-65.EL largesmp = FAILED
2.6.9-66.EL largesmp = Success

I also tested these two kernels on a DL385 G2 and *both* kernels successfully booted.

J

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html
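
For readers who want to see what the k8-bus.c fix quoted in this report does, here is a rough user-space sketch of the same loop. It is not kernel code: a 64-bit integer stands in for cpumask_t, and the names cpumask_model_t, CPU_MASK_ALL_MODEL, fill_empty_bus_masks and first_cpu_model are hypothetical, introduced only for this illustration. The report does not spell out why an empty mask ends in the __alloc_pages oops, so the sketch only shows that after the fill no bus is left without a usable CPU.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's cpumask_t: one bit per CPU. */
typedef uint64_t cpumask_model_t;

/* Hypothetical stand-in for CPU_MASK_ALL: every CPU bit set. */
#define CPU_MASK_ALL_MODEL UINT64_MAX

/*
 * Model of the loop added by the patch: any bus whose mask is empty
 * (no CPUs recorded for it) is given the mask of all CPUs.
 * Entries that already name some CPUs are left untouched.
 */
static void fill_empty_bus_masks(cpumask_model_t *table, int nr_buses)
{
    for (int i = 0; i < nr_buses; i++)
        if (table[i] == 0)
            table[i] = CPU_MASK_ALL_MODEL;
}

/*
 * Return the lowest CPU number present in the mask, or -1 if the mask
 * is empty. The -1 case is what the fill above is meant to rule out.
 */
static int first_cpu_model(cpumask_model_t mask)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & ((cpumask_model_t)1 << cpu))
            return cpu;
    return -1;
}
```

A caller would build a 256-entry table (the patch loops over 256 bus numbers), run fill_empty_bus_masks over it once, and from then on first_cpu_model never reports an empty mask for any bus.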