At first I thought I was reproducing bug 174973, but I think there might be another issue here. The last FC4 errata kernel that appears to work reliably for me is also 2.6.13-1.1532_FC4smp. None of the 2.6.14-based kernels are stable. 2.6.14-1.1656smp fails to boot completely. Unlike bug 174973, my 2.6.14-1.1644smp survives RAID initialization and begins booting, but late in the boot there are a bunch of disk errors and one of the disks drops off the RAID. The system continues to run off a single disk (I have two disks in a RAID-1 configuration).

After rebuilding my arrays, I then attempted to boot 2.6.14-1.1656smp with a serial console. This time 2.6.14-1.1656smp survived RAID initialization, but then one of the disks dropped out due to phantom errors again. My understanding of bug 174973 is that it's a timeout issue that prevents the kernel from seeing the RAID devices. Here, it appears that I do see the RAID devices, but then one of the disks fails later in the boot process:

Bootdata ok (command line is ro root=/dev/md1 console=ttyS0,9600 console=tty0)
Linux version 2.6.14-1.1656_FC4smp (bhcompile.redhat.com) (gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)) #1 SMP Thu Jan 5 22:26:33 EST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009b400 (usable)
BIOS-e820: 000000000009b400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000d6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
BIOS-e820: 000000007ff70000 - 000000007ff76000 (ACPI data)
BIOS-e820: 000000007ff76000 - 000000007ff80000 (ACPI NVS)
BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
Scanning NUMA topology in Northbridge 24
Number of nodes 2
Node 0 using interleaving mode 1/0
No NUMA configuration found
Faking a node at 0000000000000000-000000007ff70000
Bootmem setup node 0 0000000000000000-000000007ff70000
ACPI: PM-Timer IO Port: 0x8008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfe500000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfe500000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfe501000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xfe501000, GSI 28-31
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7ec00000)
Checking aperture...
CPU 0: aperture @ 0 size 32 MB
No AGP bridge found
Built 1 zonelists
Kernel command line: ro root=/dev/md1 console=ttyS0,9600 console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 1403.212 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2055212k/2096576k available (2335k kernel code, 40960k reserved, 1389k data, 236k init)
Calibrating delay using timer specific routine.. 2812.38 BogoMIPS (lpj=5624776)
Security Framework v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0(1) -> Node 0 -> Core 0
mtrr: v2.0 (20020519)
Using local APIC timer interrupts.
Detected 12.528 MHz APIC timer.
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 2806.58 BogoMIPS (lpj=5613176)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 1(1) -> Node 0 -> Core 0
AMD Opteron(tm) Processor 240 stepping 01
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff -3 cycles, maxerr 928 cycles)
Brought up 2 CPUs
Disabling vsyscall due to use of PM timer
time.c: Using PM based timekeeping.
testing NMI watchdog ... OK.
checking if image is initramfs... it is
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Subsystem revision 20050916
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 5 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 *5 10 11)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 5 *10 11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 5 10 *11)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI-DMA: Disabling IOMMU.
pnp: 00:04: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:04: ioport range 0x1100-0x117f has been reserved
pnp: 00:04: ioport range 0x1180-0x11ff has been reserved
PCI: Bridge: 0000:00:06.0
IO window: 2000-2fff
MEM window: fd000000-fe0fffff
PREFETCH window: 88000000-880fffff
PCI: Bridge: 0000:00:0a.0
IO window: disabled.
MEM window: fe100000-fe1fffff
PREFETCH window: 88100000-881fffff
PCI: Bridge: 0000:00:0b.0
IO window: 3000-3fff
MEM window: fe200000-fe2fffff
PREFETCH window: 88200000-882fffff
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
audit: initializing netlink socket (disabled)
audit(1138525891.648:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key A42C93F7DA961759
- User ID: Red Hat, Inc. (Kernel Module GPG key)
PCI: MSI quirk detected. pci_msi_quirk set.
PCI: MSI quirk detected. pci_msi_quirk set.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU1 (power states: C1[C1])
Real Time Clock Driver v1.12
Linux agpgart interface v0.101 (c) Dave Jones
PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
ide0: BM-DMA at 0x1020-0x1027, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0x1028-0x102f, BIOS settings: hdc:DMA, hdd:pio
hda: SAMSUNG SP1213N, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: LITE-ON DVDRW LDW-811S, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 1024KiB
hda: 234493056 sectors (120060 MB) w/8192KiB Cache, CHS=16383/255/63, UDMA(100)
hda: cache flushes supported
hda: hda1
hdc: ATAPI 40X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.2 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 3.39
NET: Registered protocol family 2
input: AT Translated Set 2 keyboard on isa0060/serio0
IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
TCP established hash table entries: 131072 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
powernow-k8: Power state transitions not supported
powernow-k8: Power state transitions not supported
Freeing unused kernel memory: 236k freed
Write protecting the kernel read-only data: 824k
SCSI subsystem initialized
ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 29 (level, low) -> IRQ 169
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11
<Adaptec 29320 Ultra320 SCSI adapter>
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
ACPI: PCI Interrupt 0000:03:01.1[B] -> GSI 30 (level, low) -> IRQ 177
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11
<Adaptec 29320 Ultra320 SCSI adapter>
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs
Vendor: SEAGATE Model: ST336607LW Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
target1:0:0: asynchronous.
scsi1:A:0:0: Tagged Queuing enabled. Depth 4
target1:0:0: Beginning Domain Validation
target1:0:0: wide asynchronous.
target1:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM WRFLOW PCOMP (6.25 ns, offset 63)
target1:0:0: Ending Domain Validation
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
Attached scsi disk sda at scsi1, channel 0, id 0, lun 0
Vendor: SEAGATE Model: ST336607LW Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
target1:0:1: asynchronous.
scsi1:A:1:0: Tagged Queuing enabled. Depth 4
target1:0:1: Beginning Domain Validation
target1:0:1: wide asynchronous.
target1:0:1: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM WRFLOW PCOMP (6.25 ns, offset 63)
target1:0:1: Ending Domain Validation
SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sdb: drive cache: write back
sdb: sdb1 sdb2 sdb3
Attached scsi disk sdb at scsi1, channel 0, id 1, lun 0
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md: adding sdb3 ...
md: sdb2 has different UUID to sdb3
md: sdb1 has different UUID to sdb3
md: adding sda3 ...
md: sda2 has different UUID to sdb3
md: sda1 has different UUID to sdb3
md: created md2
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md2 active with 2 out of 2 mirrors
md: considering sdb2 ...
md: adding sdb2 ...
md: sdb1 has different UUID to sdb2
md: adding sda2 ...
md: sda1 has different UUID to sdb2
md: created md1
md: bind<sda2>
md: bind<sdb2>
md: running: <sdb2><sda2>
raid1: raid set md1 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md0
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
raid1: Disk failure on sdb2, disabling device.
raid1: Disk failure on sdb1, disabling device.
raid1: Disk failure on sdb3, disabling device.
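For anyone comparing console captures across kernel versions, the md/raid1 events in a log like the one above can be pulled out mechanically. The following is only a rough sketch, not part of the original report: the function name summarize_raid_events and the shape of its return value are made up for illustration, and the patterns assume the messages are worded exactly as in this 2.6.14 log (other kernel versions may word them differently).

```python
import re

# One alternation so events are seen in log order. The message wording is
# copied from the boot log above; it is an assumption that other kernels
# print the same text.
EVENT_RE = re.compile(
    r"md: created (?P<created>md\d+)"
    r"|md: bind<(?P<bound>\w+)>"
    r"|raid1: raid set (?P<active>md\d+) active with (?P<up>\d+) out of (?P<total>\d+) mirrors"
    r"|raid1: Disk failure on (?P<failed>\w+), disabling device"
)


def summarize_raid_events(log_text):
    """Return (arrays, failed) for a console log (hypothetical helper).

    arrays maps an md device name to its bound members and the
    (active, total) mirror count reported at assembly time.
    failed lists the partitions raid1 later reported as failed.
    """
    arrays = {}
    failed = []
    current = None  # array most recently announced by "md: created"
    for m in EVENT_RE.finditer(log_text):
        if m.group("created"):
            current = m.group("created")
            arrays[current] = {"members": [], "mirrors": None}
        elif m.group("bound") and current is not None:
            arrays[current]["members"].append(m.group("bound"))
        elif m.group("active"):
            entry = arrays.setdefault(
                m.group("active"), {"members": [], "mirrors": None}
            )
            entry["mirrors"] = (int(m.group("up")), int(m.group("total")))
        elif m.group("failed"):
            failed.append(m.group("failed"))
    return arrays, failed


if __name__ == "__main__":
    import sys

    # Usage: python summarize_raid.py < console.log
    arrays, failed = summarize_raid_events(sys.stdin.read())
    for name, info in sorted(arrays.items()):
        print(name, info["members"], info["mirrors"])
    print("failed:", failed)
```

Because it scans the whole text with finditer rather than line by line, it works whether or not the capture kept its line breaks. On the log above it should show three arrays assembled with both mirrors, followed by failures on sdb2, sdb1 and sdb3.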
This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.
This bug still exists in the kernel update. However, I managed to capture some good diagnostics this time, where "good diagnostics" means "the ugliest boot log I've seen in some time." The disks are initially detected and RAID gets initialized. But soon thereafter aic7xxx begins reporting "scsi1: transmission error". The kernel valiantly tries to continue booting, with aic7xxx complaining every couple of seconds, and it manages to make quite a bit of progress before giving up and panicking. The last kernel that works reliably on this hardware is still 2.6.13. Attaching boot log.
Created attachment 124170 [details] 2.6.15 boot log. The final panic should probably also be looked at, but the root cause is all the preceding SCSI exceptions. This hardware works fine on all kernel revs up to and including the 2.6.13 update.
Created attachment 124172 [details] Output of lspci
Created attachment 124173 [details] Output of dmidecode
Created attachment 127181 [details] 2.6.16 boot log. This bug still exists in 2.6.16-1.2069_FC4. Serial console log attached. NOTE: I first tried booting with only "noacpi pci=noacpi". The kernel initially continued to boot past the failure point, but then crashed with a "BUG: spinlock recursion". Thinking that "noacpi pci=noacpi" would suppress this bug, I attached a serial console to capture what appeared to be a different bug. Well, with the serial console, at 9600 bps, the kernel crashed with SCSI errors again. So "noacpi pci=noacpi" helps, but the bug still exists.
This bug appears to be fixed in 2.6.17.