Description of problem:
After installing a 5.2 distro and then upgrading to a 5.3 kernel (-116.el5), the disk seems to get corrupted and go bad. I believe this to be a problem with the updated SATA stack because we can install a 5.2 kernel reliably but not the 5.3 kernel.

This seems to occur reliably on hp-xw9300-01.rhts.bos.redhat.com with the chipset
IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

I'll attach the boot log showing the errors. Some relevant info from the boot:

Loading jbd.ko module
Loading ext3.ko module
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading scsi_transport_spi.ko module
Loading mptbase.ko module
Fusion MPT base driver 3.04.07
Copyright (c) 1999-2008 LSI Corporation
Loading mptscsih.ko module
Loading mptspi.ko module
Fusion MPT SPI Host driver 3.04.07
ACPI: PCI Interrupt 0001:61:06.0[A] -> GSI 30 (level, low) -> IRQ 209
mptbase: ioc0: Initiating bringup
ioc0: LSI53C1030 B2: Capabilities={Initiator,Target}
scsi0 : ioc0: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=209
Vendor: SEAGATE Model: ST336607LW Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
target0:0:0: Beginning Domain Validation
target0:0:0: Ending Domain Validation
target0:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RTI WRFLOW PCOMP (6.25 ns, offset 63)
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back w/ FUA
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back w/ FUA
sda: sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
ACPI: PCI Interrupt 0001:61:06.1[B] -> GSI 31 (level, low) -> IRQ 217
mptbase: ioc1: Initiating bringup
ioc1: LSI53C1030 B2: Capabilities={Initiator,Target}
scsi1 : ioc1: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217
Loading libata.ko module
Loading sata_nv.ko module
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [LSA0] -> GSI 21 (level, high) -> IRQ 225
scsi2 : sata_nv
scsi3 : sata_nv
ata1: SATA max UDMA/133 cmd 0x28d0 ctl 0x28f8 bmdma 0x28b0 irq 225
ata2: SATA max UDMA/133 cmd 0x28d8 ctl 0x28fc bmdma 0x28b8 irq 225
ata1: SATA link down (SStatus 0 SControl 300)
ata2: SATA link down (SStatus 0 SControl 300)
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 20
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LSA1] -> GSI 20 (level, high) -> IRQ 233
scsi4 : sata_nv
scsi5 : sata_nv
ata3: SATA max UDMA/133 cmd 0x28e0 ctl 0x2c00 bmdma 0x28c0 irq 233
ata4: SATA max UDMA/133 cmd 0x28e8 ctl 0x2c04 bmdma 0x28c8 irq 233
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
Loading dm-mod.ko module
<snip>
Starting atd: [ OK ]
Starting yum-updatesd: [ OK ]
Starting Avahi daemon... [ OK ]
Starting HAL daemon: [ OK ]
Starting RHTS testing: Running with correct RECIPEID.
09/18/08 23:23:37 recipeID:107981 start:
Collecting all rpm packages...
Sending rpm info to http://rhts.redhat.com/cgi-bin/rhts/scheduler_xmlrpc.cgi
resp = client.results.allRpms(recipeid, pkg_list)
922164:/distribution/install has already run..
/usr/bin/rhts-test-runner.sh: line 91: [: missing `]'
/mnt/tests/distribution/kernelinstall /
end_request: I/O error, dev sda, sector 30880077
Buffer I/O error on device dm-0, logical block 3833856
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833857
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833858
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833859
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 31931349
Buffer I/O error on device dm-0, logical block 3965265
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 35440997
end_request: I/O error, dev sda, sector 31931373
Buffer I/O error on device dm-0, logical block 3965268
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 31931429
Buffer I/O error on device dm-0, logical block 3965275
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3965276
<snip>

Version-Release number of selected component (if applicable):
kernel-2.6.18-116.el5 has issues. Other kernels may have issues too, but the specific hardware wasn't chosen for testing.

How reproducible:
Very reliably

Steps to Reproduce:
1. install 5.2 distro
2. boot latest 5.3 kernel
3. dies before getting to a login prompt.

Actual results:

Expected results:

Additional info:
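A quick arithmetic check on the error lines in the log excerpt, offered only as a sketch: if dm-0 sits on a partition of sda, each failing sda sector should differ from the matching dm-0 logical block by a constant offset. The sector and block numbers are copied from the log; the 4096-byte block size and the pairing of each end_request line with the first "Buffer I/O error" line after it are assumptions, and the function name is made up for illustration.

```python
# Pairs copied from the log excerpt: (sda sector from an end_request line,
# first dm-0 logical block reported right after it).
PAIRS = [
    (30880077, 3833856),
    (31931349, 3965265),
    (31931373, 3965268),
    (31931429, 3965275),
]

SECTOR_SIZE = 512   # from "512-byte hdwr sectors" in the boot log
BLOCK_SIZE = 4096   # assumption: dm-0 logical blocks are 4 KiB


def sector_offsets(pairs, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return, per pair, the sda sector minus the dm-0 block in sector units."""
    sectors_per_block = block_size // sector_size
    return [sector - block * sectors_per_block for sector, block in pairs]


offsets = sector_offsets(PAIRS)
print(offsets)
print("constant offset:", len(set(offsets)) == 1)
```

Under those assumptions, an identical offset for every pair is consistent with the dm-0 write errors being the same failures that end_request reports against sda.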
Created attachment 317378
the boot output

The attachment will say it is a binary file, which is sort of true because it has some binary chars inside. less/more or any text editor is sufficient for opening and reading it.
Don,

It looks like sda is controlled by the MPT Fusion driver. Are you only seeing problems with this particular setup? I am sure that I tested the sata_nv driver with this chipset. Are there other sata_nv systems failing?

Thanks,
David
David,

I don't know of any other sata_nv platforms, but yes, this is the only system that is failing so far. Your best bet would probably be to reserve the system from rhts, put the latest 5.3 kernel on there, and see what happens.

Do you think this can be an MPT Fusion driver problem? I saw sata_nv and just assumed SATA.

-Don
AFAICS, you are getting corruption on the disk attached to the MPT Fusion, which has nothing to do with SATA or sata_nv.
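One way to confirm this from the boot output is to match the SCSI host number in the "Attached scsi disk" line against the "scsiN :" host lines. The snippet below is a minimal sketch of that lookup over lines copied from the boot log; the function name and regular expressions are illustrative, not part of any tool used in this bug.

```python
import re

# Lines copied from the boot log in the description.
LOG = """\
scsi0 : ioc0: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=209
sd 0:0:0:0: Attached scsi disk sda
scsi1 : ioc1: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217
scsi2 : sata_nv
scsi3 : sata_nv
scsi4 : sata_nv
scsi5 : sata_nv
"""


def disk_to_host(log):
    """Map each attached disk name to the description of its SCSI host."""
    hosts = {}
    for m in re.finditer(r"^scsi(\d+) : (.+)$", log, re.MULTILINE):
        hosts[int(m.group(1))] = m.group(2)

    disks = {}
    # The first field of host:channel:target:lun is the SCSI host number.
    pattern = r"^sd (\d+):\d+:\d+:\d+: Attached scsi disk (\w+)$"
    for m in re.finditer(pattern, log, re.MULTILINE):
        disks[m.group(2)] = hosts.get(int(m.group(1)), "unknown")
    return disks


print(disk_to_host(LOG))
```

Here sda resolves to host 0, the LSI53C1030 (ioc0) brought up by mptbase, while the sata_nv hosts (scsi2 through scsi5) have no disk attached; the log shows all four SATA links down.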
This happened again with Don Zickus's test kernel last night.

http://rhts.redhat.com/testlogs/30555/110385/938631/4460756-test_log--distribution-kernelinstall-EXTERNALWATCHDOG.log
This bug seems to have been fixed with either bug 463206 or bug 463709. Maybe Mike can clue me in, as they were his scsi bugs.

-Don
This is a dup of 463709 where we did not retry QUEUE_FULLs enough.

*** This bug has been marked as a duplicate of bug 463709 ***
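To illustrate why too few QUEUE_FULL retries can surface as I/O errors, here is a toy model only. It is not the kernel's SCSI midlayer or the fix from bug 463709; the class, function, and constant names are hypothetical. A device that is temporarily busy answers QUEUE_FULL a few times before accepting the command, and the submitter gives up once its retry budget is exhausted.

```python
QUEUE_FULL = "QUEUE_FULL"
GOOD = "GOOD"


class BusyDevice:
    """Toy device: reports QUEUE_FULL a fixed number of times, then succeeds."""

    def __init__(self, busy_replies):
        self.busy_replies = busy_replies

    def submit(self):
        if self.busy_replies > 0:
            self.busy_replies -= 1
            return QUEUE_FULL
        return GOOD


def submit_with_retries(device, max_retries):
    """Return True if the command completes within max_retries retries."""
    for _ in range(max_retries + 1):  # first attempt plus the retries
        if device.submit() == GOOD:
            return True
    return False  # a caller would report this as an I/O error


# The same transient condition fails or succeeds depending on the budget.
print(submit_with_retries(BusyDevice(busy_replies=5), max_retries=3))
print(submit_with_retries(BusyDevice(busy_replies=5), max_retries=8))
```

In this model the device is healthy in both runs; only the retry budget decides whether the transient QUEUE_FULL condition is reported as a failure.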