Bug 463202

Summary: rhts: MPT fusion controller corrupts disk
Product: Red Hat Enterprise Linux 5 Reporter: Don Zickus <dzickus>
Component: kernel Assignee: David Milburn <dmilburn>
Status: CLOSED DUPLICATE QA Contact: Martin Jenner <mjenner>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3 CC: dzickus, jburke, jgarzik, mchristi
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: Regression
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-10-13 17:09:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
the boot output none

Description Don Zickus 2008-09-22 15:31:49 UTC
Description of problem:

After installing a 5.2 distro and then upgrading to a 5.3 kernel (-116.el5), the disk seems to get corrupted and go bad.  I believe this to be a problem with the updated SATA stack because we can install a 5.2 kernel reliably but not the 5.3 kernel.

This seems to occur reliably on hp-xw9300-01.rhts.bos.redhat.com
with the chipset
IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

I'll attach the boot log showing the errors.

Some relevant info from the boot:

Loading jbd.ko module
Loading ext3.ko module
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading scsi_transport_spi.ko module
Loading mptbase.ko module
Fusion MPT base driver 3.04.07
Copyright (c) 1999-2008 LSI Corporation
Loading mptscsih.ko module
Loading mptspi.ko module
Fusion MPT SPI Host driver 3.04.07
ACPI: PCI Interrupt 0001:61:06.0[A] -> GSI 30 (level, low) -> IRQ 209
mptbase: ioc0: Initiating bringup
ioc0: LSI53C1030 B2: Capabilities={Initiator,Target}
scsi0 : ioc0: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=209
  Vendor: SEAGATE   Model: ST336607LW        Rev: 0007
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target0:0:0: Beginning Domain Validation
 target0:0:0: Ending Domain Validation
 target0:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RTI WRFLOW PCOMP (6.25 ns, offset 63)
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back w/ FUA
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back w/ FUA
 sda: sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
ACPI: PCI Interrupt 0001:61:06.1[B] -> GSI 31 (level, low) -> IRQ 217
mptbase: ioc1: Initiating bringup
ioc1: LSI53C1030 B2: Capabilities={Initiator,Target}
scsi1 : ioc1: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217
Loading libata.ko module
Loading sata_nv.ko module
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 21
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [LSA0] -> GSI 21 (level, high) -> IRQ 225
scsi2 : sata_nv
scsi3 : sata_nv
ata1: SATA max UDMA/133 cmd 0x28d0 ctl 0x28f8 bmdma 0x28b0 irq 225
ata2: SATA max UDMA/133 cmd 0x28d8 ctl 0x28fc bmdma 0x28b8 irq 225
ata1: SATA link down (SStatus 0 SControl 300)
ata2: SATA link down (SStatus 0 SControl 300)
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 20
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LSA1] -> GSI 20 (level, high) -> IRQ 233
scsi4 : sata_nv
scsi5 : sata_nv
ata3: SATA max UDMA/133 cmd 0x28e0 ctl 0x2c00 bmdma 0x28c0 irq 233
ata4: SATA max UDMA/133 cmd 0x28e8 ctl 0x2c04 bmdma 0x28c8 irq 233
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
Loading dm-mod.ko module

<snip>

Starting atd: [  OK  ]
Starting yum-updatesd: [  OK  ]
Starting Avahi daemon... [  OK  ]
Starting HAL daemon: [  OK  ]
Starting RHTS testing: Running with correct RECIPEID.
09/18/08 23:23:37  recipeID:107981 start:
Collecting all rpm packages...
Sending rpm info to http://rhts.redhat.com/cgi-bin/rhts/scheduler_xmlrpc.cgi
resp = client.results.allRpms(recipeid, pkg_list)
922164:/distribution/install has already run..
/usr/bin/rhts-test-runner.sh: line 91: [: missing `]'
/mnt/tests/distribution/kernelinstall /
end_request: I/O error, dev sda, sector 30880077
Buffer I/O error on device dm-0, logical block 3833856
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833857
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833858
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3833859
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 31931349
Buffer I/O error on device dm-0, logical block 3965265
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 35440997
end_request: I/O error, dev sda, sector 31931373
Buffer I/O error on device dm-0, logical block 3965268
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 31931429
Buffer I/O error on device dm-0, logical block 3965275
lost page write due to I/O error on dm-0
Buffer I/O error on device dm-0, logical block 3965276

<snip>
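
As a cross-check on the excerpt above, the sda sector numbers and the dm-0 logical block numbers can be compared arithmetically. The sketch below pairs each sda error with the dm-0 error that follows it and computes the difference in 512-byte sectors. The 512-byte sector size comes from the boot log; the 4 KiB block size for dm-0 is an assumption, and the helper name sector_block_offsets is made up for this illustration.

```python
import re

# Subset of the kernel messages quoted above.
LOG = """\
end_request: I/O error, dev sda, sector 30880077
Buffer I/O error on device dm-0, logical block 3833856
lost page write due to I/O error on dm-0
end_request: I/O error, dev sda, sector 31931349
Buffer I/O error on device dm-0, logical block 3965265
end_request: I/O error, dev sda, sector 31931373
Buffer I/O error on device dm-0, logical block 3965268
end_request: I/O error, dev sda, sector 31931429
Buffer I/O error on device dm-0, logical block 3965275
"""

SECTOR_SIZE = 512   # from "512-byte hdwr sectors" in the boot log
BLOCK_SIZE = 4096   # ASSUMPTION: dm-0 uses 4 KiB logical blocks
SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE


def sector_block_offsets(log):
    """Pair each sda sector error with the next dm-0 block error.

    Returns, for each pair, the sda sector minus the start of the dm-0
    block expressed in sectors.
    """
    offsets = []
    pending_sector = None
    for line in log.splitlines():
        m = re.search(r"dev sda, sector (\d+)", line)
        if m:
            pending_sector = int(m.group(1))
            continue
        m = re.search(r"device dm-0, logical block (\d+)", line)
        if m and pending_sector is not None:
            block = int(m.group(1))
            offsets.append(pending_sector - block * SECTORS_PER_BLOCK)
            pending_sector = None
    return offsets


offsets = sector_block_offsets(LOG)
print(offsets)  # [209229, 209229, 209229, 209229]
```

Under that block-size assumption every pair differs by the same 209229 sectors, which suggests the dm-0 write failures are the same failures reported against sda, seen through a mapping that starts at a fixed offset on the disk.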

Version-Release number of selected component (if applicable):
kernel-2.6.18-116.el5 has issues.  Other kernels may have issues too, but the specific hardware wasn't chosen for testing.


How reproducible:
Very reliably

Steps to Reproduce:
1. install 5.2 distro
2. boot latest 5.3 kernel
3. dies before getting to a login prompt.
  
Actual results:


Expected results:


Additional info:

Comment 1 Don Zickus 2008-09-22 15:34:59 UTC
Created attachment 317378 [details]
the boot output

The attachment will say it is a binary file which is sort of true because it has some binary chars inside.  Using less/more or any other text editor is sufficient for opening and reading.

Comment 2 David Milburn 2008-09-22 22:02:35 UTC
Don,

It looks like sda is controlled by the mpt fusion driver, are you only
seeing problems with this particular setup? I am sure that I tested the
sata_nv driver with this chipset, are there other sata_nv systems failing?

Thanks,
David

Comment 3 Don Zickus 2008-09-23 13:45:47 UTC
David,

I don't know of any other sata_nv platforms, but yes this is the only system that is failing so far.  Your best bet would probably be to reserve the system from rhts and put the latest 5.3 kernel on there and see what happens.  Do you think this can be an mpt fusion driver problem? I saw sata_nv and just assumed sata.

-Don

Comment 4 Jeff Garzik 2008-09-23 13:50:50 UTC
AFAICS, you are getting corruption on the disk attached to the MPT Fusion, which has nothing to do with SATA or sata_nv.
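
One way to check which driver owns sda from the boot output alone is to match the SCSI host number in the "Attached scsi disk" line against the "scsiN :" host banners. The following is a rough sketch over lines copied from the description; the function name hosts_and_disks is hypothetical, and the regular expressions only cover the message formats that appear in this log.

```python
import re

# Host banners and disk attachment lines copied from the boot log above.
BOOT_LOG = """\
scsi0 : ioc0: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=209
sd 0:0:0:0: Attached scsi disk sda
scsi1 : ioc1: LSI53C1030 B2, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=217
scsi2 : sata_nv
scsi3 : sata_nv
ata1: SATA link down (SStatus 0 SControl 300)
ata2: SATA link down (SStatus 0 SControl 300)
scsi4 : sata_nv
scsi5 : sata_nv
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
"""


def hosts_and_disks(log):
    """Return (hosts, disks) parsed from kernel boot messages.

    hosts maps a SCSI host number to its banner text;
    disks maps a disk name to the SCSI host number it attached on.
    """
    hosts = {}
    disks = {}
    for line in log.splitlines():
        m = re.match(r"scsi(\d+) : (.+)", line)
        if m:
            hosts[int(m.group(1))] = m.group(2)
            continue
        m = re.match(r"sd (\d+):\d+:\d+:\d+: Attached scsi disk (\w+)", line)
        if m:
            disks[m.group(2)] = int(m.group(1))
    return hosts, disks


hosts, disks = hosts_and_disks(BOOT_LOG)
for disk, host in disks.items():
    print(f"{disk} is on scsi{host}: {hosts[host]}")
```

For this log the only attached disk, sda, sits on scsi0, whose banner is the LSI53C1030 (MPT Fusion) controller, while all four sata_nv ports report the link as down.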

Comment 5 Jeff Burke 2008-09-26 16:22:59 UTC
This happened again with Don Zickus's test kernel last night.

http://rhts.redhat.com/testlogs/30555/110385/938631/4460756-test_log--distribution-kernelinstall-EXTERNALWATCHDOG.log

Comment 6 Don Zickus 2008-10-06 20:46:08 UTC
This bug seems to have been fixed with either bug 463206 or bug 463709.

Maybe Mike can clue me in, as they were his scsi bugs.

-Don

Comment 7 Mike Christie 2008-10-13 17:09:07 UTC
This is a dup of 463709 where we did not retry QUEUE_FULLs enough.

*** This bug has been marked as a duplicate of bug 463709 ***
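
To illustrate why too few QUEUE_FULL retries can show up as I/O errors, here is a toy model, not the kernel's code and not the fix from bug 463709. It assumes a hypothetical device that reports a full queue for its first few submissions and a caller with a fixed retry budget; the class and function names and the retry counts are invented for the example. The status value 0x28 is the SCSI TASK SET FULL status, historically called QUEUE FULL.

```python
GOOD = 0x00
TASK_SET_FULL = 0x28  # SCSI status historically called QUEUE FULL


class BusyDevice:
    """Hypothetical device that reports a full queue for the first N submissions."""

    def __init__(self, full_responses):
        self.full_responses = full_responses

    def submit(self):
        if self.full_responses > 0:
            self.full_responses -= 1
            return TASK_SET_FULL
        return GOOD


def issue_command(device, max_retries):
    """Return True on success, False if the retry budget runs out.

    A False result stands in for the command being failed upward as an
    I/O error.
    """
    for _attempt in range(max_retries + 1):
        if device.submit() == GOOD:
            return True
    return False


# The same transient condition, with a small and a larger retry budget.
small_budget = issue_command(BusyDevice(full_responses=5), max_retries=3)
large_budget = issue_command(BusyDevice(full_responses=5), max_retries=10)
print(small_budget, large_budget)  # False True
```

In this model the device is healthy in both runs; only the retry budget decides whether a temporary queue-full condition is reported as a failed write.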