Hardware: "IBM iSeries x345" server, with "SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)". There are 6 slots for (removable) disks; all 6 are in use.
Software: Any kernel since version 2.6.17.
Description of problem: When writing to any disk with target #2-#6 (i.e., any disk except the first one), the SCSI bus seems to freeze completely. The kernel cannot reset it, but reports reset success on the console. Writes to the first disk are OK. All reads (from any disk) are OK. Both kernel and kernel-smp are affected. The problem first appeared with the first update to a 2.6.17 kernel. Under the previous kernels (any <=2.6.16-1.2133, since RH7.3 time) all was OK.
Additional info: I can run any tests needed (patches, recompiling kernels, etc.).
Created attachment 133336 [details] lspci output
Created attachment 133337 [details] lsmod output
It seems that "short writes" are OK, but writing a file larger than about 50MB always causes the freeze. The 2.6.16 fusion code ported to 2.6.17 is OK! Therefore the bug is in the newest MPT drivers, version 3.03.09. (The previous version, 3.03.07, which is used in kernel-2.6.16, works OK in both 2.6.16 and 2.6.17.) The drivers involved are: mptbase, mptscsih, mptspi, mptctl
The latest version I can find on the Internet is here: ftp://ftp.lsil.com/HostAdapterDrivers/linux/Fusion-MPT/fusion.tar.bz2 and has version 3.03.07. (lsil.com seems to be the official LSI Logic site.) There is also a 3.03.08 in: ftp://ftp.lsil.com/HostAdapterDrivers/linux/Fusion-MPT/christoph/fusion.tar.bz2 which seems to be development code. Where does 2.6.17's 3.03.09 come from? At first sight, mptlinux-3.03.09 looks like code that is not yet stable, as it is not present on the official LSI Logic ftp site... Browsing the Internet, I've also found mentions of a 3.03.10 version. I will try to test these versions too...
3.03.08 is BAD! I.e., the problem first appeared in mptlinux-3.03.08.
The kernel-2.6.16 variant of "3.03.07" and the "vanilla" upstream "3.03.07" from ftp://ftp.lsil.com/HostAdapterDrivers/linux/Fusion-MPT/fusion.tar.bz2 differ. And the upstream variant is BAD too. Only the code from the 2.6.16 kernel source is OK.
Fortunately, "the first disk" above just had a lower speed (80MB/s against 160MB/s for the other 5 disks). When I set 160MB/s for the first disk, the problem appears with it too. Setting 80MB/s solves the problem (I hope) completely. IMHO it is some timing issue.
I'm also getting scsi errors. I'm trying FC6 test3 on a dell precision 690 workstation, x86_64 processor, etc. After running the system for a while (it varies considerably how long it lasts) I start getting errors like mptscsih: ioc0: task abort and eventually any attempt to access the disk causes the program to freeze. I'm going to double check this, but I don't think I ever had this problem using the kernel that originally shipped with FC6 test3, but I've definitely had it with both of the updates in the yum repository.
MPT upstream says that the 2.6.16-->2.6.17 difference is that the system's scsi_transport_spi driver is now used by MPT, whereas in 2.6.16 MPT used its own code... For comment #8: Garrett, could you decrease the I/O speed of your disks and check whether your issues still exist? (Normally this can be done before boot, in the system's SCSI BIOS settings.)
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
I've posted a note at bug # 208033 where I found kernel command line options that seem to help (fingers crossed) with my workstation using an FC 6 test release.
Hardware: "IBM eServer x225" server, with "SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)" I would also get "mptscsih: ioc0: task abort" until the machine would eventually freeze with 2.6.17-1.2187_FC5. Setting my drive speed in the scsi bios to 80MB/s (down from 160MB/s) solved this. I just tried 2.6.18-1.2200.fc5, and I got a kernel panic saying bad ext3 superblock and /dev and /proc could not be mounted. Rebooting back to 2.6.17-1.2187_FC5 (again at 80MB/s) works fine. For me at least, 2.6.18-1.2200.fc5 does not solve the problem, so I'd like to see this reopened. Thanks.
Kevin, could you try the workaround described here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208033#c10 ?
> I just tried 2.6.18-1.2200.fc5, and I got a kernel panic
Did that happen even at 80MB/s?
Kevin, are you using RAID on the box that won't boot 2200? If so, you'll need to rebuild the initrd with this command - mkinitrd --with=raid456 /boot/initrd-2.6.18-1.2200.fc5.img 2.6.18-1.2200.fc5 It should then boot.
DJ>mkinitrd --with=raid456 /boot/initrd-2.6.18-1.2200.fc5.img 2.6.18-1.2200.fc5 Dave, I am using software raid 5 on the box, and you were right with the initrd rebuild for 2200. I no longer get the kernel panic. Thanks! However, I still start seeing "mptscsih: ioc0: task abort" on my console at boot time. Eventually (after 3-4 hours), the screen is full of these messages, the keyboard is unresponsive, and I have to power cycle. DB>Could you try a work-around, described here: DB>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208033#c10 DB>Had it happened even with 80MB/s ??... Dmitry, sorry to report the same problem exists for me with the workaround on both 2.6.17-1.2187 and 2.6.18-1.2200. This is only happening at 160MB/s. When I switch to 80MB/s, both 2.6.17 and 2.6.18 appear fine with or without the work-around you listed. Is there anything else I can provide to help debug this? Thanks.
Created attachment 139036 [details] /var/log/messages from failure at 160MBs This is /var/log/messages for 2.6.18-1.2200 with scsi bios setting disks to 160MB/s with workaround: Kernel command line: ro root=/dev/md0 rhgb quiet mptscsih="width:0 factor:0x0A"
Created attachment 139037 [details] /var/log/messages from success at 80MBs This is /var/log/messages for 2.6.18-1.2200 with scsi bios setting disks to 80MB/s with workaround: Kernel command line: ro root=/dev/md0 rhgb quiet mptscsih="width:0 factor:0x0A"
Unfortunately, the initial issue still exists with the latest kernel-smp-2.6.18-1.2239.fc5. (The test is to copy a big enough file (an avi video) from one disk to another.) The difference is that the kernel now loops on "attempt to kill scsi task... SUCCESS... attempt to kill scsi task... SUCCESS..." etc. Previously, the kernel froze completely. Additionally, more time is required to trigger the bug (about 1-2 minutes instead of a few seconds). The solution is still to decrease the SCSI I/O rate in the BIOS (to 80MB/s or lower). 160MB/s is still bad. Compared with the working case (where the test copy takes just a few seconds to complete, even with a manual "sync" afterwards), under 160MB/s the hardware copy process seems to take much longer (1-2 minutes) and then fails as described above (message loop on the console). No idea what to do next...
Do you know what speed these drives can negotiate at? E.g., are they U320, U160, U80, or less? Could you send me the inquiry data? You can obtain it with the sg_inq tool from Douglas Gilbert's site: http://sg.torque.net/sg/sg3_utils.html#mozTocId746876 I need both hex and verbose.
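The write test described above can be approximated without a video file. The following is a minimal sketch, assuming only that a sustained write somewhat larger than about 50MB followed by a sync is what triggers the problem; the function name and the use of zero-filled data are illustrative choices, not part of the original test.

```python
import os
import time


def timed_write(path, size_mb, chunk_mb=1):
    """Write size_mb MiB of zero bytes to path, force them to disk,
    and return the elapsed time in seconds."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        # fsync makes sure the data actually goes out over the SCSI bus
        # instead of sitting in the page cache.
        os.fsync(f.fileno())
    return time.monotonic() - start
```

For example, `timed_write("/mnt/disk2/testfile", 200)` (a hypothetical mount point on one of the affected disks) should return within a few seconds on a healthy bus; a run that takes minutes or never returns would match the failure reported here.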
Okay, here's the result:
[root@grograman ~]# sg_inq -v /dev/sda
    inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
    inquiry cdb: 12 00 00 00 4a 00
  PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
  [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2
  SCCS=0 ACC=0 TGPS=0 3PC=0 Protect=0 BQue=0
  EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=0
  [RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
  Clocking=0x0 QAS=0 IUS=0
    length=74 (0x4a) Peripheral device type: disk
 Vendor identification: ATA
 Product identification: WDC WD2500JS-75N
 Product revision level: 2E04
    inquiry cdb: 12 01 00 00 fc 00
    inquiry: requested 252 bytes but got 8 bytes
    inquiry cdb: 12 01 80 00 fc 00
    inquiry: requested 252 bytes but got 24 bytes
 Unit serial number: WD-WCANK5485605
[root@grograman ~]# sg_inq -v /dev/sdb
    inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
    inquiry cdb: 12 00 00 00 4a 00
  PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
  [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2
  SCCS=0 ACC=0 TGPS=0 3PC=0 Protect=0 BQue=0
  EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=0
  [RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
  Clocking=0x0 QAS=0 IUS=0
    length=74 (0x4a) Peripheral device type: disk
 Vendor identification: ATA
 Product identification: WDC WD2500JS-75N
 Product revision level: 2E04
    inquiry cdb: 12 01 00 00 fc 00
    inquiry: requested 252 bytes but got 8 bytes
    inquiry cdb: 12 01 80 00 fc 00
    inquiry: requested 252 bytes but got 24 bytes
 Unit serial number: WD-WCANK5487714
[root@grograman ~]# sg_inq -h /dev/sda
standard INQUIRY:
 00  00 00 05 12 45 00 00 02 41 54 41 20 20 20 20 20  ....E...ATA     
 10  57 44 43 20 57 44 32 35 30 30 4a 53 2d 37 35 4e  WDC WD2500JS-75N
 20  32 45 30 34 20 20 20 20 20 57 44 2d 57 43 41 4e  2E04     WD-WCAN
 30  4b 35 34 38 35 36 30 35 00 00 00 77 1e a0 03 00  K5485605...w....
 40  03 20 0b fd 16 00 00 00 00 00                    . ........
[root@grograman ~]# sg_inq -h /dev/sdb
standard INQUIRY:
 00  00 00 05 12 45 00 00 02 41 54 41 20 20 20 20 20  ....E...ATA     
 10  57 44 43 20 57 44 32 35 30 30 4a 53 2d 37 35 4e  WDC WD2500JS-75N
 20  32 45 30 34 20 20 20 20 20 57 44 2d 57 43 41 4e  2E04     WD-WCAN
 30  4b 35 34 38 37 37 31 34 00 00 00 77 1e a0 03 00  K5487714...w....
 40  03 20 0b fd 16 00 00 00 00 00                    . ........
Also, did you see my post at bug # 208033 about how it seemed to be heat related?
A clarification on my last post: I can't tell if you (=Eric Moore) were specifically asking for my (=Garrett Mitchener) drive info or Dmitri's. I posted mine since more data ought to be helpful, but at this point I'm not sure that Dmitri and I are seeing the same problem.
Garrett, it would be much better to open a new bug (or use an existing one) for your case. My case seems to differ enough from yours (but the cause could be the same... ;) )
> are they U320, U160, U80, or less.
All 6 disks are U160
> Could send me the inquiry data?
> I need both hex and verbose.
What "sg_inq" options exactly? Anyway, "sg_inq -v /dev/sdb":
    inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
    inquiry cdb: 12 00 00 00 a4 00
  PQual=0 Device_type=0 RMB=0 version=0x03 [SPC]
  [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=1 Resp_data_format=2
  SCCS=0 ACC=0 TGPS=0 3PC=0 Protect=0 BQue=0
  EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
  [RelAdr=0] WBus16=1 Sync=1 Linked=1 [TranDis=1] CmdQue=1
  Clocking=0x3 QAS=1 IUS=1
    length=164 (0xa4) Peripheral device type: disk
 Vendor identification: IBM-ESXS
 Product identification: ST336732LC FN
 Product revision level: B84G
    inquiry cdb: 12 01 00 00 fc 00
    inquiry: requested 252 bytes but got 17 bytes
    inquiry cdb: 12 01 80 00 fc 00
    inquiry: requested 252 bytes but got 24 bytes
 Unit serial number: 3ET0WV0200007316BALV
"sg_inq -h /dev/sdb":
 00  00 00 03 12 9f 00 01 3e 49 42 4d 2d 45 53 58 53  .......>IBM-ESXS
 10  53 54 33 33 36 37 33 32 4c 43 20 20 20 20 46 4e  ST336732LC    FN
 20  42 38 34 47 33 45 54 30 57 56 30 32 30 37 33 30  B84G3ET0WV020730
 30  42 38 34 38 20 20 20 20 0f 00 00 00 00 00 00 00  B848    ........
 40  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 50  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 60  00 00 31 30 30 30 30 32 32 31 30 00 30 30 30 31  ..100002210.0001
 70  32 32 30 36 50 35 37 37 38 20 20 20 20 20 48 32  2206P5778     H2
 80  33 33 34 33 20 20 20 20 31 39 4b 31 36 38 34 20  3343    19K1684 
 90  20 20 00 00 48 32 33 33 34 33 20 20 20 20 00 00    ..H23343    ..
 a0  00 00 00 00
All 6 disks are the same type.
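To cross-check the verbose output against the hex dump, the SPI-related capability bits can be decoded by hand. This is a small sketch, assuming the standard INQUIRY layout from SPC (Addr16 in byte 6; WBus16, Sync and CmdQue in byte 7; Clocking, QAS and IUS in byte 56). The function name is made up for illustration, and `SAMPLE` is simply the first 64 bytes of the dump above.

```python
def decode_spi_inquiry(data):
    """Extract the parallel-SCSI capability bits from a standard
    INQUIRY response given as a bytes object."""
    if len(data) < 57:
        raise ValueError("need at least 57 bytes to reach byte 56")
    b6, b7, b56 = data[6], data[7], data[56]
    return {
        "Addr16": b6 & 0x01,
        "WBus16": (b7 >> 5) & 0x01,
        "Sync": (b7 >> 4) & 0x01,
        "CmdQue": (b7 >> 1) & 0x01,
        "Clocking": (b56 >> 2) & 0x03,  # 0 = ST only, 1 = DT only, 3 = ST and DT
        "QAS": (b56 >> 1) & 0x01,
        "IUS": b56 & 0x01,
    }


# Offsets 0x00-0x3f of the "sg_inq -h /dev/sdb" dump for the ST336732LC.
SAMPLE = bytes.fromhex(
    "00 00 03 12 9f 00 01 3e 49 42 4d 2d 45 53 58 53 "
    "53 54 33 33 36 37 33 32 4c 43 20 20 20 20 46 4e "
    "42 38 34 47 33 45 54 30 57 56 30 32 30 37 33 30 "
    "42 38 34 38 20 20 20 20 0f 00 00 00 00 00 00 00"
)
```

Calling `decode_spi_inquiry(SAMPLE)` gives Clocking=3, QAS=1 and IUS=1 (byte 56 is 0x0f), in agreement with the `sg_inq -v` output.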
Garrett Mitchener - your issue is unrelated to this issue. You have a SAS controller, not an SPI (SCSI Parallel Interface) controller. Ultra 320, Ultra 160, etc. apply only to SPI controllers. You have a SATA drive, and it probably only operates at one speed, which is 1 G/b. Dmitry - I will try to locate one of those drives so I can repro. Your inquiry data says you support QAS, which typically is there with U320 drives. Since that is enabled, domain validation will first be tried at U320 negotiation, which would fail since you say they are only U160. Did you say you tried U160 speeds? You would need to change the max_sync speed in the BIOS CU (configuration utility); it can also be done via sysfs. Can you get a SCSI bus trace? Have you checked to make sure you have proper LVD termination?
> Your inquiry data says you support QAS, which typically is there with U320
> drives.
Hmmm... All 6 drives are 36.4GB, and IMHO this is one of the first models of that size. All the more recent disks we use now (in other computers) are marked "Ultra 160" on the cover, but these 6 are not. The cover (of one of these 6 disks) reads: EServer xSeries IBM P/N: 19K1684 IBM FRU: 32PO732 36.4 GB, USCSI 12AUG2002 Part Number: 9T3016-025 Serial Number: 3ET0Q3RV Lot Number: A-01-0307-3 Model Number: ST336732LC 11S19K1684ZJ1NB0002JTR
> Did you say you tried U160 speeds?
I'm not sure... I tried 160MB/s, which was the manufacturer default and worked fine at least since "Fedora 1" time (i.e., kernel 2.4). Since kernel 2.6.17 (when mpt began to use the system spi code instead of its own), we are forced to decrease to 80MB/s (i.e. FAST-40).
> You would need to change the max_sync speed in BIOS CU(configuration utility)
Yes, of course
> also can be done via SYSFS.
Where exactly?..
> Can you get a scsi bus trace?
I can, if you describe how to do this.
> Have you checked to make sure your having proper LVD termination?
IMHO I'm sure, because this server (IBM x345) was not "built manually" and all the warranty seals are still intact on the cover.
Eric, in early August I wrote to you directly about this issue -- do you remember it, or should I reproduce the thread here?
The last entry is from last year. I was curious if the latest FC5 (or FC6) kernel worked without trouble or if this is still an outstanding problem?
Sorry, but I've not had time to respond, and I'm working on getting RHEL5 driver builds out next week. I'm curious whether someone still needs help on this?
The sysfs tunables are located here: /sys/class/spi_transport/target<H>:<C>:<T>/
The period (or factor) is defined below:
factor:0x08 Ultra320 (160 Mega-transfers / second) (6.25 ns)
factor:0x09 Ultra160 ( 80 Mega-transfers / second) (12.5 ns)
factor:0x0A Ultra2   ( 40 Mega-transfers / second) (25 ns)
factor:0x0B Ultra2   ( 40 Mega-transfers / second) (30.3 ns)
factor:0x0C Ultra    ( 20 Mega-transfers / second) (50 ns)
factor:0x19 FAST     ( 10 Mega-transfers / second)
factor:0x32 SCSI     (  5 Mega-transfers / second)
factor:0xFF 5 Mega-transfers/second and asynchronous
Hmmm... With kernel-smp-2.6.19-1.2288.2.1.fc5, the issue seems to have disappeared. Eric, I'm curious which changes between 2.6.18 and 2.6.19 made this OK now?
I'm happy to see that we can close this issue. Can you diff scsi_transport_spi.c between those two kernels? It's probably a fix that I worked on with James Bottomley this past year in the spi transport layer. The transport layer was starting the max-speed test at too high a negotiation rate and was not taking into account the inquiry data that tells what the max speed should be; all it was doing was taking the values that the BIOS was returning, which would be U320 speeds.
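The factor table above relates directly to the MB/s figures used earlier in this bug. As a rough sketch, the lookup below copies the table and converts mega-transfers per second to bus bandwidth, assuming a wide (16-bit) bus moves 2 bytes per transfer. The helper names are invented for illustration, and only the directory path is taken from the comment; no particular attribute file is assumed.

```python
# factor -> (name, mega-transfers per second, period in ns or None),
# copied from the table in the comment above. 0xFF (asynchronous) is left out.
PERIOD_FACTORS = {
    0x08: ("Ultra320", 160, 6.25),
    0x09: ("Ultra160", 80, 12.5),
    0x0A: ("Ultra2", 40, 25.0),
    0x0B: ("Ultra2", 40, 30.3),
    0x0C: ("Ultra", 20, 50.0),
    0x19: ("FAST", 10, None),
    0x32: ("SCSI", 5, None),
}


def bus_rate_mb_s(factor, wide=True):
    """Peak bus bandwidth in MB/s for a period factor.
    Assumes 2 bytes per transfer on a wide bus, 1 byte on a narrow one."""
    _name, mt_per_s, _period_ns = PERIOD_FACTORS[factor]
    return mt_per_s * (2 if wide else 1)


def target_dir(host, channel, target):
    """sysfs directory holding the SPI transport tunables for one target."""
    return f"/sys/class/spi_transport/target{host}:{channel}:{target}"
```

With this, the 80MB/s setting that works corresponds to `bus_rate_mb_s(0x0A)`, and the failing 160MB/s setting to `bus_rate_mb_s(0x09)`.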
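The idea behind that fix can be illustrated with a toy model. This is only a sketch of the principle described above, not the actual code in scsi_transport_spi.c: it assumes that DT clocking is needed for Ultra160 and faster (factor 0x09 and below) and that information units are needed for Ultra320 (factor 0x08), and the function name is hypothetical.

```python
def starting_factor(bios_factor, clocking, ius):
    """Pick the period factor at which to start the max-speed test.

    bios_factor: fastest factor reported by the BIOS/HBA settings
    clocking:    INQUIRY Clocking field (0 = ST only, 1 = DT only, 3 = ST and DT)
    ius:         INQUIRY IUS bit

    A larger factor means a slower rate, so the result is the slower of
    the BIOS value and the limit implied by the drive's inquiry data.
    """
    drive_limit = 0x08          # Ultra320, if the drive claims everything
    if not ius:
        drive_limit = 0x09      # no information units: Ultra160 at best
    if clocking not in (1, 3):
        drive_limit = 0x0A      # no DT clocking: Ultra2 at best
    return max(bios_factor, drive_limit)
```

A drive that reports ST-only clocking behind a BIOS that offers 0x08 would start at 0x0A in this model, instead of at the BIOS value.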
Eric - just FYI, RHEL 5 may or may not have inherited this fix depending on its actual inclusion into the upstream kernel. We may have to work offline to see if we need to request a backport into a future RHEL 5 minor release.
No, it didn't make it in. The fix occurred in 2.6.19, and the RHEL5 pull was on 2.6.18. We had discussions with Tom Coughlan this past November; the subject was "RHEL5 RFC - spi transport bug fix". I will send you the emails separately.