| Summary: | mptbase: ioc0 abort on heavy IO | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | daryl herzmann <akrherz> |
| Component: | kernel | Assignee: | Tomas Henzl <thenzl> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.1 | CC: | orion |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2014-12-11 12:24:30 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

To update this, I installed the 4.28.00.00 Fusion MPT driver and haven't had any issues since (knock on wood!)

Hi, I got a comment via email with this:

Daryl,
I noticed this bz a bit late, but could you please retest with latest RHEL6.4?
Thanks, Tomas

but don't see it on the website currently. Anyway, are you asking me to uninstall the Fusion MPT stuff and use the stock that comes with the latest RHEL6.4 kernel?

(In reply to Daryl Herzmann from comment #4)
> Hi, I got a comment via email with this:
>
> Daryl,
> I noticed this bz a bit late, but could you please retest with latest
> RHEL6.4?
> Thanks, Tomas
>
> but don't see it on the website currently. Anyway, are you asking me to
> uninstall the Fusion MPT stuff and use the stock that comes with the latest
> RHEL6.4 kernel?

This is a funny glitch, some kind of "Mid-air collision". Yes, I'd like you to retest with our latest built-in driver. The RHEL driver should be more or less the same as the upstream version. So I expect you'll not see the problem any more, but if it remains we can work with LSI to find the cause.

(In reply to Tomas Henzl from comment #5)
> This is a funny glitch, some kind of "Mid-air collision". Yes, I'd like
> you to retest with our latest built-in driver. The RHEL driver should be more
> or less the same as the upstream version. So I expect you'll not see the
> problem any more, but if it remains we can work with LSI to find the cause.

Thanks. I'll do this immediately after the Labor Day weekend holiday. I'd rather not break something before going out of town!

(In reply to Daryl Herzmann from comment #6)
> Thanks. I'll do this immediately after the Labor Day weekend holiday. I'd
> rather not break something before going out of town!

I'm now running kernel 2.6.32-358.18.1.el6.x86_64 and

# modinfo mptbase
filename: /lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/message/fusion/mptbase.ko
version: 3.04.20
license: GPL
description: Fusion MPT base driver
author: LSI Corporation
srcversion: 875D33B8B7D4669E041798B
depends:
vermagic: 2.6.32-358.18.1.el6.x86_64 SMP mod_unload modversions
parm: mpt_msi_enable_spi: Enable MSI Support for SPI controllers (default=0) (int)
parm: mpt_msi_enable_fc: Enable MSI Support for FC controllers (default=0) (int)
parm: mpt_msi_enable_sas: Enable MSI Support for SAS controllers (default=0) (int)
parm: mpt_channel_mapping: Mapping id's to channels (default=0) (int)
parm: mpt_debug_level: debug level - refer to mptdebug.h - (default=0)
parm: mpt_fwfault_debug:Enable detection of Firmware fault and halt Firmware on fault - (default=0) (int)

Things look okay so far and no kernel messages have appeared to this point. I'll be watching!

Daryl, I'm closing the bz now, feel free to reopen when you see any issues with that controller.

I just hit this:
Oct 26 01:00:01 csdisk3 kernel: md: data-check of RAID array md3
Oct 26 01:00:01 csdisk3 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 26 01:00:01 csdisk3 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Oct 26 01:00:01 csdisk3 kernel: md: using 128k window, over a total of 1953511936k.
Oct 26 01:16:55 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptbase_reply
Oct 26 01:16:56 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
Oct 26 01:16:56 csdisk3 kernel: LSI Debug log info 31123000 for channel 0 id f
Oct 26 01:16:56 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
Oct 26 01:16:56 csdisk3 kernel: LSI Debug log info 31123000 for channel 0 id f
....
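For anyone triaging a similar flood of messages, a rough way to see whether the aborts cluster on a single device is to tally the "LSI Debug log info" lines per channel and id. The following is only a sketch: the function name `tally_targets` is made up for illustration, and it assumes the id is printed in hexadecimal (the value "f" above cannot be decimal, but that reading has not been confirmed against the driver source).

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: LSI Debug log info 31123000 for channel 0 id f"
# Assumption: both the log info word and the id are hexadecimal.
DEBUG_LINE = re.compile(
    r"LSI Debug log info (?P<loginfo>[0-9a-fA-F]+) "
    r"for channel (?P<channel>\d+) id (?P<target>[0-9a-fA-F]+)"
)


def tally_targets(lines):
    """Count debug messages per (channel, target id, loginfo) tuple."""
    counts = Counter()
    for line in lines:
        m = DEBUG_LINE.search(line)
        if m:
            key = (
                int(m.group("channel")),
                int(m.group("target"), 16),
                int(m.group("loginfo"), 16),
            )
            counts[key] += 1
    return counts


# Sample lines copied from the excerpt above; the md line is ignored.
sample = [
    "Oct 26 01:00:01 csdisk3 kernel: md: data-check of RAID array md3",
    "Oct 26 01:16:56 csdisk3 kernel: LSI Debug log info 31123000 for channel 0 id f",
    "Oct 26 01:16:56 csdisk3 kernel: LSI Debug log info 31123000 for channel 0 id f",
]

for (channel, target, loginfo), n in sorted(tally_targets(sample).items()):
    print(f"channel {channel} id {target} loginfo 0x{loginfo:08x}: {n} message(s)")
```

In practice the list would be replaced by the lines of /var/log/messages. If nearly all messages name one id, that device and its cabling are the first things to check.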
04:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
Firmware image's version is MPTFW-01.26.00.00-IE LSI Logic
x86 BIOS image's version is MPTBIOS-6.24.00.00 (2008.07.01)
This was with kernel-2.6.32-431.20.3.el6.x86_64. Now running kernel-2.6.32-431.29.2.el6.x86_64.
Boot messages from -431.20.3:
Fusion MPT base driver 3.04.20
Fusion MPT SAS Host driver 3.04.20
mptsas 0000:04:00.0: PCI INT A -> GSI 30 (level, low) -> IRQ 30
mptbase: ioc0: Initiating bringup
mptsas 0000:04:00.0: setting latency timer to 64
mptsas: ioc0: add expander: num_phys 38, sas_addr (0x500065b36789abff)
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 10, phy 0, sas_addr 0x500065b36789abe0
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 11, phy 1, sas_addr 0x500065b36789abe1
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 12, phy 2, sas_addr 0x500065b36789abe2
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 13, phy 3, sas_addr 0x500065b36789abe3
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 14, phy 4, sas_addr 0x500065b36789abe4
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 15, phy 5, sas_addr 0x500065b36789abe5
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 16, phy 6, sas_addr 0x500065b36789abe6
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 17, phy 7, sas_addr 0x500065b36789abe7
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 18, phy 8, sas_addr 0x500065b36789abe8
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 19, phy 9, sas_addr 0x500065b36789abe9
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 20, phy 10, sas_addr 0x500065b36789abea
mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 21, phy 11, sas_addr 0x500065b36789abeb
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 73, phy 20, sas_addr 0x500065b36789abfd
My card is an Intel SASUC8I branded one. Just updated to firmware version 01.33.190.00.

Unfortunately the exact meaning of the error bits in "log info 31123000" has not been made public by Avago (LSI). In the similar cases I have seen so far, it was mostly decoded as a CRC error or some other hardware-related issue: cables, dying disks, etc. Please check the hardware or retest on another server if possible.

Closing the bz, the log looks like a hw issue and no new information came in.

Thanks, I'll try to change some hardware. For the record, some later messages:
Dec 7 14:30:12 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptbase_reply
Dec 7 14:30:12 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000) cb_idx mptbase_reply
Dec 7 14:30:13 csdisk3 kernel: mptbase: ioc0: LogInfo(0x31110b00): Originator={PL}, Code={Reset}, SubCode(0x0b00) cb_idx mptscsih_io_done
Dec 7 14:30:13 csdisk3 kernel: LSI Debug log info 31110b00 for channel 0 id a
The system appeared to recover from this. Then:
Dec 14 01:00:01 csdisk3 kernel: md: data-check of RAID array md3
Dec 14 01:00:01 csdisk3 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Dec 14 01:00:01 csdisk3 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Dec 14 01:00:01 csdisk3 kernel: md: using 128k window, over a total of 1953511936k.
Dec 14 01:00:32 csdisk3 kernel: mptbase: ioc0: WARNING - IOC is in FAULT state (7815h)!!!
Dec 14 01:00:32 csdisk3 kernel: mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
Dec 14 01:00:32 csdisk3 kernel: mptbase: ioc0: Initiating recovery
Dec 14 01:00:32 csdisk3 kernel: mptbase: ioc0: WARNING - IOC is in FAULT state!!!
Dec 14 01:00:32 csdisk3 kernel: mptbase: ioc0: WARNING - FAULT code = 7815h
Dec 14 01:00:35 csdisk3 kernel: mptbase: ioc0: Recovered from IOC FAULT
Dec 14 01:10:36 csdisk3 kernel: mptbase: ioc0: ERROR - Doorbell INT timeout (count=299999), IntStatus=0!
Dec 14 01:10:36 csdisk3 kernel: mptbase: ioc0: ERROR - Handshake reply failure!
Dec 14 01:10:36 csdisk3 kernel: mptbase: ioc0: ERROR - Sending PortEnable failed(-1)!
Dec 14 01:10:36 csdisk3 kernel: mptbase: WARNING - (-4) Cannot recover ioc0, doorbell=0x2c000000
Dec 14 01:10:36 csdisk3 kernel: mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: failed
Dec 14 01:10:36 csdisk3 kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
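Although the vendor has not published what the subcodes mean, the driver's own output shows how the 32-bit LogInfo word splits into fields. The sketch below reproduces that split in Python. The bit layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit subcode) is my understanding of how mptbase decodes SAS log info and should be treated as an assumption. The name tables contain only the values that actually appear in this report's logs, and the function names are hypothetical.

```python
# Names taken only from the log lines in this report, for originator PL.
PL_CODES = {0x11: "Reset", 0x12: "Abort", 0x14: "IO Executed"}
ORIGINATORS = {1: "PL"}


def decode_loginfo(value):
    """Split a 32-bit LogInfo word into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # top nibble, 3 in every line here
        "originator": (value >> 24) & 0xF,  # 1 is printed as {PL}
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,          # meaning not publicly documented
    }


def describe(value):
    """Format a LogInfo word the way the kernel messages above show it."""
    f = decode_loginfo(value)
    orig = ORIGINATORS.get(f["originator"], "unknown(%d)" % f["originator"])
    code = PL_CODES.get(f["code"], "unknown(0x%02x)" % f["code"])
    return "LogInfo(0x%08x): Originator={%s}, Code={%s}, SubCode(0x%04x)" % (
        value, orig, code, f["subcode"],
    )


for word in (0x31123000, 0x31110B00, 0x31112000):
    print(describe(word))
```

This only recovers what the driver already prints. It does not explain the subcodes 0x3000, 0x0b00, or 0x2000, which still need the vendor to interpret.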
|
I have a Dell PowerEdge T410 that has a:

LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS
Firmware Version 00.25.47.00.06.22.03.00
Driver Version 3.04.18

with one 250 GB WD drive and five Samsung 2 TB HD204UI drives connected to it. The five Samsung drives are in a software RAID5.

The weekly cron'd RAID resync causes IO to lock up on the array, with thousands of messages spewed to /var/log/messages like so:

Jul 10 01:00:01 xxx kernel: md: data-check of RAID array md127
Jul 10 01:00:01 xxx kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jul 10 01:00:01 xxx kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jul 10 01:00:01 xxx kernel: md: using 128k window, over a total of 1953510400 blocks.
Jul 10 01:01:07 xxx kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptbase_reply
Jul 10 01:01:08 xxx kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
Jul 10 01:01:08 xxx kernel: LSI Debug log info 31123000 for channel 0 id 1
Jul 10 01:01:08 xxx kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done
Jul 10 01:01:08 xxx kernel: LSI Debug log info 31123000 for channel 0 id 1
Jul 10 01:01:08 xxx kernel: mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000) cb_idx mptscsih_io_done

Reviewing logs from this past week, I do see messages like this since the last issue:

Jul 8 19:11:28 xxx kernel: mptscsih: ioc0: attempting task abort! (sc=ffff8801eac7a0c0)
Jul 8 19:11:28 xxx kernel: sd 0:0:2:0: [sdc] CDB: Write(10): 2a 00 bc 26 0d 77 00 00 b8 00
Jul 8 19:11:28 xxx kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
Jul 8 19:11:28 xxx kernel: LSI Debug log info 31140000 for channel 0 id 1
Jul 8 19:11:28 xxx kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8801eac7a0c0) (sn=234180793)
Jul 8 19:11:28 xxx kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88032510d880)
[snip]
Jul 8 19:11:28 xxx kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880037243a80) (sn=234180878)

The system is running a fully updated RHEL6.1 (non fastrack). Thank you.