Bug 516683
| Summary: | spurious DM multipath failures due to incorrect SCSI err handling on full scsi tag queue. | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Mark Goodwin <mgoodwin> | ||||||
| Component: | kernel | Assignee: | Tom Coughlan <coughlan> | ||||||
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | urgent | ||||||||
| Version: | 5.4 | CC: | akarlsso, bmarzins, bmr, cward, mchristi, rdassen, tao | ||||||
| Target Milestone: | rc | Keywords: | Regression | ||||||
| Target Release: | 5.5 | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2009-10-20 15:45:31 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 499522, 525215, 533192 | ||||||||
Description
Mark Goodwin
2009-08-11 05:47:40 UTC
Created attachment 357281 [details]
Customer analysis
Created attachment 360041 [details]
RHEL5.3 based patch to make sd_max_retries tunable via sysfs.
Creates /sys/module/sd_mod/parameters/sd_max_retries with default value of 5.
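A minimal sketch of how the tunable could be inspected or changed from userspace, assuming a kernel built with the attached patch. The path comes from the attachment description above. The helper names `read_param` and `write_param` are hypothetical, and whether the parameter is writable at runtime depends on the permissions the patch assigns, which are not stated here.

```python
from pathlib import Path

# Path named in the attachment description; it only exists on a kernel
# carrying the patch, so callers must handle its absence.
SD_MAX_RETRIES = Path("/sys/module/sd_mod/parameters/sd_max_retries")


def read_param(path=SD_MAX_RETRIES):
    """Return the parameter as an int, or None if the file does not exist."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def write_param(value, path=SD_MAX_RETRIES):
    """Write a new retry count (assumes the parameter is writable, needs root)."""
    if value < 0:
        raise ValueError("retry count must be non-negative")
    Path(path).write_text(f"{value}\n")


if __name__ == "__main__":
    current = read_param()
    if current is None:
        print("sd_max_retries not present (unpatched kernel?)")
    else:
        print(f"sd_max_retries = {current}")
```

On an unpatched kernel the script reports that the parameter is missing rather than failing.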
Mike (and QLogic engineers):
The HBAs are stock ISP2432 cards:
04:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 02)
04:00.1 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 02)
04:00.0 0c04: 1077:2432 (rev 02)
Subsystem: 1077:0138
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin A routed to IRQ 169
Region 0: I/O ports at 3000 [size=256]
Region 1: Memory at dc300000 (64-bit, non-prefetchable) [size=16K]
[virtual] Expansion ROM at d1000000 [disabled] [size=256K]
Capabilities: [44] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [4c] Express Endpoint IRQ 0
Device: Supported: MaxPayload 1024 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <4us, L1 <1us
Device: AtnBtn+ AtnInd+ PwrInd+
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 2048 bytes
Link: Supported Speed 2.5Gb/s, Width x4, ASPM L0s, Port 0
Link: Latency L0s <4us, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch+
Link: Speed 2.5Gb/s, Width x4
Capabilities: [64] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [74] Vital Product Data
Capabilities: [7c] MSI-X: Enable- Mask- TabSize=16
Vector table: BAR=1 offset=00002000
PBA: BAR=1 offset=00003000
Capabilities: [100] Advanced Error Reporting
Capabilities: [138] Power Budgeting
The driver reports the following during initialization:
QLogic Fibre Channel HBA Driver
ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 16 (level, low) -> IRQ 169
qla2xxx 0000:04:00.0: Found an ISP2432, irq 169, iobase 0xffffc2000001a000
qla2xxx 0000:04:00.0: Configuring PCI space...
PCI: Setting latency timer of device 0000:04:00.0 to 64
qla2xxx 0000:04:00.0: Configure NVRAM parameters...
qla2xxx 0000:04:00.0: Verifying loaded RISC code...
qla2xxx 0000:04:00.0: Allocated (64 KB) for EFT...
qla2xxx 0000:04:00.0: Allocated (1413 KB) for firmware dump...
scsi1 : qla2xxx
qla2xxx 0000:04:00.0:
QLogic Fibre Channel HBA Driver: 8.02.00.06.05.03-k
QLogic QLE2462 - PCI-Express to 4Gb FC, Dual Channel
ISP2432: PCIe (2.5Gb/s x4) @ 0000:04:00.0 hdma+, host#=1, fw=4.04.05 [IP] [Multi-ID] [84XX]
qla2xxx 0000:04:00.0: LIP reset occured (f8f7).
qla2xxx 0000:04:00.0: LIP occured (f8f7).
qla2xxx 0000:04:00.0: LIP reset occured (f7f7).
qla2xxx 0000:04:00.0: LOOP UP detected (4 Gbps).
GSI 21 sharing vector 0xD1 and IRQ 21
ACPI: PCI Interrupt 0000:04:00.1[B] -> GSI 17 (level, low) -> IRQ 209
qla2xxx 0000:04:00.1: Found an ISP2432, irq 209, iobase 0xffffc2000001c000
qla2xxx 0000:04:00.1: Configuring PCI space...
PCI: Setting latency timer of device 0000:04:00.1 to 64
qla2xxx 0000:04:00.1: Configure NVRAM parameters...
qla2xxx 0000:04:00.1: Verifying loaded RISC code...
Vendor: transtec Model: T6100F08R1-E Rev: 347B
Type: Direct-Access ANSI SCSI revision: 03
qla2xxx 0000:04:00.1: Allocated (64 KB) for EFT...
qla2xxx 0000:04:00.1: Allocated (1413 KB) for firmware dump...
We also have some details of the FC switches, SAN topology and RAID
controllers, if needed.
Thanks
-- Mark
Created attachment 364406 [details]
fix queue full handling
Here is a patch from QLogic that they believe should fix this. The driver was mishandling queue-full responses as underruns.
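To illustrate the kind of misclassification described, here is a simplified sketch. It is not the QLogic patch (that is in attachment 364406) and it is not driver code. The function `classify_completion` and its return labels are hypothetical. The status values are the standard SAM status codes, where TASK SET FULL (0x28) is the "queue full" response. A command rejected with TASK SET FULL transfers no data, so a check that looks only at the transferred length would see an underrun.

```python
# Standard SAM status codes (TASK SET FULL is the "queue full" response).
SAM_STAT_GOOD = 0x00
SAM_STAT_BUSY = 0x08
SAM_STAT_TASK_SET_FULL = 0x28


def classify_completion(scsi_status, requested_len, transferred_len):
    """Classify a completed command (hypothetical helper, illustration only).

    The SCSI status is examined before the byte counts. A queue-full
    rejection moves no data, so checking the lengths first would
    mislabel it as an underrun.
    """
    if scsi_status == SAM_STAT_TASK_SET_FULL:
        return "queue_full"  # retry later; not evidence of a failed path
    if scsi_status == SAM_STAT_BUSY:
        return "busy"
    if transferred_len < requested_len:
        return "underrun"  # genuine short transfer
    return "ok"


if __name__ == "__main__":
    # Queue-full rejection: zero bytes moved, but status says "retry".
    print(classify_completion(SAM_STAT_TASK_SET_FULL, 4096, 0))
    # Genuine short transfer with GOOD status.
    print(classify_completion(SAM_STAT_GOOD, 4096, 512))
```

Checking the status first keeps a transient queue-full condition from being reported upward as an I/O error, which is the kind of error that could cause multipath to fail a healthy path.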
> Qlogic is already sending the patch I attached with its RHEL 5.5 update.
Do you know if QLogic was able to reproduce the problem (and thus demonstrate
the fix)?
Thanks
-- Mark Goodwin
The patch in Comment #28 that QLogic believes will fix this problem is included in Bug 519447, the planned 5.5 driver update. I will close this BZ as a duplicate of 519447. Any test results that the customer can provide will be much appreciated.
*** This bug has been marked as a duplicate of bug 519447 ***