Description of problem:
While running I/O on DM multipath devices, we are seeing frequent path failures, which lead to unexpected I/O failover.

Environment:
RHEL 5.3 RC2
HP ProLiant and Integrity Blades
QLogic QMH2462 and Emulex LPe1105 using the inbox driver with DM multipath
Test: Hazard C8

Snippet of issue:
> Jan 14 14:40:34 RH53-IA64 kernel: sd 1:0:6:12: SCSI error: return code = 0x00020000
> Jan 14 14:40:34 RH53-IA64 kernel: end_request: I/O error, dev sduf, sector 97312554
> Jan 14 14:40:34 RH53-IA64 kernel: device-mapper: multipath: Failing path 66:624.
> Jan 14 14:40:34 RH53-IA64 kernel: sd 1:0:6:6: SCSI error: return code = 0x00020000
> Jan 14 14:40:34 RH53-IA64 kernel: end_request: I/O error, dev sdtz, sector 97555949
> Jan 14 14:40:34 RH53-IA64 kernel: device-mapper: multipath: Failing path 66:528.
> Jan 14 14:40:34 RH53-IA64 kernel: sd 1:0:6:6: SCSI error: return code = 0x00020000
> Jan 14 14:40:34 RH53-IA64 kernel: end_request: I/O error, dev sdtz, sector 97630204

Version-Release number of selected component (if applicable):
RHEL 5.3 RC2

How reproducible:
Occurs every time and continues as long as I/O is running.

Steps to Reproduce:
1. Run I/O with no perturbations (Hazard C8).
2. DM failed paths are reported in the messages log.

Actual results:
Unexpected path failures are seen during I/O.

Expected results:

Additional info:
Created attachment 329227 [details] Message file with failing path issue
Is this something that you do not see in RHEL 5.2? It looks like we get DID_BUS_BUSY, which, as you saw in the notes for bug 244967, is fast failed. You are using QLogic cards, right? If so, we should ask them whether the case you are hitting can return DID_TRANSPORT_DISRUPTED instead of DID_BUS_BUSY. QLogic will probably need you to run this with extended logging enabled and then send those logs so they can see exactly why DID_BUS_BUSY is returned.
Hi, we are seeing the issue on both QLogic and Emulex cards. The first log I posted was for QLogic; I will post a log for Emulex. I contacted QLogic and am waiting for their response. I notified Emulex of your comments, and their reply was: "This response does not apply to Emulex. We updated our driver to return DID_TRANSPORT_DISRUPTED in the case of a dropped frame anyway."
Created attachment 329248 [details] Emulex Message file for DM failed paths
Emulex has requested that I open a separate Bugzilla for their DM path failure issue; they have determined it is different from the QLogic issue. Can we use this one just for QLogic? From Emulex: "I reviewed the log file for the QLogic run. I am seeing frequent path failover due to DID_BUS_BUSY coming from QLogic, which is different from Emulex. I believe you need to open a new Bugzilla with the Emulex log. We are encountering an issue where the new DM retry logic does not retry in the case of an aborted command."
I opened Bugzilla 480394 for the Emulex DM path failure issue.
Thanks. In the Emulex log we see "rport-2:0-2: blocked FC remote port time out: saving binding", so if the rport is going to time out, changing qla2xxx to DID_TRANSPORT_DISRUPTED is not going to make a difference. It would only make a difference if we recovered before the rport timed out. You can probably ask QLogic to use the new values, but for this case it will not help. We probably need the extended logging info to see what is causing the problem in the first place.
Created attachment 329645 [details] QLogic serial console output with extended error logging enabled
I am seeing the same problem with a QLE2462 (firmware 4.03.02, driver 8.02.00-k5-rhel5.2-04), CentOS 5.2 (kernel 2.6.18-92.1.18), and a Xyratex 5412 storage controller.