Our system (Sun Fire X4200) is connected through two Sun FC-AL 4 Gb/s cards (OEM version of the QLogic 2460) and two Brocade 200E switches to a Sun StorageTek 6340 storage array. After a while, and during high I/O activity, we experience disconnects from the SAN, as reported in /var/log/messages:

Jan 18 09:58:22 hector kernel: qla2400 0000:05:01.0: ISP Request Transfer Error.
Jan 18 09:58:22 hector kernel: qla2400 0000:05:02.0: ISP Request Transfer Error.
Jan 18 09:58:22 hector kernel: qla2400 0000:05:01.0: Performing ISP error recovery - ha= 00000101fbd983c8.
Jan 18 09:58:22 hector kernel: qla2400 0000:05:02.0: Performing ISP error recovery - ha= 00000101fbed03c8.
Jan 18 09:58:52 hector kernel: qla2400 0000:05:01.0: [ERROR] Failed to load segment 0 of firmware
Jan 18 09:58:52 hector kernel: Mailbox registers:
Jan 18 09:58:52 hector kernel: scsi(1): mbox 0 0x0000
Jan 18 09:58:52 hector kernel: scsi(1): mbox 1 0x0000
Jan 18 09:58:52 hector kernel: scsi(1): mbox 2 0x0001
Jan 18 09:58:52 hector kernel: scsi(1): mbox 3 0x4000
Jan 18 09:58:52 hector kernel: scsi(1): mbox 4 0x0040
Jan 18 09:58:52 hector kernel: scsi(1): mbox 5 0x0000
Jan 18 09:58:52 hector kernel: qla2400 0000:05:02.0: [ERROR] Failed to load segment 0 of firmware
Jan 18 09:58:52 hector kernel: Mailbox registers:
Jan 18 09:58:52 hector kernel: scsi(2): mbox 0 0x0000
Jan 18 09:58:52 hector kernel: scsi(2): mbox 1 0x0000
Jan 18 09:58:52 hector kernel: scsi(2): mbox 2 0x0001
Jan 18 09:58:52 hector kernel: scsi(2): mbox 3 0x4000
Jan 18 09:58:52 hector kernel: scsi(2): mbox 4 0x0040
Jan 18 09:58:52 hector kernel: scsi(2): mbox 5 0x0000
Jan 18 09:59:22 hector kernel: qla2400 0000:05:01.0: [ERROR] Failed to load segment 0 of firmware
Jan 18 09:59:22 hector kernel: Mailbox registers:
Jan 18 09:59:22 hector kernel: scsi(1): mbox 0 0x0000
Jan 18 09:59:22 hector kernel: scsi(1): mbox 1 0x0000
Jan 18 09:59:22 hector kernel: scsi(1): mbox 2 0x0001
Jan 18 09:59:22 hector kernel: scsi(1): mbox 3 0x4000
Jan 18 09:59:22 hector kernel: scsi(1): mbox 4 0x0040
Jan 18 09:59:22 hector kernel: scsi(1): mbox 5 0x0000

The problem is reported on BOTH cards, which prevents any failover (mpp driver).

Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 7 [RAIDarray.mpp]SAN1:1:0 Path Failed
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039014 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039019 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039022 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039027 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039030 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039035 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039039 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: 495 [RAIDarray.mpp]SAN1:1:0:0 Cmnd failed-retry on a new path. vcmnd SN 25039043 pdev H1:C0:T1:L0 0x00/0x00/0x00 0x00010000 mpp_statu
Jan 18 10:00:56 hector kernel: 94 [RAIDarray.mpp]SAN1:1:0:0 Selection Retry count exhausted
Jan 18 10:00:56 hector kernel: st: I/O error, dev sdb, sector 485504928
Jan 18 10:00:56 hector kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jan 18 10:00:56 hector kernel: end_request: I/O error, dev sdb, sector 485502648
Jan 18 10:00:56 hector kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jan 18 10:00:56 hector kernel: end_request: I/O error, dev sdb, sector 485505944
Jan 18 10:00:56 hector kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jan 18 10:00:56 hector kernel: end_request: I/O error, dev sdb, sector 485503664
Jan 18 10:00:56 hector kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jan 18 10:00:56 hector kernel: end_request: I/O error, dev sdb, sector 485498568
Jan 18 10:00:56 hector kernel: SCSI error : <3 0 0 0> return code = 0x10000
Jan 18 10:00:56 hector kernel: end_request: I/O error, dev sdb, sector 485496440

Of course, after a while, the device goes into an error state (ext3 journal aborts, etc.).
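
For anyone who wants to check the same thing in their own logs, here is a rough Python sketch that reports when each HBA first logged "ISP Request Transfer Error". It only assumes the syslog line shape shown in the excerpt above; the function name `first_error_per_hba` is made up for this example and is not part of any driver or tool. If both PCI addresses show the same timestamp, both paths went down together and mpp had nothing to fail over to.

```python
import re

# Assumed line shape, copied from the excerpt in this report:
# "Jan 18 09:58:22 hector kernel: qla2400 0000:05:01.0: ISP Request Transfer Error."
QLA_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"qla2400 (?P<addr>[0-9a-f:.]+): (?P<msg>.*)$"
)

def first_error_per_hba(lines):
    """Map each HBA's PCI address to the timestamp of its first transfer error."""
    first = {}
    for line in lines:
        m = QLA_LINE.match(line.strip())
        if m and "ISP Request Transfer Error" in m.group("msg"):
            # setdefault keeps the earliest timestamp seen for this address
            first.setdefault(m.group("addr"), m.group("ts"))
    return first

if __name__ == "__main__":
    sample = [
        "Jan 18 09:58:22 hector kernel: qla2400 0000:05:01.0: ISP Request Transfer Error.",
        "Jan 18 09:58:22 hector kernel: qla2400 0000:05:02.0: ISP Request Transfer Error.",
        "Jan 18 09:58:52 hector kernel: Mailbox registers:",
    ]
    for addr, ts in first_error_per_hba(sample).items():
        print(addr, ts)
```

To run it against a real log, pass an open file instead of the sample list, e.g. `first_error_per_hba(open("/var/log/messages"))`.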
Created attachment 145904 [details] dmesg output
Created attachment 145905 [details] /var/log/messages file
I have to say that with the recommended driver downloaded from QLogic (as indicated in the documentation from Sun Microsystems), I keep getting slightly different errors. I've reverted to the "standard" driver from the kernel in order to simplify updates. The file downloaded from QLogic is qla2xxx-v8.01.06-dist.tgz.
This issue has been reported to QLogic by Sun and its customers. The issue stems from this platform's (X4200) inability to support modifications to the PCI Max-Memory-Read-Byte-Count. I can also see that the card is connected to one of the host's 66 MHz slots:

QLogic Fibre Channel HBA Driver: 8.01.04-d7
QLogic QLA2460 - Sun PCI-X 2.0 to 4Gb FC, Single Channel
ISP2422: PCI-X Mode 1 (66 MHz) @ 0000:05:01.0 hdma+, host#=1, fw=4.00.18 [IP]

A potential workaround for this issue is to place the HBA in a 133 MHz slot. Beyond that, I'd suggest the customer work directly with Sun.
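
As a quick way to check which slot speed each HBA negotiated, the driver banner quoted above can be scanned for its "PCI-X Mode ... (NN MHz) @ address" part. This is only a sketch: the regular expression assumes the exact banner format shown in this report (other driver versions may print it differently), and `hbas_in_slow_slots` is a hypothetical helper name, not something the driver provides.

```python
import re

# Matches the bus-mode part of the banner quoted in this report, e.g.
# "ISP2422: PCI-X Mode 1 (66 MHz) @ 0000:05:01.0 hdma+, host#=1, ..."
BANNER = re.compile(
    r"(?P<chip>ISP\d+): PCI-X Mode \d+ \((?P<mhz>\d+) MHz\) @ (?P<addr>[0-9a-f:.]+)"
)

def hbas_in_slow_slots(text, limit_mhz=66):
    """Return (pci_address, chip, mhz) for each HBA running at or below limit_mhz."""
    return [
        (m.group("addr"), m.group("chip"), int(m.group("mhz")))
        for m in BANNER.finditer(text)
        if int(m.group("mhz")) <= limit_mhz
    ]

if __name__ == "__main__":
    sample = (
        "ISP2422: PCI-X Mode 1 (66 MHz) @ 0000:05:01.0 "
        "hdma+, host#=1, fw=4.00.18 [IP]"
    )
    print(hbas_in_slow_slots(sample))  # [('0000:05:01.0', 'ISP2422', 66)]
```

Feeding it the saved dmesg output (or the driver's banner text from wherever your driver version exposes it) should list every HBA still sitting in a 66 MHz slot; an empty list means none were found in the text given.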
The Sun X4200 has three 66 MHz PCI-X slots, one 133 MHz slot and one 100 MHz slot. Since I have two cards to connect (in order to introduce redundancy in the SAN connection), I can put one in the 133 MHz slot and the other in the 100 MHz slot. Do you think that workaround could work? Meanwhile, I'll report the problem to Sun. And thanks to Andrew for the fast reply.
We've only seen the issue when FC HBA cards are installed in the 66 MHz slots.