As far as I can tell, this I/O timeout occurs whether or not any exceptions are running on the 10 Gbps array. The I/O usually runs for 6 to 12 hours before a single device experiences an I/O timeout. Host connections are periodically reset; however, other hosts in the configuration (running different operating systems) recover from this without the I/O timing out, and the same tests have been run successfully with other 10 Gbps NICs.
OS: RHEL5 U5
Failover: MPP 99.03.0C00.0430
HBA Model: QLogic 8142 10Gbps CNA (iSCSI)
HBA Driver Ver: qla2xxx 8.03.01.04.05.05-k / qlge 1.00.00.23
HBA FW Ver: 5.02.01
Array: LSI Engenio 7900 with 10Gbps host connection
Steps to Reproduce:
1. Create and map 32 volumes from each array to the host
2. Start I/O to the volumes
3. Let the I/O run for 16 hours
A single device on the RHEL 5.5 host will experience an I/O timeout. This can occur anywhere between 6 and 14 hours into the test run. The logs show that failover and failback occurred exactly as expected.
The test should run for 16 hours without error or I/O timeout
Chad/Ying - can you take a look into this?
(In reply to comment #1)
> Chad/Ying - can you take a look into this?
Is this issue seen through the FCoE driver or through the qlge stack with an iSCSI initiator on top?
No FCoE is involved.
We're only using the qlge driver with the iSCSI software initiator.
Thanks Abdel. Would it be possible to get the /var/log/messages file from when the failure occurred?
Created attachment 450020
Host log messages
Other hosts (RHEL 5.5) with different HBAs (Broadcom 57710, 10GB Intel XFSR, Brocade 1020) in the same config experienced the same issue.
We looked through the log files and there were no qlge-detected errors (i.e. no error messages for any frames, no unexpected link events, etc.). My guess would be that the issue is either with something further up the networking stack or something on the target side.
I also noticed that the host log doesn't show any connection drops or unexpected link events, and the target doesn't show any connection drops either. Is there anything in the SCSI mid-layer that can be turned on to capture more information?
Created attachment 452008
A recreate of the issue showed the messages below, followed by a connection error:
Oct 6 10:20:00 kswc-emmafrost kernel: qlge 0000:09:00.1: ql_process_mac_rx_page: Receive error, flags2 = 0x5d
Oct 6 10:20:00 kswc-emmafrost last message repeated 3 times
Oct 6 10:20:10 kswc-emmafrost kernel: connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360221376, last ping 4360226376, now 4360231376
Oct 6 10:20:10 kswc-emmafrost kernel: connection8:0: detected conn error (1011)
Oct 6 10:20:10 kswc-emmafrost iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
The host log (Recreate_Oct6) has been provided as an attachment.
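For anyone going through the attached log, here is a rough Python sketch that picks out the two message types quoted above (the qlge receive error and the iSCSI ping timeout) and converts the ping timeout counters into seconds. The regular expressions are written only against the lines quoted in this comment, the function name `scan` is made up for illustration, and the conversion assumes the counters are jiffies with HZ=1000, which has not been confirmed for this kernel.

```python
import re

# Sample lines copied from the log excerpt quoted above.
SAMPLE = """\
Oct 6 10:20:00 kswc-emmafrost kernel: qlge 0000:09:00.1: ql_process_mac_rx_page: Receive error, flags2 = 0x5d
Oct 6 10:20:10 kswc-emmafrost kernel: connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360221376, last ping 4360226376, now 4360231376
Oct 6 10:20:10 kswc-emmafrost kernel: connection8:0: detected conn error (1011)
"""

# Patterns derived only from the quoted lines; other message variants are not covered.
RX_ERR_RE = re.compile(
    r"qlge (\S+): \w+: Receive error, flags2 = (0x[0-9a-fA-F]+)"
)
PING_RE = re.compile(
    r"(connection\d+:\d+): ping timeout of (\d+) secs expired, "
    r"recv timeout (\d+), last rx (\d+), last ping (\d+), now (\d+)"
)

# ASSUMPTION: the counters are jiffies and the kernel ticks at 1000 Hz.
ASSUMED_HZ = 1000


def scan(lines, hz=ASSUMED_HZ):
    """Return a list of (kind, source, ...) tuples for the lines of interest."""
    events = []
    for line in lines:
        m = RX_ERR_RE.search(line)
        if m:
            # PCI address of the port and the raw flags2 value.
            events.append(("rx_error", m.group(1), m.group(2)))
            continue
        m = PING_RE.search(line)
        if m:
            last_rx, last_ping, now = (int(g) for g in m.group(4, 5, 6))
            events.append((
                "ping_timeout",
                m.group(1),
                (last_ping - last_rx) / hz,  # seconds from last rx to the ping
                (now - last_ping) / hz,      # seconds from the ping to the timeout
            ))
    return events


if __name__ == "__main__":
    for event in scan(SAMPLE.splitlines()):
        print(event)
```

Under that HZ assumption, both intervals in the quoted line come out to 5 seconds, which matches the "ping timeout of 5 secs" and "recv timeout 5" values in the same message. To run it against the full attachment, pass the open file object to `scan` instead of the sample lines.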
I see in bug 567402 that QLE8142 support has been added through qlge driver update 1.00.00.25. Does that mean it is not supported with the 1.00.00.23 driver that comes with the base RHEL 5.5 kernel 2.6.18-194.el5?
I was unable to recreate the issue after upgrading kernel to 2.6.18-225.el5 which has new version of qlogic driver (1.00.00.25) - When will this test kernel be released?
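In case it helps others check their hosts, here is a small Python sketch that compares a qlge version string against 1.00.00.25, the version I could no longer recreate the issue with. The helper names are made up for illustration, and reading the loaded version from /sys/module/qlge/version is an assumption about where the module exposes it; the path may not exist on every kernel.

```python
def parse_version(text):
    """Turn a dotted version such as '1.00.00.23' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Version the issue could not be recreated with, per this comment.
NOT_REPRODUCED_WITH = parse_version("1.00.00.25")


def at_least_tested_version(installed):
    """True if the installed driver is 1.00.00.25 or newer."""
    return parse_version(installed) >= NOT_REPRODUCED_WITH


def read_loaded_version(path="/sys/module/qlge/version"):
    """Return the loaded driver version, or None if the file is unavailable.

    ASSUMPTION: the module exports its version at this sysfs path.
    """
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    loaded = read_loaded_version()
    if loaded is None:
        print("qlge version not available from sysfs")
    else:
        print(loaded, at_least_tested_version(loaded))
```

Comparing tuples of integers avoids the pitfalls of comparing the dotted strings directly, for example when one component gains a digit.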
(In reply to comment #13)
> I was unable to recreate the issue after upgrading kernel to 2.6.18-225.el5
> which has new version of qlogic driver (1.00.00.25) - When will this test
> kernel be released?
This will be publicly released in RHEL 5.6.
Chad, can I dupe this bugzilla to bug 567402?
(In reply to comment #15)
> Chad, can I dupe this bugzilla to bug 567402?
Yes, if 1.00.00.25 resolves this issue then that should be fine.
*** This bug has been marked as a duplicate of bug 567402 ***