624548 – [LSI CR182858] [Qlogic/LSI 5.6 bug] qle8142: Single device I/O timeout after long run of straight I/O

Keywords:
Status:	CLOSED DUPLICATE of bug 567402
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.7
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	5.6
Assignee:	Tom Coughlan
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	557597

Reported:	2010-08-16 20:36 UTC by Sean Stewart
Modified:	2010-11-24 14:19 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-16 15:46:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Host log messages (200.52 KB, application/x-zip-compressed) 2010-09-27 19:54 UTC, Abdel Jalal	no flags
Recreate_Oct6 (674.75 KB, application/x-zip-compressed) 2010-10-06 22:33 UTC, Abdel Jalal	no flags

Description Sean Stewart 2010-08-16 20:36:11 UTC

User-Agent:       Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)

As far as I can tell, this I/O timeout occurs whether or not any exceptions are running on the 10 Gbps array. The I/O usually runs for six to 12 hours and then a single device experiences an I/O timeout. Host connections periodically get reset; however, other hosts in the configuration (running different operating systems) recover from this without the I/O timing out, and the same tests have been run successfully with other 10 Gbps NICs.

Configuration:
Host:
Host: kswc-elayne.lsi.com 
OS: RHEL5 U5 
Failover: MPP 99.03.0C00.0430
HBA Model: QLogic 8142 10Gbps CNA (iSCSI)
HBA Driver Ver:  qla2xxx 8.03.01.04.05.05-k  /  qlge 1.00.00.23
HBA FW Ver: 5.02.01

Array: LSI Engenio 7900 with 10Gbps host connection

Reproducible: Always

Steps to Reproduce:
1. Create and map 32 volumes from each array to the host
2. Start I/O to the volumes
3. Let the I/O run for 16 hours
Actual Results:  
A single device on the RHEL 5.5 host will experience an I/O timeout. This can occur anywhere between 6 and 14 hours into the test run. The logs show that failover and failback occurred exactly as expected.

Expected Results:  
The test should run for 16 hours without error or I/O timeout

Comment 1 Andrius Benokraitis 2010-08-24 00:19:43 UTC

Chad/Ying - can you take a look into this?

Comment 2 Chad Dupuis (Cavium) 2010-08-24 12:27:14 UTC

(In reply to comment #1)
> Chad/Ying - can you take a look into this?

Is this issue seen through the FCoE driver or through the qlge stack with an iSCSI initiator on top?

Comment 3 Abdel Sadek 2010-09-01 18:54:58 UTC

No FCoE is involved.
We're only using the qlge driver with the iSCSI software initiator.

Comment 4 Chad Dupuis (Cavium) 2010-09-01 20:08:27 UTC

Thanks Abdel.  Would it be possible to get the /var/log/messages file from when the failure occurred?

Comment 5 Abdel Jalal 2010-09-27 19:54:27 UTC

Created attachment 450020
Host log messages

Comment 6 Abdel Jalal 2010-09-27 20:38:51 UTC

Other hosts (RHEL5.5) with different HBAs (Broadcom 57710, 10GB Intel XFSR, Brocade 1020) in the same config experienced the same issue.

Comment 7 Chad Dupuis (Cavium) 2010-09-29 16:07:02 UTC

We looked through the log files and found no errors detected by qlge (i.e. no error messages for any frames, no unexpected link events, etc.).  My guess is that the issue is either further up the networking stack or on the target side.

Comment 8 Abdel Jalal 2010-10-01 16:48:41 UTC

I also noticed that the host log doesn't show any connection drops or unexpected link events, and the target doesn't show any connection drops either. Is there anything in the SCSI mid-layer that we can turn on to capture more information?

Comment 10 Abdel Jalal 2010-10-06 22:33:00 UTC

Created attachment 452008
Recreate_Oct6

Comment 11 Abdel Jalal 2010-10-06 22:37:18 UTC

A recreate of the issue showed the messages below, followed by a connection error:

Oct  6 10:20:00 kswc-emmafrost kernel: qlge 0000:09:00.1: ql_process_mac_rx_page: Receive error, flags2 = 0x5d
Oct  6 10:20:00 kswc-emmafrost last message repeated 3 times
Oct  6 10:20:10 kswc-emmafrost kernel:  connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360221376, last ping 4360226376, now 4360231376
Oct  6 10:20:10 kswc-emmafrost kernel:  connection8:0: detected conn error (1011)
Oct  6 10:20:10 kswc-emmafrost iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)

The host log (Recreate_Oct6) has been provided as an attachment.
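
For anyone reading the ping timeout line, here is a minimal Python sketch that pulls the three jiffies counters out of the message and converts the gaps to seconds. The tick rate `HZ = 1000` is an assumption about the RHEL 5 x86_64 kernel, and the function name `ping_timeout_gaps` is made up for illustration; only the two log lines are taken from the excerpt.

```python
import re

# Two lines copied from the log excerpt in this comment.
LOG = """\
Oct  6 10:20:00 kswc-emmafrost kernel: qlge 0000:09:00.1: ql_process_mac_rx_page: Receive error, flags2 = 0x5d
Oct  6 10:20:10 kswc-emmafrost kernel:  connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360221376, last ping 4360226376, now 4360231376
"""

HZ = 1000  # ASSUMPTION: jiffies per second on this kernel; not stated in the log

PING_RE = re.compile(
    r"ping timeout of (?P<timeout>\d+) secs expired, "
    r"recv timeout (?P<recv>\d+), "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), now (?P<now>\d+)"
)


def ping_timeout_gaps(log_text, hz=HZ):
    """Return (rx_to_ping, ping_to_now) in seconds for each ping timeout line."""
    gaps = []
    for match in PING_RE.finditer(log_text):
        last_rx = int(match.group("last_rx"))
        last_ping = int(match.group("last_ping"))
        now = int(match.group("now"))
        gaps.append(((last_ping - last_rx) / hz, (now - last_ping) / hz))
    return gaps


print(ping_timeout_gaps(LOG))
```

Under that assumption both gaps come out to 5 seconds, which lines up with the "recv timeout 5" and "ping timeout of 5 secs" values in the message: nothing was received for 5 seconds, a ping was sent, and no reply arrived within the next 5 seconds. The qlge receive errors were logged 10 seconds before the timeout, which is consistent with that window but does not by itself prove they are the cause.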

Comment 12 Abdel Sadek 2010-10-07 14:55:31 UTC

Chad,
I see in Bug 567402 that QLE8142 support has been added through qlge driver update 1.00.00.25. Does that mean it's not supported with the 1.00.00.23 driver that comes with the base RHEL 5.5 kernel 2.6.18-194.el5?

Comment 13 Abdel Jalal 2010-10-11 05:43:06 UTC

I was unable to recreate the issue after upgrading the kernel to 2.6.18-225.el5, which has a newer version of the QLogic driver (1.00.00.25). When will this test kernel be released?
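
As a rough aid for checking other hosts in the configuration, the sketch below compares a qlge version string against the 1.00.00.25 level mentioned above. The helper names are hypothetical, and the installed version string is assumed to have been read separately (for example from `modinfo qlge` or `ethtool -i` on the interface); the sketch only does the comparison.

```python
# Versions quoted in this report.
BASE_RHEL55_QLGE = "1.00.00.23"  # qlge shipped with kernel 2.6.18-194.el5
UPDATED_QLGE = "1.00.00.25"      # qlge in test kernel 2.6.18-225.el5


def parse_driver_version(text):
    """Split a dotted version such as '1.00.00.23' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def has_updated_qlge(installed, minimum=UPDATED_QLGE):
    """True if the installed qlge version is at or above the given minimum."""
    return parse_driver_version(installed) >= parse_driver_version(minimum)


print(has_updated_qlge(BASE_RHEL55_QLGE))  # base RHEL 5.5 driver
print(has_updated_qlge(UPDATED_QLGE))      # driver from the test kernel
```

Comparing integer tuples rather than raw strings avoids surprises if a later release uses a field with a different number of digits.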

Comment 14 Andrius Benokraitis 2010-10-11 15:22:06 UTC

(In reply to comment #13)
> I was unable to recreate the issue after upgrading kernel to 2.6.18-225.el5
> which has new version of qlogic driver (1.00.00.25) - When will this test
> kernel be released?

This will be publicly released in RHEL 5.6.

Comment 15 Andrius Benokraitis 2010-10-12 22:02:55 UTC

Chad, can I dupe this bugzilla to bug 567402?

Comment 16 Chad Dupuis (Cavium) 2010-10-13 14:53:29 UTC

(In reply to comment #15)
> Chad, can I dupe this bugzilla to bug 567402?

Yes, if 1.00.00.25 resolves this issue then that should be fine.

Comment 17 Tom Coughlan 2010-11-16 15:46:15 UTC


*** This bug has been marked as a duplicate of bug 567402 ***
