Bug 736871 - Unhandled error code - Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Summary: Unhandled error code - Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: iscsi-initiator-utils
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Andy Grover
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-09-08 22:06 UTC by Dinesh Surpur
Modified: 2012-06-26 16:26 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-30 20:15:27 UTC



Description Dinesh Surpur 2011-09-08 22:06:39 UTC
Description of problem:
I/O stops and resumes unexpectedly during a failover test.

Host log shows:
...
Sep  7 19:27:34 hpc7000-07-b4 kernel: connection2:0: detected conn error (1020)
Sep  7 19:27:35 hpc7000-07-b4 iscsid: Kernel reported iSCSI connection 2:0 error (1020) state (3)
Sep  7 19:27:39 hpc7000-07-b4 kernel: session2: session recovery timed out after 5 secs
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] Unhandled error code
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] Unhandled error code
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] CDB: Read(10): 28 00 00 00 33 72 00 00 0e 00
Sep  7 19:27:39 hpc7000-07-b4 kernel: end_request: I/O error, dev sdb, sector 13170
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Sep  7 19:27:39 hpc7000-07-b4 kernel: sd 8:0:0:3: [sdb] CDB: Read(10):


Version-Release number of selected component (if applicable):
ProLiant BL460c G7 with 10GbE NC553i FlexFabric 2 Ports for 10G iSCSI test
Cisco Nexus FCOE switch - 4.2(1)N2(1)

Red Hat Enterprise Linux Server release 6.1 (Santiago)
2.6.32-131.0.15.el6.x86_64
lpfc drv: 8.3.5.30.1p & fwver: 3.702.517.701, sli-4

iscsi-initiator-utils-6.2.0.872-21.el6.x86_64

device-mapper-multipath-0.4.9-41.el6.x86_64
device-mapper-libs-1.02.62-3.el6.x86_64
device-mapper-1.02.62-3.el6.x86_64
udev-147-2.35.el6.x86_64


How reproducible:
Start I/O and make sure it is steady.
Reset a node to trigger failover.
I/O becomes unsteady during the failover - it stops and resumes unexpectedly.


Steps to Reproduce:
1. Start I/O and make sure it is steady.
2. Reset a node to trigger failover.
3. I/O becomes unsteady during the failover - it stops and resumes unexpectedly.

  
Actual results:


Expected results:


Additional info:

Comment 2 Mike Christie 2011-09-13 21:49:36 UTC
I am not sure I understand the problem. What do you think should be happening?

It looks like the target drops the connection. The initiator then cannot relogin within 5 secs (the replacement/recovery timeout you have set), so the initiator fails the IO. This is expected if by "node" you mean the iSCSI node, as in the iSCSI target. During the time it is rebooted the initiator cannot access it, correct? Or is there supposed to be some target failover magic? If the initiator cannot access it for longer than the replacement/recovery timeout you have set, then you will get IO errors until the target comes back online and we can relogin.
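
For reference, the replacement/recovery timeout mentioned above is the open-iscsi setting node.session.timeo.replacement_timeout. A minimal sketch of where it lives and how it could be changed follows; the <target> and <portal> values are placeholders, not taken from this setup:

# /etc/iscsi/iscsid.conf - applies to sessions created after the change
node.session.timeo.replacement_timeout = 5

# or update an existing node record, then log out/in for it to take effect
iscsiadm -m node -T <target> -p <portal> -o update -n node.session.timeo.replacement_timeout -v 5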

Or, is the log output from some other initiator/session in the system that should not be affected by the node reboot?

Comment 3 Dinesh Surpur 2011-09-23 00:12:42 UTC
Mike,

What we are concerned about is the error code "DID_TRANSPORT_FAILFAST". Since we are running device-mapper-multipath, is it possible that Device Mapper interprets this error code, or is the error code sent directly to the application layer? We are running a data integrity test tool which reports data miscompare errors when such errors happen. The tool works fine on Solaris, VMware, and Windows, but has a problem only on Red Hat and SuSE Linux hosts connected to the same storage. We checked our storage array when the data compare error was reported for the specific block, and the data there is correct. So we are wondering whether it is the SCSI or the device mapper layer that has issues handling this error code?

"Sep 22 14:41:34 per610-01 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 10, last rx 4297180765, last ping 4297
190765, now 4297195765                                                                      Sep 22 14:41:34 per610-01 kernel: connection1:0: detected conn error (1011)
Sep 22 14:41:34 per610-01 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 10, last rx 4297180765, last ping 4297
190765, now 4297195765
Sep 22 14:41:34 per610-01 kernel: connection2:0: detected conn error (1011)
Sep 22 14:41:35 per610-01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
Sep 22 14:41:35 per610-01 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Sep 22 14:41:39 per610-01 kernel: session1: session recovery timed out after 5 secs              
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Unhandled error code
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK  ==========
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] CDB: Read(10): 28 00 00 00 02 e1 00 00 0f 00
Sep 22 14:41:39 per610-01 kernel: end_request: I/O error, dev sdb, sector 737
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Unhandled error code
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] CDB: Read(10): 28 00 00 00 36 21 00 00 0a 00
Sep 22 14:41:39 per610-01 kernel: end_request: I/O error, dev sdb, sector 13857"


Also, here is a post that discusses a similar issue:
http://permalink.gmane.org/gmane.linux.iscsi.open-iscsi/5673

Is there a newer iSCSI or multipath package that addresses this issue?

Also, to get more debugging info from the SCSI layer, is this the appropriate flag to set for this issue?

echo 9411 > /proc/sys/dev/scsi/logging_level
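
For reference, a way to set that value and restore it afterwards; a minimal sketch, assuming a root shell on the host (the value 9411 is simply carried over from the question above, not a recommendation):

old_level=$(cat /proc/sys/dev/scsi/logging_level)
echo 9411 > /proc/sys/dev/scsi/logging_level      # increase SCSI logging
# ... reproduce the failover ...
echo "$old_level" > /proc/sys/dev/scsi/logging_level      # restore the previous level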

Comment 4 Mike Christie 2011-09-24 01:13:41 UTC
(In reply to comment #3)
> Mike,
> 
> What we are concerned is the error code "DID_TRANSPORT_FAILFAST". As we are
> running the device mapper multipath it is possible that Device Mapper can
> interpret this error code or is the error code is sent directly to the


It is sent to the multipath layer.


Are you seeing dm multipath fail the IO or retry it? Is there more to the log? Do you see IO errors sent from the dm/multipath layer?



> application layer  ? We are running some data integrity test tool which gives
> out data miscompare errors when such errors happen. The tool works fine for
> Solaris, VMware, Windows but only has a problem with Redhat and SuSE Linux
> hosts connected to the same storage. We checked our data storage array when the
> data compare error is reported for the specific block and the data is correct.
> So we are wondering is it the SCSI or device mapper layer having issues in
> addressing this error code ? 

iSCSI just passes the error upwards for whatever is above it to handle it.

At this time do you have more paths available? Should multipath be retrying, or should it be queueing due to not having any paths? If the latter, what is the no_path_retry or queue_if_no_path setting?
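
For example, output from something like the following, captured around the time of the failure, would show the path states and the effective settings; a sketch, assuming the RHEL 6 multipath tools:

multipath -ll
multipathd -k"show config"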



> 
> "Sep 22 14:41:34 per610-01 kernel: connection1:0: ping timeout of 5 secs
> expired, recv timeout 10, last rx 4297180765, last ping 4297
> 190765, now 4297195765                                                         
>             Sep 22 14:41:34 per610-01 kernel: connection1:0: detected conn
> error (1011)
> Sep 22 14:41:34 per610-01 kernel: connection2:0: ping timeout of 5 secs
> expired, recv timeout 10, last rx 4297180765, last ping 4297
> 190765, now 4297195765
> Sep 22 14:41:34 per610-01 kernel: connection2:0: detected conn error (1011)
> Sep 22 14:41:35 per610-01 iscsid: Kernel reported iSCSI connection 1:0 error
> (1011) state (3)
> Sep 22 14:41:35 per610-01 iscsid: Kernel reported iSCSI connection 2:0 error
> (1011) state (3)
> Sep 22 14:41:39 per610-01 kernel: session1: session recovery timed out after 5
> secs              
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Unhandled error code
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Result:
> hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK  ==========
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] CDB: Read(10): 28 00 00 00
> 02 e1 00 00 0f 00
> Sep 22 14:41:39 per610-01 kernel: end_request: I/O error, dev sdb, sector 737
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Unhandled error code
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] Result:
> hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> Sep 22 14:41:39 per610-01 kernel: sd 9:0:0:1: [sdb] CDB: Read(10): 28 00 00 00
> 36 21 00 00 0a 00
> Sep 22 14:41:39 per610-01 kernel: end_request: I/O error, dev sdb, sector
> 13857"
> 
> 
> Also here is a blog which talks on similar issue
> http://permalink.gmane.org/gmane.linux.iscsi.open-iscsi/5673
> 
> Is there a newer iSCSI or multipath package to address this issue ?
> 

What issue in there exactly? It looks like the person in that mail you pointed to was using the device directly and not a multipath device. You are using dm multipath, right?



> Also, to get more debugging info from SCSI Layer is this the appropriate flag
> to be set for this issue  ?
> 
> echo 9411 > /proc/sys/dev/scsi/logging_level

Increasing the scsi logging is not really going to help much. We know the scsi layer is sending the error upwards. You could turn on iscsi or scsi debugging to see if multipath has sent the IO to another path, but that is going to get really noisy.

Comment 5 Dinesh Surpur 2011-09-28 23:40:38 UTC
Mike,

Here is an overview; I will further update the defect with answers to the queries you asked.

We have two internal HP data verification tools called snaptest and hazard. Each tool writes a specific pattern and reads it back for verification. This is being tested in the HP Storage lab.

The setup has multiple hosts. Each host has either a QLogic 8242 CNA, an Emulex CN1000E CNA, or an HP 10G NIC, runs RHEL 6.1 or RHEL 5.7, and is connected through the Cisco Nexus FCoE switches in a software iSCSI setup with device mapper to a P10000 HP 3PAR Storage Array with a target 10G QLogic 8242 card. It is a non-clustered environment. The versions of the iSCSI software and device mapper are the ones that ship with RHEL 5.7 and 6.1.

Snaptest data verification tool:

Running long duration I/O
================================

When we run the snaptest I/O tool and just do I/O, we hit a data corruption error sometimes within 2 hours and sometimes later, around 8 hours. Before the corruption is reported we do not see any sort of I/O error from either the SCSI or dm layer, but sometimes I/O service times are higher. Also, we panicked our storage nodes when data corruption was reported by snaptest and analyzed the traces; here is what one storage developer mentioned:

"Analyzed the latest failure and discovered that the cache page on storage itself contained the incorrect data that was returned on a subsequent read by the snaptest. Either the initiator gave us bad WRITE data or the data got garbled  somewhere along our stack after reaching the target port."

Running Panic on storage node
======================================
We also did a test where we panic one of the storage nodes; this also resulted in data corruption reported by snaptest.

Running without device mapper
===================================

We ran on a block device without device mapper (disabled it), e.g. /dev/sdb; snaptest ran fine for 24 hours with no corruption reported.

On other OS's
===============

We ran snaptest on Windows, Solaris, VMware, and Citrix, and it runs fine with the same P10000 storage array in the same 10G iSCSI setup with the same FCoE HBAs (QLogic 8242, Emulex CN1000E) and HP 10G NIC on the host. No data corruption reported.

Running on Fibre Channel:
================================

Removed iSCSI, connected a Fibre Channel HBA to the storage, and ran snaptest; it works without any data corruption for 8-10 hours.

Hazard tool
==================

When we run the hazard I/O tool we are seeing firmware cores on the target storage port, which result in loss of the iSCSI connection; dm path failover occurs, and when the paths come back, data corruption is reported as the IO restarts.
 
Here is the analysis from Hazard Tool Team

---------------------------------------------------
Here is a snippet of a simplified view of a Hazard process that detected a corruption. In each count, or loop, the process does a read and then a write. You can see that leading up to the corruption, in counts 55 and 56 both the read and the write were notified with an I/O error. Even the read in count 57 failed. However, there were no more errors within the next 23 seconds. That means the write of count 57 was successful, and so were counts 58, 59, 60 … 69. Since the writes had been successful, there was an expectation of what would be read back. The read on count 70 encountered the miscompare. When it tried to re-read, it got an I/O error. The rest of the counts onward had I/O errors too. Again, this is a simplified view. In between each of these lines there are many more lines of I/O errors logged by the sd driver in the kernel.

   b200m1-06-81.Sep-19 02:11:39.CRIT!:Failed read() /dev/mapper/mpathnp2 (count=55): Input/output error
   b200m1-06-81.Sep-19 02:11:40.CRIT!:Failed write() /dev/mapper/mpathnp2 (count=55): Input/output error
   b200m1-06-81.Sep-19 02:11:41.CRIT!:Failed read() /dev/mapper/mpathnp2 (count=56): Input/output error
   b200m1-06-81.Sep-19 02:11:42.CRIT!:Failed write() /dev/mapper/mpathnp2 (count=56): Input/output error
   b200m1-06-81.Sep-19 02:11:43.CRIT!:Failed read() /dev/mapper/mpathnp2 (count=57): Input/output error
   b200m1-06-81.Sep-19 02:12:06.CRIT!:Corruption on /dev/mapper/mpathnp2 (count=70):
   b200m1-06-81.Sep-19 02:12:07.info :  blocks 75 (LBA 0x00000000009f4807) thru 126 (LBA 0x00000000009f483a) are
corrupt
   b200m1-06-81.Sep-19 02:12:07.info :  device seek LBA 75 (0x000000000000004b)
   b200m1-06-81.Sep-19 02:13:11.WARN :Failed read():re /dev/mapper/mpathnp2: Input/output error
   b200m1-06-81.Sep-19 02:13:12.WARN :(re)read: -1 != 70656
   b200m1-06-81.Sep-19 02:13:12.info :Unexpected data extent: relative offset 38824; length 26064
   b200m1-06-81.Sep-19 02:13:12.info :Type: unknown (reread data not available)
   b200m1-06-81.Sep-19 02:13:13.info :Corruption Summary:
   b200m1-06-81.Sep-19 02:13:13.info : 127 blocks had Hazard data (12 sequences)
   b200m1-06-81.Sep-19 02:13:13.info : 11 blocks had some miscompared/unrecognized data
   b200m1-06-81.Sep-19 02:13:13.info : 31 blocks had Hazard data but with wrong pattern offset
   b200m1-06-81.Sep-19 02:13:13.info : 0 blocks had no expectation
   b200m1-06-81.Sep-19 02:13:13.CRIT!:Failed write() /dev/mapper/mpathnp2 (count=70): Input/output error
   b200m1-06-81.Sep-19 02:13:14.CRIT!:Failed read() /dev/mapper/mpathnp2 (count=71): Input/output error
   b200m1-06-81.Sep-19 02:13:15.CRIT!:Failed write() /dev/mapper/mpathnp2 (count=71): Input/output error


In the log, one can tell when the iscsi connection dropped, but I can't tell when it came back. Anyhow, here are chunks of some events:

   Sep-17 ~18:59 test started and everything was good till… (more than a day)
   Sep-19 ~01:47 Failing path; one or more paths failing
   Sep-19 ~01:55 I/O error; no more path available
   Sep-19 ~01:58 Failing path; one or more paths failing;
   Sep-19 ~01:59 I/O error; no more path available
   Sep-19 ~02:00 Failing path; one or more paths failing
   Sep-19 ~02:01 I/O error; no more path available
   Sep-19 ~02:12 Failing path; one or more paths failing
   Sep-19 ~02:13 I/O error; no more path available
   Sep-19 ~02:17 Failing path; one or more paths failing
   Sep-19 ~02:18 I/O error; no more path available
   Sep-19 ~02:30 everything quiet down
   Sep-19 ~12:42 Failing path; one or more paths failing
   Sep-19 ~12:43 I/O error; no more path available
   Sep-19 ~12:53 Failing path; one or more paths failing
   Sep-19 ~12:54 I/O error; no more path available

----------------------------------------------------

We are working with QLogic to evaluate why the QLE 8242 card is taking a firmware core dump, but they haven't come up with an analysis yet.


So it seems that data corruption is reported whenever a path is lost as well. 

Since other OSes are working fine except Red Hat Linux, the suspicion is that either the device mapper / iSCSI stack is passing incorrect status for write completion or it is sending us bad data.


Answers to your queries:

It is sent to the multipath layer.

Are you seeing dm multipath fail the IO or retry it? Is there more to the log?
Do you see IO errors sent from the dm/multipath layer?

>> Whenever a path goes missing, due to a firmware core on the target port or a panic test where we artificially panic a storage node, we see the SCSI errors reported to dm, and dm marks that path as failed.

At this time do you have more paths available? Should multipath be retrying or
should it be queueing due to not having any paths? If the latter what is the no
path retry or queue if no path setting?

>> We have set no_path_retry to 12 with an interval of 5 secs. Essentially it will retry for 60 secs and then send the error to the application.
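
For illustration, a minimal sketch of how that is typically expressed in /etc/multipath.conf; this is not a copy of our actual file:

defaults {
        polling_interval 5
        no_path_retry    12
}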

What issue in there exactly? It looks like the person in that mail you pointed
to was using the device directly and not a multiapth device. You are using dm
multipath right?

>> Yes we are using the multipathing.


Queries we have :
============================

1. Since we do not see this issue on other OSes and suspect either the dm layer or the iSCSI layer in Red Hat, how should we proceed?

2. We have host logs which are huge; how do we provide them for your analysis?

3. We want to have a conference call to further discuss how to root-cause this issue. Could I email you the calendar request?

4. For the hazard tool, when a path is lost due to a firmware core on the target side we see the following error. Why does the kernel report "Unhandled error code"?

---------------------
"Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:25: [sdat] Unhandled error code"
---------------------

snippet of message log
===============================
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:22: [sdap] Unhandled error code
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:22: [sdap] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:22: [sdap] CDB: Read(10): 28 00 00 9f 85 01 00 00 63 00
Sep 16 15:51:05 per610-01 kernel: end_request: I/O error, dev sdap, sector 10454273
Sep 16 15:51:05 per610-01 kernel: 00
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:25: [sdat] Unhandled error code
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:25: [sdat] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:25: [sdat] CDB: Write(10): 2a 00 00 9f 81 53 00 01 08 00
Sep 16 15:51:05 per610-01 kernel: end_request: I/O error, dev sdat, sector 10453331
Sep 16 15:51:05 per610-01 kernel: 9f
Sep 16 15:51:05 per610-01 kernel: device-mapper: multipath: Failing path 8:128.
Sep 16 15:51:05 per610-01 kernel: dc 0c
Sep 16 15:51:05 per610-01 kernel: device-mapper: multipath: Failing path 66:144.
Sep 16 15:51:05 per610-01 kernel: 00
Sep 16 15:51:05 per610-01 kernel: device-mapper: multipath: Failing path 66:208.

Comment 6 Mike Christie 2011-09-29 18:52:19 UTC
For large logs you can do this:

$lftp dropbox.redhat.com
>cd /incoming
>put unique-filename

Comment 7 Mike Christie 2011-09-29 19:37:02 UTC
> In the log, one can tell when iscsi connection dropped but I can’t tell when it


A message like:

iscsid: connection%d:%d is operational after recovery (%d attempts)

would appear in /var/log/messages.
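
For example, a quick way to check for it would be something like the following, assuming the default syslog destination:

grep "operational after recovery" /var/log/messages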



> came back.  Anyhow, here are chunks of some events:
> 
>    Sep-17 ~18:59 test started and everything was good till… (more than a day)
>    Sep-19 ~01:47 Failing path; one or more paths failing
>    Sep-19 ~01:55 I/O error; no more path available
>


Are these messages coming from your test tool? We have not seen them in our tools or kernel.


> 2. We have host logs which are huge how do we provide for your analysis ?  
> 
> 3. We want to have a conference call to futher discuss how to root cause this
> issue. Could i email you the calender request ?
> 

Let me look at your logs and pass them to the multipath person first.


> 4. For the harzard tool when path is lost due to firmware core on target side
> we see the following error. Why the kernel reports as Unhandled Error code"
> 

It is just a bad log message. It means we did not have a nice string to print to further explain what the problem was, so all you get is the string about DID_TRANSPORT_FAILFAST.


Is it possible to replicate this with a very basic setup so we can dump a bunch of info like tcpdump traces or turn on all debugging in the iscsi kernel code? Maybe just 2 paths?


When you tested with qla2xxx/lpfc did you see scsi error messages and dm-multipath path failures, or did the scsi layer and driver internally retry in those cases? In the latter case you might see abort or other messages about TMFs being sent, but you would not see error messages from sd or "end_request: I/O error" messages like you did in the log snippets you posted.

Comment 8 RHEL Product and Program Management 2011-10-07 14:53:53 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as an
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 9 Dinesh Surpur 2012-01-30 20:15:27 UTC
Closing this issue: the problem was in the target iSCSI driver, which was causing the corruption and has been fixed; it is not a host-side issue.

Thanks.

