Bug 479213 - On BFS setup DM loses and does not regain devices when Bus Reset is received on QL hba
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Marcus Barrow
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA
Depends On:
Blocks: 451642 460170 483784
Reported: 2009-01-07 21:21 EST by Hector Arteaga
Modified: 2009-06-20 05:04 EDT
Doc Type: Bug Fix
Last Closed: 2009-02-09 17:32:47 EST

Attachments
Server Logs (151.95 KB, application/x-zip-compressed)
2009-01-07 21:24 EST, Hector Arteaga

Description Hector Arteaga 2009-01-07 21:21:08 EST
Description of problem:
During hazard c16 testing on my boot-from-SAN configuration, bus resets are issued to the HBAs.  When that happens, device mapper complains about paths disappearing.  The devices disappear and do not come back.  Eventually hazard issues another bus reset to the other device, causing the remaining paths to disappear (2-ported HBA) and the system to lose access to its OS LUN.  The OS then goes into read-only mode.  This is seen on a system with QLogic HBAs.  In the same setup I have a server with Emulex HBAs that does not exhibit the same behavior.

Version-Release number of selected component (if applicable):
RedHat 5.3 RC1 (xen)

How reproducible:
readily reproducible with hazard

Steps to Reproduce:
1.  Set up a server with QLogic HBAs booting from SAN and loading the Xen kernel
2.  Present storage
3.  Run a hazard c16
  

Additional info:
A colleague of mine ran a similar test on a server that was booting on a local disk and did not encounter this issue.
I have attached a zip file containing the /var/log/messages, dmesg, and xend.log logs.
Comment 1 Hector Arteaga 2009-01-07 21:24:15 EST
Created attachment 328430 [details]
Server Logs

It looks like my initial attempt at attaching the zip file failed.  Here is try #2.
Comment 2 Tom Coughlan 2009-01-14 15:25:45 EST
The pattern in the log seems to be:

Jan  6 18:04:20 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: scsi(1:2:3): LOOP RESET ISSUED.
Jan  6 18:04:29 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: qla2xxx_eh_bus_reset: reset succeded
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-0: blocked FC remote port time out: saving binding
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-1: blocked FC remote port time out: saving binding
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-2: blocked FC remote port time out: saving binding
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-3: blocked FC remote port time out: saving binding
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-4: blocked FC remote port time out: saving binding
Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-5: blocked FC remote port time out: saving binding

< lots of IO errors > 

Jan  6 18:35:37 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: scsi(1:2:4): LOOP RESET ISSUED.
Jan  6 18:35:46 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: qla2xxx_eh_bus_reset: reset succeded

< lots of IO errors > 

Jan  7 10:50:01 cb-xen-srv11 syslogd 1.4.1: restart.

----------------------

So, the loop reset happens, the "blocked FC remote port" message prints immediately, then a while later there is another loop reset.  

Apparently the remote FC port never becomes un-blocked. 

This, plus the fact that this works on Emulex and not on QLogic, suggests that I should have QLogic look at it. 
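
The pattern described above can be pulled out of a large /var/log/messages mechanically. The following is only a minimal sketch in Python, assuming the message wording matches the excerpt quoted in this comment (other driver or kernel versions may phrase these lines differently). The function name `summarize` and the `SAMPLE` data are illustrative and not part of any tool used on this bug.

```python
import re
from collections import defaultdict

# Patterns copied from the log excerpt in this comment (assumption: the
# wording is the same throughout the full log).
RESET_RE = re.compile(r"qla2xxx (\S+): scsi\(\d+:\d+:\d+\): LOOP RESET ISSUED")
RPORT_RE = re.compile(r"(rport-\d+:\d+-\d+): blocked FC remote port time out")


def summarize(lines):
    """Count loop resets per PCI function and list rports that timed out."""
    resets = defaultdict(int)
    timed_out = set()
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = RPORT_RE.search(line)
        if m:
            timed_out.add(m.group(1))
    return dict(resets), sorted(timed_out)


# A few lines taken verbatim from the excerpt, used as sample input.
SAMPLE = [
    "Jan  6 18:04:20 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: scsi(1:2:3): LOOP RESET ISSUED.",
    "Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-0: blocked FC remote port time out: saving binding",
    "Jan  6 18:04:36 cb-xen-srv11 kernel:  rport-1:0-1: blocked FC remote port time out: saving binding",
    "Jan  6 18:35:37 cb-xen-srv11 kernel: qla2xxx 0000:44:00.1: scsi(1:2:4): LOOP RESET ISSUED.",
]

if __name__ == "__main__":
    resets, timed_out = summarize(SAMPLE)
    print("loop resets per HBA function:", resets)
    print("rports that hit the blocked time out:", timed_out)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/messages"))`.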

By the way

1) How are you doing the bus reset?

These are not the messages I get when I do:

 echo 1 > /sys/class/fc_host/host2/issue_lip

so it must be something else.

2) Just to be clear, when you say:

> A colleague of mine ran a similar test on a server that was booting 
> on a local disk and did not encounter this issue.

Do you mean that they booted from a local disk and then ran a similar test on a multipath QLogic configuration connected to similar storage?
Comment 3 Hector Arteaga 2009-01-15 14:33:35 EST
We have already gotten QLogic involved and they are looking into it.  As far as how exactly the resets are being generated, I'm not sure.  Our test bench application (hazard) generates them.  However, if I do an sg_reset -b /dev/.... I get the same effect.
To clarify the "A colleague of mine ..." statement, my colleague was running the same test on a Qlogic multipath environment with similar storage.  The primary differences between his setup and mine are that he was booted from a local disk and he was not running the Xen kernel, just the standard RH5.3 kernel.
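
When reproducing with sg_reset, one way to see whether the remote ports ever recover is to read the FC transport class state from sysfs after the reset. This is a hedged sketch only: it assumes the kernel exposes /sys/class/fc_remote_ports/rport-*/port_state (as the FC transport class normally does) and that a healthy port reports "Online". The helper names `rport_states` and `not_online` are made up for illustration.

```python
import glob
import os


def rport_states(sysfs_root="/sys/class/fc_remote_ports"):
    """Return {rport name: port_state} for each FC remote port under sysfs_root."""
    states = {}
    for path in sorted(glob.glob(os.path.join(sysfs_root, "rport-*"))):
        name = os.path.basename(path)
        try:
            with open(os.path.join(path, "port_state")) as f:
                states[name] = f.read().strip()
        except OSError:
            # Attribute missing or unreadable; record that rather than fail.
            states[name] = "unreadable"
    return states


def not_online(states):
    """Names of remote ports whose state is anything other than Online."""
    return sorted(name for name, state in states.items() if state != "Online")


if __name__ == "__main__":
    for name in not_online(rport_states()):
        print(name)
```

Running this a minute or so after the reset and seeing the same rports still listed would match the "never becomes un-blocked" observation in comment 2.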
Comment 4 Hector Arteaga 2009-02-09 17:32:47 EST
Closing this issue.  QLogic was contacted and root-caused the issue.  The FW on the Virtual Connect FC module in my C-Class enclosure was found to be the culprit.  A new FW with a fix for this issue is to be released at a later time.