471615 – multipathing problem with Emulex HBAs

Summary: multipathing problem with Emulex HBAs

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Red Hat Kernel Manager
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-11-14 17:14 UTC by Achim Warnecke
Modified:	2010-07-27 19:12 UTC
CC List:	16 users
Fixed In Version:	kernel-2.6.18-123.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-07-27 19:12:54 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
multipath.conf file (1.27 KB, text/plain) 2008-11-14 18:40 UTC, Achim Warnecke
output multipath -ll (6.89 KB, application/octet-stream) 2008-11-14 18:42 UTC, Achim Warnecke
start multipathd -v3 (124 bytes, text/plain) 2008-11-14 18:42 UTC, Achim Warnecke
/var/log/messages file (206.65 KB, text/plain) 2008-11-14 18:44 UTC, Achim Warnecke

Description Achim Warnecke 2008-11-14 17:14:54 UTC

Description of problem:
I could not get multipathing to work correctly with RHEL 5.3 Beta and
Snap1. While I/O is running, one of the two paths is manually disconnected.
When that FC path is reconnected, the DM-MP multipather is expected to put the
path online again, but it does not. This happens on two servers (x3655 and
pSeries) with Emulex HBAs and the pre-release driver 8.2.0.30.
The directory </sys/class/scsi_device/2:0:1:19/device> shows that the OS
recognized the reconnected path. So is it possible that the new Emulex HBA
driver needs special settings in order to get multipathing working?
Right now Emulex does not provide support, as 8.2.0.30 is still a pre-release
version.

A third server with Itanium2 and Qlogic HBA did not have any problem at all. 



Version-Release number of selected component (if applicable):
[root@x3655-lab-02 BLAST]# uname -a
Linux x3655-lab-02 2.6.18-121.el5 #1 SMP Mon Oct 27 21:46:55 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Tried with different FC-paths

Steps to Reproduce:
1.
2.
3.
  
Actual results:
paths did not come online again

Expected results:
expected to have all paths online after reconnecting

Additional info:

Comment 1 Ben Marzinski 2008-11-14 17:38:57 UTC

Is this a new setup, or did this previously work with an older version of the Emulex driver and multipath?

Can you please provide the following information:
# cat /etc/multipath.conf
# multipath -ll

Can you please start multipathd up manually using
# multipathd -v3

and attach the output from /var/log/messages while you disconnect and reconnect the path.

Comment 2 Achim Warnecke 2008-11-14 18:40:29 UTC

Created attachment 323617
multipath.conf file

Comment 3 Achim Warnecke 2008-11-14 18:42:05 UTC

Created attachment 323620
output multipath -ll

Comment 4 Achim Warnecke 2008-11-14 18:42:53 UTC

Created attachment 323622
start multipathd -v3

Comment 5 Achim Warnecke 2008-11-14 18:44:11 UTC

Created attachment 323624
/var/log/messages file

Comment 6 Achim Warnecke 2008-11-14 18:48:01 UTC

Helpful comments regarding /var/log/messages file

The System was rebooted:
Nov 14 18:47:44 x3655-lab-02 syslogd 1.4.1: restart.

multipathd -v3:
Nov 14 19:05:23 x3655-lab-02 multipathd: --------start up-------- 

port block:
Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0

Comment 7 Mike Christie 2008-11-17 03:08:54 UTC

When you unplug the cable we see the rport dev loss timeout:

Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0
Nov 14 19:16:01 x3655-lab-02 kernel:  rport-2:0-3: blocked FC remote port time out: saving binding
Nov 14 19:16:01 x3655-lab-02 kernel:  rport-2:0-2: blocked FC remote port time out: saving binding

Then shortly after you plug the cable back in we see this:
Nov 14 19:17:40 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1303 Link Up Event x5 received Data: x5 x13 x10 x2 x0 x0 0

If you cat out the port_state file for the rports that you see the log message for above, what is the port_state? Does it say online?

What is the fast_io_fail tmo value that is being used? You can also get this in the rport's sysfs dir. The file is named fast_io_fail_tmo.
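
The link and rport events quoted in this comment can be pulled out of a large syslog file with a small filter. What follows is a minimal sketch and was not posted in the original thread: the function name `filter_fc_events` is hypothetical, and the patterns are assumed from the three message shapes quoted above (lpfc "Link Down Event", lpfc "Link Up Event", and "blocked FC remote port time out"). Other driver versions may word these messages differently.

```shell
# filter_fc_events: hypothetical helper, not part of any Red Hat or Emulex tool.
# Keeps only the lpfc link up/down events and the FC transport rport time outs.
# Reads the files given as arguments, or standard input if none are given.
filter_fc_events() {
    grep -E 'lpfc .*Link (Down|Up) Event|blocked FC remote port time out' "$@"
}
```

A call such as `filter_fc_events /var/log/messages` would then list the events in order. In the excerpt above, the Link Down event is logged at 19:15:31 and the rport time outs at 19:16:01, that is 30 seconds apart.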

Comment 8 Achim Warnecke 2008-11-17 11:42:12 UTC

This is the current status:

If you cat out the port_state file for the rports that you see the log message
for above, what is the port_state? Does it say online?

[root@x3655-lab-02 ~]# more /sys/class/fc_host/host2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_host/host1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/port_state
Online
[root@x3655-lab-02 ~]#
##############################################

What is the fast_io_fail tmo value that is being used? You can also get this in
the rport's sysfs dir. The file is named fast_io_fail_tmo.

[root@x3655-lab-02 ~]# find / -name fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
off
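
The per-rport checks in this comment can be collapsed into one loop. This is a minimal sketch, not part of the original report: the function name `show_rports` and its optional base-directory argument are hypothetical (the argument exists only so the loop can be tried against a test directory), while the attribute names `port_state` and `fast_io_fail_tmo` are the ones used above.

```shell
# show_rports: hypothetical helper that prints one line per FC remote port.
# $1 is an optional base directory; it defaults to the sysfs path used above.
show_rports() {
    base="${1:-/sys/class/fc_remote_ports}"
    for dir in "$base"/rport-*; do
        # Skip the unexpanded glob when no rport directories exist.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # Print "?" when an attribute file is missing or unreadable.
        state=$(cat "$dir/port_state" 2>/dev/null || echo "?")
        fio=$(cat "$dir/fast_io_fail_tmo" 2>/dev/null || echo "?")
        printf '%s port_state=%s fast_io_fail_tmo=%s\n' "$name" "$state" "$fio"
    done
}
```

Running `show_rports` with no argument reads the live sysfs tree; on a host without FC remote ports it prints nothing.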

Comment 9 Mike Christie 2008-11-17 18:03:47 UTC

Could you try this kernel?
http://people.redhat.com/dzickus/el5/123.el5/

I just tried that here and it worked fine for me. The only difference is that I was using a different target connected to my lpfc card, and was using the readsector0 path checker.

There was one fix that went in recently in this code path where we did not call the dev loss and fast io fail tmos properly. I am not sure if the fix made it into the snap you are using, but it is definitely in the .123 kernel.

Comment 10 Mike Christie 2008-11-20 18:21:56 UTC

I think we are hitting this
http://marc.info/?l=linux-scsi&m=122719664311663&w=2

Can you try the patch in that mail? It should apply to our kernel with some offsets, but those can be ignored.

Comment 11 Achim Warnecke 2008-11-24 12:25:13 UTC

For some reason we had to rearrange our test setup. I'll continue testing on Wednesday. All test servers were upgraded to Snap3.

Comment 14 Ben Marzinski 2010-07-27 19:12:54 UTC

This problem appears to have been solved in kernel-2.6.18-123.el5. If you are using a more recent kernel, and can still see this, please reopen this bug.
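
Whether a host is already at or past the fixed build can be checked from the kernel release string. This is a rough sketch under stated assumptions: the function name `has_fix` is hypothetical, it only understands RHEL 5 style release strings of the form 2.6.18-N.el5, and it compares the build number N against 123 as named in this comment. It says nothing about custom or third-party kernels.

```shell
# has_fix: hypothetical helper, assumes a RHEL 5 release string "2.6.18-N...".
# Returns 0 if N >= 123, 1 if N < 123, 2 if the string is not recognized.
# $1 is an optional release string; it defaults to the running kernel.
has_fix() {
    rel="${1:-$(uname -r)}"
    case "$rel" in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    build=${rel#2.6.18-}   # "121.el5" from "2.6.18-121.el5"
    build=${build%%.*}     # "121"
    case "$build" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$build" -ge 123 ]
}
```

For example, `has_fix 2.6.18-121.el5` (the kernel from the original report) returns a non-zero status, while `has_fix 2.6.18-123.el5` returns zero.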
