Description of problem:
Multipathing does not work correctly with RHEL 5.3 Beta and Snap1. While I/O is running, one of the two paths is manually disconnected. When that FC path is reconnected, the DM-MP multipather is expected to bring the path online again, but it does not. This happens on two servers (x3655 and pSeries) with Emulex HBAs and the pre-release driver 8.2.0.30. The directory /sys/class/scsi_device/2:0:1:19/device shows that the OS recognized the reconnected path. Is it possible that the new Emulex HBA driver needs special settings for multipathing to work? Emulex does not currently provide support, as 8.2.0.30 is still a pre-release version. A third server with Itanium2 and a QLogic HBA did not have any problem at all.

Version-Release number of selected component (if applicable):
[root@x3655-lab-02 BLAST]# uname -a
Linux x3655-lab-02 2.6.18-121.el5 #1 SMP Mon Oct 27 21:46:55 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Tried with different FC paths

Steps to Reproduce:
1.
2.
3.

Actual results:
Paths did not come online again

Expected results:
All paths online after reconnecting

Additional info:
Is this a new setup, or did this previously work with an older version of the Emulex driver and multipath?

Can you please provide the following information:
# cat /etc/multipath.conf
# multipath -ll

Can you please start multipathd manually using
# multipathd -v3
and attach the output from /var/log/messages while you disconnect and reconnect the path.
Created attachment 323617 multipath.conf file
Created attachment 323620 output multipath -ll
Created attachment 323622 start multipathd -v3
Created attachment 323624 /var/log/messages file
Helpful comments regarding the /var/log/messages file:

The system was rebooted:
Nov 14 18:47:44 x3655-lab-02 syslogd 1.4.1: restart.

multipathd -v3:
Nov 14 19:05:23 x3655-lab-02 multipathd: --------start up--------

Port block:
Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0
When you unplug the cable, we see the rport dev loss timeout:
Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0
Nov 14 19:16:01 x3655-lab-02 kernel: rport-2:0-3: blocked FC remote port time out: saving binding
Nov 14 19:16:01 x3655-lab-02 kernel: rport-2:0-2: blocked FC remote port time out: saving binding

Then, shortly after you plug the cable back in, we see this:
Nov 14 19:17:40 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1303 Link Up Event x5 received Data: x5 x13 x10 x2 x0 x0 0

If you cat the port_state file for the rports named in the log messages above, what is the port_state? Does it say Online?

What fast_io_fail_tmo value is being used? You can also get this from the rport's sysfs directory. The file is named fast_io_fail_tmo.
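To pull just these events out of a long log, something like the following may help. It is only a sketch: the function name fc_events is made up for illustration, and the patterns are taken from the messages quoted above, so other driver versions may word them differently.

```shell
# Print only the lpfc link up/down events and the rport timeout messages.
# fc_events is a hypothetical helper name; the optional first argument
# is the log file to search (default: /var/log/messages).
fc_events() {
    grep -E 'Link (Down|Up) Event|blocked FC remote port time out' \
        "${1:-/var/log/messages}"
}
```

Running fc_events with no argument searches /var/log/messages; pass another file name to search a saved copy of the log.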
This is the current status:

If you cat out the port_state file for the rports that you see the log message for above, what is the port_state? Does it say online?

[root@x3655-lab-02 ~]# more /sys/class/fc_host/host2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_host/host1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/port_state
Online
[root@x3655-lab-02 ~]#

##############################################

What is the fast_io_fail tmo value that is being used? You can also get this in the rport's sysfs dir. The file is named fast_io_fail_tmo.
[root@x3655-lab-02 ~]# find / -name fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
off
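The same two values can be collected for every rport in one pass instead of one command per file. This is a rough sketch only: show_rports is a made-up helper name, and the optional directory argument is an assumption added so the function can be pointed at a copy of the sysfs tree.

```shell
# Print port_state and fast_io_fail_tmo for every FC remote port.
# show_rports is a hypothetical helper; the optional first argument
# overrides the sysfs directory (default: /sys/class/fc_remote_ports).
show_rports() {
    base="${1:-/sys/class/fc_remote_ports}"
    for rport in "$base"/rport-*; do
        # Skip the unexpanded glob when no rports exist.
        [ -d "$rport" ] || continue
        state=$(cat "$rport/port_state" 2>/dev/null || echo "unknown")
        tmo=$(cat "$rport/fast_io_fail_tmo" 2>/dev/null || echo "unknown")
        printf '%s port_state=%s fast_io_fail_tmo=%s\n' \
            "${rport##*/}" "$state" "$tmo"
    done
}
```

Calling show_rports with no argument reads the live sysfs tree and prints one line per rport, which is easier to paste into a comment than the individual outputs.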
Could you try this kernel? http://people.redhat.com/dzickus/el5/123.el5/ I just tried that here and it worked fine for me. The only differences are that I was using a different target connected to my lpfc card, and was using the readsector0 path checker. One fix went into this code path recently: we were not calling the dev loss and fast io fail tmo's properly. I am not sure if the fix made it into the snap you are using, but it is definitely in the .123 kernel.
I think we are hitting this: http://marc.info/?l=linux-scsi&m=122719664311663&w=2 Can you try the patch in that mail? It should apply to our kernel with some offsets, but those can be ignored.
For some reason we had to rearrange our test setup. I'll continue testing on Wednesday. All test servers were upgraded to Snap3.
This problem appears to have been solved in kernel-2.6.18-123.el5. If you are using a more recent kernel, and can still see this, please reopen this bug.
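A quick way to check whether a machine is already on a kernel at or past that build is sketched below. The function name kernel_has_fix and its return codes are made up for illustration, and it only looks at the build number after "2.6.18-", so treat the result as a hint and not as proof that the fix is present.

```shell
# Succeed if the kernel release has an EL5 build number of at least 123,
# e.g. 2.6.18-123.el5. kernel_has_fix is a hypothetical helper; the
# optional first argument is a release string (default: uname -r).
# Returns 0 if build >= 123, 1 if older, 2 if the string is not a
# 2.6.18-<build> release and no answer can be given.
kernel_has_fix() {
    release="${1:-$(uname -r)}"
    case "$release" in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    build="${release#2.6.18-}"   # e.g. 121.el5
    build="${build%%.*}"         # e.g. 121
    case "$build" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$build" -ge 123 ]
}
```

For the kernel from the original report, kernel_has_fix 2.6.18-121.el5 fails, while kernel_has_fix 2.6.18-123.el5 succeeds.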