Bug 471615
Summary: multipathing problem with Emulex HBAs
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Achim Warnecke <achim.warnecke>
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: agk, bmarzins, bmr, christophe.varoqui, dwysocha, egoggin, heinzm, iannis, james.brown, junichi.nomura, kueda, lmb, mbroz, mchristi, prockai, tranlan
Fixed In Version: kernel-2.6.18-123.el5
Doc Type: Bug Fix
Last Closed: 2010-07-27 19:12:54 UTC

Description
Achim Warnecke
2008-11-14 17:14:54 UTC
Is this a new setup, or did this previously work with an older version of the Emulex driver and multipath?

Can you please provide the following information:

# cat /etc/multipath.conf
# multipath -ll

Can you please start multipathd manually using

# multipathd -v3

and attach the output from /var/log/messages while you disconnect and reconnect the path.

Created attachment 323617
multipath.conf file

Created attachment 323620
output multipath -ll

Created attachment 323622
start multipathd -v3

Created attachment 323624
/var/log/messages file

Helpful comments regarding /var/log/messages file

The system was rebooted:
Nov 14 18:47:44 x3655-lab-02 syslogd 1.4.1: restart.

multipathd -v3:
Nov 14 19:05:23 x3655-lab-02 multipathd: --------start up--------

port block:
Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0

When you unplug the cable we see the rport dev loss timeout:

Nov 14 19:15:31 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80110 x0 x0
Nov 14 19:16:01 x3655-lab-02 kernel: rport-2:0-3: blocked FC remote port time out: saving binding
Nov 14 19:16:01 x3655-lab-02 kernel: rport-2:0-2: blocked FC remote port time out: saving binding

Then shortly after you plug the cable back in we see this:

Nov 14 19:17:40 x3655-lab-02 kernel: lpfc 0000:22:00.1: 1:1303 Link Up Event x5 received Data: x5 x13 x10 x2 x0 x0 0

If you cat out the port_state file for the rports that you see the log message for above, what is the port_state? Does it say online?

What is the fast_io_fail tmo value that is being used? You can also get this in the rport's sysfs dir. The file is named fast_io_fail_tmo.

This is the current status:

If you cat out the port_state file for the rports that you see the log message for above, what is the port_state? Does it say online?
[root@x3655-lab-02 ~]# more /sys/class/fc_host/host2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_host/host1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/port_state
Online
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/port_state
Online
[root@x3655-lab-02 ~]#

##############################################

What is the fast_io_fail tmo value that is being used? You can also get this in the rport's sysfs dir. The file is named fast_io_fail_tmo.
[root@x3655-lab-02 ~]# find / -name fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
/sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-5/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-4/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-2:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-3/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo
off
[root@x3655-lab-02 ~]# more /sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo
off

Could you try this kernel?

http://people.redhat.com/dzickus/el5/123.el5/

I just tried that here and it worked fine for me. The only difference is that I was using a different target connected to my lpfc card, and was using the readsector0 path checker.

There was one fix that went in recently in this code path where we did not call the dev loss and fast io fail tmo's properly. I am not sure if the fix made it into the snap you are using, but it is definitely in the .123 kernel.

I think we are hitting this:

http://marc.info/?l=linux-scsi&m=122719664311663&w=2

Can you try the patch in that mail? It should apply to our kernel with some offsets, but that can be ignored.

For some reason we had to rearrange our test setup. I'll continue testing on Wednesday.
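
For anyone repeating the rport checks shown above after each cable pull, the per-port `more` commands can probably be folded into one loop. This is only a sketch: the function name `show_rports` and its optional directory argument are made up for illustration (the argument exists so the loop can be pointed at a copy of the tree); the default path and the two attribute names are the ones quoted in this report.

```shell
#!/bin/sh
# Print port_state and fast_io_fail_tmo for every FC remote port.
# show_rports and its optional directory argument are hypothetical helpers;
# the default path is the sysfs directory used earlier in this report.
show_rports() {
    base="${1:-/sys/class/fc_remote_ports}"
    for dir in "$base"/rport-*; do
        [ -d "$dir" ] || continue    # glob matched nothing, or not a directory
        name=$(basename "$dir")
        # Fall back to "n/a" when an attribute file is missing or unreadable.
        state=$(cat "$dir/port_state" 2>/dev/null || echo "n/a")
        fiot=$(cat "$dir/fast_io_fail_tmo" 2>/dev/null || echo "n/a")
        printf '%s port_state=%s fast_io_fail_tmo=%s\n' "$name" "$state" "$fiot"
    done
}
```

Calling `show_rports` with no argument reads the live sysfs tree; running it before unplugging, after the link-down, and after the link-up gives three snapshots that can be compared directly.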
All Test Servers were upgraded to Snap3.

This problem appears to have been solved in kernel-2.6.18-123.el5. If you are using a more recent kernel, and can still see this, please reopen this bug.
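
Before reopening, it may help to confirm that the running kernel is at or past the build named in "Fixed In Version". The sketch below assumes the RHEL 5 release naming `2.6.18-<build>...` as in kernel-2.6.18-123.el5; the function names `kernel_build` and `has_fix` are hypothetical, and any other naming scheme is simply reported as "no fix".

```shell
#!/bin/sh
# Extract the build number from a RHEL 5 style kernel release string,
# e.g. "2.6.18-123.el5" -> "123". Assumes the 2.6.18-<build>... naming.
kernel_build() {
    rel="${1#*-}"        # drop everything up to the first "-": "123.el5"
    echo "${rel%%.*}"    # drop everything from the first ".":  "123"
}

# Succeed when the given release (default: the running kernel) has a build
# number of at least 123, the build listed under "Fixed In Version".
# A non-numeric build (unexpected naming) makes the test fail quietly.
has_fix() {
    build=$(kernel_build "${1:-$(uname -r)}")
    [ "$build" -ge 123 ] 2>/dev/null
}
```

For example, `has_fix && echo "fix present" || echo "older than -123"` checks the running kernel, and `has_fix 2.6.18-92.el5` checks a specific release string.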