Bug 606259
Summary: [NetApp 5.6 bug] Delayed DM-Multipath IO resumption on RHEL 5.5

Product: Red Hat Enterprise Linux 5
Component: device-mapper-multipath
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Keywords: OtherQA, Regression
Target Milestone: rc
Target Release: 5.6
Reporter: Tanvi <tanvi>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Red Hat Kernel QE team <kernel-qe>
Doc Type: Bug Fix
Bug Blocks: 557597
Last Closed: 2010-11-30 19:26:06 UTC
CC: agk, andriusb, bdonahue, bmarzins, bmr, christophe.varoqui, coughlan, cward, dwysocha, grimme, heinzm, junichi.nomura, jwest, kueda, lmb, marco, marting, mbroz, mchristi, msnitzer, prajnoha, prockai, rmusil, rsarraf, syeghiay, xdl-redhat-bugzilla
Created attachment 425574 [details]
sysrq dump
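Dumps like the one attached (and the sysrq-t / sysrq-w output requested later in this bug) are captured through the kernel's magic-sysrq interface. A dry-run sketch that prints the required commands rather than executing them, since they need root on the affected host:

```shell
# Dry-run sketch: print the steps used to capture sysrq task dumps.
# On the affected host the echoed commands must be run as root; the
# resulting traces land in dmesg / on the console.
plan="echo 1 > /proc/sys/kernel/sysrq      # enable magic sysrq"
for key in t w; do   # t = dump all tasks, w = dump blocked (uninterruptible) tasks
  plan="$plan
echo $key > /proc/sysrq-trigger"
done
printf '%s\n' "$plan"
```

On a serial or physical console the same dumps can be triggered with Alt-SysRq-t / Alt-SysRq-w instead of /proc/sysrq-trigger.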
Can you also please post your /etc/multipath.conf and the output of "multipath -ll"?

Following is the multipath.conf file:

defaults {
        user_friendly_names no
        max_fds max
        queue_without_daemon no
        #verbosity 4
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd|vd)[a-z]*"
        devnode "^hd[a-z]"
        devnode " ^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode " ^cciss!c[0-9]d[0-9]*"
        ## BEGIN multipath.conf AUTOMATION ##
        ## END multipath.conf AUTOMATION ##
}

devices {
        device {
                ## BEGIN multipath.conf AUTOMATION ##
                vendor "NETAPP"
                product "LUN"
                path_grouping_policy group_by_prio
                features "1 queue_if_no_path"
                prio_callout "/sbin/mpath_prio_alua /dev/%n"
                path_checker directio
                failback immediate
                hardware_handler "1 alua"
                rr_weight uniform
                rr_min_io 128
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                flush_on_last_del yes
                ## END multipath.conf AUTOMATION ##
        }
}

Created attachment 425848 [details]
multipath -ll o/p
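The devnode blacklist entries in the multipath.conf above are treated as extended regular expressions, so they can be sanity-checked outside multipath with grep -E (a hedged sketch approximating multipath's matcher; note that a stray leading space inside the quotes, as in two of the devnode lines above, yields a pattern that matches nothing):

```shell
# Check candidate device names against a devnode blacklist pattern.
# matches PATTERN NAME -> exit 0 if NAME matches PATTERN as an
# extended regular expression (approximating multipath's regex match).
matches() { printf '%s\n' "$2" | grep -Eq "$1"; }

good='^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*'
bad=' ^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*'   # stray leading space, as above

matches "$good" ram0 && echo "ram0: blacklisted"
matches "$good" sda  || echo "sda: not blacklisted"
matches "$bad"  ram0 || echo "ram0: NOT caught by the leading-space pattern"
```

The last line demonstrates why the space-prefixed entries are inert: nothing can match a literal space followed by a start-of-string anchor.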
This is quite possibly the same bug as 601665. There is a z-stream fix for that already in QA. I'm actually filing another z-stream errata today, for 607911. When I have that package built, I'll put it up on my people page, and they can test it to see if they can still reproduce the error.

I take that back. This is clearly a different bug. From the logs, I can see:

Jun 16 13:20:59 IBMx346-200-114 multipathd: checker failed path 66:160 in map 360a98000572d4c736d34577057615458
Jun 16 13:20:59 IBMx346-200-114 kernel: sd 0:0:0:3: SCSI error: return code = 0x00010000
Jun 16 13:25:42 IBMx346-200-114 multipathd: 360a98000572d4c736d34577057615458: remaining active paths: 3

The only thing separating these two multipathd messages is an ioctl to tell the kernel to fail the device, which is right where we get stuck according to the sysrq dump:

 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8007675b>] smp_call_function_many+0x38/0x4c
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8007684c>] smp_call_function+0x4e/0x5e
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8819d4ad>] :dm_mod:dev_wait+0x0/0x83
 [<ffffffff800947db>] on_each_cpu+0x10/0x22
 [<ffffffff800d0f31>] __remove_vm_area+0x2b/0x42
 [<ffffffff800d0f60>] remove_vm_area+0x18/0x25
 [<ffffffff800d0fb4>] __vunmap+0x47/0xed
 [<ffffffff8819deff>] :dm_mod:ctl_ioctl+0x237/0x25b
 [<ffffffff80042175>] do_ioctl+0x55/0x6b
 [<ffffffff8003018e>] vfs_ioctl+0x457/0x4b9
 [<ffffffff8004c870>] sys_ioctl+0x59/0x78
 [<ffffffff8005d116>] system_call+0x7e/0x83

What kernel version are you running?

Kernel version is 2.6.18-194.3.1.el5.

Actually, the above kernel backtrace is from the waitevent loop, which shouldn't be holding any locks that keep the checkerloop from continuing. The checkerloop's backtrace is:

multipathd    S ffff8100026cd0c0  0 11239  1  11 240 11238 (NOTLB)
 ffff810062893e48 0000000000000082 0000000000000000 0000000000000000
 0000000000000220 0000000000000001 ffff8100657810c0 ffff8100026cd0c0
 000001082878fc2c 0000000000027192 ffff8100657812a8 000000037e693000
Call Trace:
 [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
 [<ffffffff8005aa13>] getnstimeofday+0x10/0x28
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff800efeb7>] sys_io_getevents+0x28c/0x350
 [<ffffffff800ef3c6>] timeout_func+0x0/0x10

Which is odd, since it isn't where I'd expect it to be. How locked up does this machine get when it gets locked up? And how many CPUs are in this machine?

(In reply to comment #9)
> Which is odd, since it isn't where I'd expect it to be. How locked up does this
> machine get when it gets locked up? and how many CPUs are in this machine?

The machine gets locked up for around 4 to 5 minutes, during which it displays the soft lockup messages on the console.
The cpuinfo is as follows:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.318
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.63
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Processors 1-3 report identical values, apart from:
processor 1: physical id 3, apicid 6, bogomips 6399.42
processor 2: physical id 0, apicid 1, bogomips 6399.39
processor 3: physical id 3, apicid 7, bogomips 6399.36

Update: This issue is blocking NetApp's certification/qualification of RHEL 5.5 and all customer upgrades... correct?

(In reply to comment #11)
> Update: This issue is blocking NetApp's certification/qualification of RHEL 5.5
> and all customer upgrades... correct?

Correct.

So, is this a regression? Are you able to run this same test on a RHEL 5.4 system without reproducing the error? If so, it might be useful to try to isolate the error. If you have a RHEL 5.4 system that can't reproduce this error, it would be a big help if you were able to install the RHEL 5.5 device-mapper and device-mapper-multipath packages on it, and test to see if it reproduces with the 5.5 userspace code and the 5.4 kernel code.

Assuming that it doesn't, it would be nice to step through various kernel builds to figure out when the problem was introduced.

Are you able to do this?

(In reply to comment #13)
> So, is this a regression? Are you able to run this same test on a RHEL 5.4
> system, without reproducing the error. If so, it might be useful to try to
> isolate the error. If you have a RHEL 5.4 system that can't reproduce this
> error, it would be a big help if you were able to install the RHEL 5.5
> device-mapper and device-mapper-multipath packages on it, and test to see if it
> reproduces with the 5.5 userspace code, and the 5.4 kernel code.
>
> Assuming that it doesn't, it would be nice to step through some various kernel
> builds to figure out when the problem was introduced.
>
> Are you able to do this?
It does look like a regression. Anyways, we'll try this on RHEL 5.4 as you suggested and update you with the results. Meanwhile, do you want any more logs/dumps for analyzing this?

The problem is that, from what I can tell, the soft lockup trace you're seeing is just a symptom of the problem. At the end of the waiteventloop's DM_DEVICE_WAITEVENT ioctl, it tries to free the params. To do this it needs to tell all the CPUs to flush the pages, but apparently at least one of the CPUs is busy. Unfortunately, I don't see any indication of what that CPU is stuck on, and I don't know a good way to find out while the machine is locked up.

Actually, what would be helpful is an sosreport from a setup where you can reproduce this. This looks like a kernel issue, but I can't say where in the kernel the problem is, so it would be helpful to see what modules are installed and being used.

Created attachment 428791 [details]
sosreport
Have you been able to try this with a 5.4 kernel and 5.5 userspace packages?

Yes, the same IO tests ran successfully on the following configs:

1) RHEL 5.4 with native packages:
   kernel-2.6.18-164.el5
   device-mapper-1.02.32-1.el5
   device-mapper-multipath-0.4.7-30.el5

2) RHEL 5.4 with 5.5 device-mapper packages:
   kernel-2.6.18-164.el5
   device-mapper-1.02.39-1.el5
   device-mapper-multipath-0.4.7-34.el5

Just to verify that this is really a kernel bug, and not some other package, it would be nice if you could either load a RHEL 5.5 kernel onto that RHEL 5.4 machine and see if you can reproduce it, or load a RHEL 5.4 kernel onto a RHEL 5.5 machine and see if it goes away.

I'm going to try to set up a machine in-house that can reproduce this. The debugging information I have so far doesn't give me much of a clue as to what could be the root cause of the hang, but if I had to guess, I would guess that it's in the scsi layer.

If you reproduce this again, can you run sysrq-t and sysrq-w when the machine is unresponsive again? I know that you already gave me this information once, and nothing looked amiss to me, but there seemed to be a number of missing processes in the sysrq-t output, and I'm hoping that the sysrq-w output might show something more useful this time.

Otherwise, would you be willing to try some different kernel packages, to see if we can narrow down where the regression was introduced? If I can get a working reproducer, I'll be trying that here.

Another possibility that might help would be installing the kernel-debug rpm, and trying to reproduce this with the debug kernel, to see if we get any more useful information.

(In reply to comment #22)
> Another possibility that might help would be installing the kernel-debug rpm,
> and trying to reproduce this with the debug kernel, to see if we get any more
> useful information.

Tom would like me to try to set up a system in house for you to use. What would be the requirements for the system?
#CPUs, backend storage config, etc. I should be able to cobble together something in Westford.

(In reply to comment #23)
> Tom would like me to try to set up a system in house for you to use. What
> would be the requirements for the system? #CPU's, backend storage config etc. I
> should be able to cobble together something in Westford.

It's a 4-CPU x86_64 machine, with 2 FC ports (either Emulex or QLogic HBAs should work) connected to a 2-controller NetApp box. The node is set up with multipathed root and boot filesystems, using devices on the NetApp box. Other than that, I just need a couple of other LUNs on the NetApp box for multipath to work with. It should be installed with the latest RHEL-5.5.z code. If you want more specifics, you should be able to get all the information you need from the sosreport.

How do you perform the controller faults in your tests? I have a setup that is reasonably similar, and I'm going to try and reproduce this error, so any information on how to reproduce it would be helpful.

(In reply to comment #25)
> How do you perform the controller faults in your tests. I have a setup that is
> reasonably similar, and I'm going to try and reproduce this error, so any
> information on how to reproduce it would be helpful.

We use root on dm-multipath hosts (i.e. SAN-booted) for all our tests. Run heavy IO on NetApp dm-multipathed data LUNs mapped to the RHEL host connected to the clustered NetApp controllers.

For controller faults:

1) Run 'cf takeover -f' on one controller head. This takes over the partner head. Wait 10 minutes for the takeover to complete.
2) After the above 10 minutes, run 'cf giveback -f' on this controller head. This relinquishes control back to the partner head. Wait 10 minutes for the giveback to complete.
3) Now repeat the same steps on the partner head.
4) Run the above in a loop.

We also include FC switch port block/unblock for the corresponding initiators/targets as part of our tests.
For all these cases, verify that IO properly fails over and fails back within the stipulated limits - in our case, we set the limit to 120 seconds, i.e. any failover/failback that exceeds 120 seconds is a failed test case for us.

(In reply to comment #21)
> Just to verify that this is really a kernel bug, and not some other package, it
> would be nice if you could either load a RHEL 5.5 kernel onto that RHEL 5.4
> machine and see if you can reproduce it, or load a RHEL 5.4 kernel onto a RHEL
> 5.5 machine and see if it goes away.
>
> If you reproduce this again, can you run sysrq-t and sysrq-w when the machine
> is unresponsive again. I know that you already gave me this information once,
> and nothing looked amiss to me, but there seemed to be a number of missing
> processes in the sysrq-t output, and I'm hoping that the sysrq-w output might
> show something more useful this time.

Will do.

> Otherwise, would you be willing to try some different kernel packages, to see
> if we can narrow down where the regression was introduced?

Send us the kernel packages as well. We'll try those too.

I tried to reproduce this all yesterday and overnight on a similar setup, using the instructions from comment 26, without success.

How long does this usually take to reproduce?

Does it usually reproduce when you're disabling the multipath paths by downing the switch ports, or by performing controller faults? My test just did the controller takeover and giveback.

I uploaded a bunch of kernels to http://people.redhat.com/bmarzins/kernels/

kernel-2.6.18-164.el5.x86_64.rpm is the RHEL 5.4 kernel.
kernel-2.6.18-194.3.1.el5.x86_64.rpm is the RHEL 5.5 kernel that you were using.
kernel-debug-2.6.18-194.3.1.el5.x86_64.rpm is the debug version of that kernel.
kernel-2.6.18-171.el5.x86_64.rpm, kernel-2.6.18-179.el5.x86_64.rpm, and kernel-2.6.18-187.el5.x86_64.rpm are some other kernels to try.
First off, it would be great if you could try the 5.5 kernel on a 5.4 machine, or a 5.4 kernel on a 5.5 machine, so we can verify that it really is the kernel that makes the difference.

(In reply to comment #28)
> I tried to reproduce this all yesterday and overnight on a similar setup, using
> the instructions from comment 26, without success.
>
> How long does this ususally take to reproduce?
>
> Does is usually reproduce when you're disabling the multipath paths by downing
> the switch ports, or by performing controller faults. My test just did the
> controller takeover and giveback.

We generally hit it within 4-5 iterations of controller faults (takeover/giveback). The host goes into an unresponsive state for 4-5 minutes only when it is performing IO.

While the mix-and-match tests are running, I'd like to put my findings in.

Ben, do you think this can be a file system issue? As per the tests here internally, with file systems, the bug is triggered within hours. I did some tests and noticed that during I/O a sync can take a very long time. This is seen especially when doing buffered I/O: sync goes into a stall for 20-30 minutes. This behavior is seen on the 2.6.35 rc kernels. Please see DaveC's commit on the same: http://lkml.org/lkml/2010/4/19/410

That may well be. My testing still hasn't hit anything all weekend, but aside from the root filesystem, all my IO was straight to the multipath devices. I'll go look at the sosreport, and set up my devices like that. Also, what are you using to generate the IO?

To generate the IO, we use "dt". I believe this tool has been shared with Red Hat. While doing IO, we specifically disable fsync in dt with the "disable=fsync" switch. dt has the intelligence to detect IO delays. In our tests, the number of seconds is set to 120, after which dt warns/bails out if IO does not take place.

I also used fio to run some tests.
When using buffered IO, if you initiate a sync, it goes into a very long wait, as explained in comment #31.

The settings used for fio are:

[global]
ioengine=libaio
buffered=1
rw=randwrite
bs=256k
size=2048m
directory=/tmp

[file1]
iodepth=4

[file2]
iodepth=32

[file3]
iodepth=8

[file4]
iodepth=16

I think the bug triggers faster if (size x 4) is greater than the size of the memory on the test machine. You can use either tool. Hopefully it should help you reproduce the bug.

Is the dt program you are using the same as the Data Test program available at http://www.scsifaq.org/RMiller_Tools/dt.html ? You are running dt on files on multipathed filesystems, correct? What are the parameters you are running dt with?

I can't see any options in the dt program from http://www.scsifaq.org/RMiller_Tools/dt.html to detect IO delays. Am I missing something, or do you use another program? If so, can someone please point me at where I can get it?

I believe the author of dt now works for NetApp. They probably have a newer version of the program.

(In reply to comment #35)
> I can't see any options to enable the dt program from
> http://www.scsifaq.org/RMiller_Tools/dt.html to detect IO delays. Am I missing
> something, or do you use another program. If so, can someone please point me
> at where I can get it?

The dt utility mentioned on Robin Miller's web page seems quite old - it does not have the options to detect IO delays. I'll send an updated dt version for the same.

(In reply to comment #32)
> That may well be. My testing still hasn't hit anything all weekend, but aside
> from the root filesystem, all my IO was straight to the multipath devices.
> I'll go look at the sosreport, and setup my devices like that.

Won't the 'directio' path checker use fs requests to check a path, unlike other path checkers like 'tur'?

Created attachment 431466 [details]
dt v16.19 source tar ball
Untar the above tar ball, and then compile the dt utility as follows:
# pwd
/root/tmp/dt.d
# cd linux2.6-x86/
# make -f ../Makefile.linux VPATH=..
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dt.o ../dt.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtaio.o ../dtaio.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtfifo.o ../dtfifo.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtgen.o ../dtgen.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtinfo.o ../dtinfo.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtmmap.o ../dtmmap.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtprocs.o ../dtprocs.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtread.o ../dtread.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtwrite.o ../dtwrite.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtstats.o ../dtstats.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dttape.o ../dttape.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dttty.o ../dttty.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtutil.o ../dtutil.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -c -o dtusage.o ../dtusage.c
loading dt ...done
# ./dt version
--> Date: April 3rd, 2009, Version: 16.19, Author: Robin T. Miller <--
Controller config: use the latest DATA ONTAP v7.3.3 available on the NOW site on the NetApp controllers.

Host config: in addition to the root LUN mapped from either of the NetApp controller heads, map 10 data LUNs (5 from each head) of 5G each to the RHEL host. Ensure ALUA is enabled on the corresponding host igroup on each controller head by running "igroup set -f <igroup_name> alua yes". On the RHEL 5.5 root-on-multipath host, configure dm-multipath on these 10 data LUNs and then create a single LV stripe (of, say, 45G) across all these multipath devices (total size 50G). Mount an ext3 fs on this device.

Note - I suspect there are limitations in the ALUA hardware handler for handling port transitioning - so please turn it off in multipath.conf, i.e. use the default setting of 0 for the hardware handler. Remaining settings as per comment #3.

Assuming the LV stripe is 45G, run dt on this striped volume as follows:

./dt of=/mnt/tmp/TestFile iotype=random limit=10g runtime=24h enable=compare oncerr=abort errors=1 procs=4 disable=fsync oflags=trunc incr=var min=2k max=64k dispose=keep pattern=iot enable=debug enable=syslog enable=noprog noprogt=120s noprogtt=120s alarm=3s log=/var/log/dt_log disable=pstats

This ensures that 4 threads pump IO (10G each) onto the LV device. After the dt IO threads complete, check the dt logs for the results (/var/log/dt_log as given above). During IO on the host, run forced takeovers/givebacks (in a loop) on the NetApp controller heads.

(In reply to comment #39)
> (In reply to comment #32)
> > That may well be. My testing still hasn't hit anything all weekend, but aside
> > from the root filesystem, all my IO was straight to the multipath devices.
> > I'll go look at the sosreport, and setup my devices like that.
>
> Won't the 'directio' path checker use fs requests to check a path, unlike other
> path checkers like 'tur'?
Yes, it does regular IO, but it does it to the block device, not to a filesystem on top of the block device. Also, I was only thinking about the IO I was running to put the machine under load. However, I haven't had any luck reproducing it with my own IO loads using a filesystem setup similar to your sosreport's. I'm going to try with dt now.

Ben,

There's this other bug that you own, https://bugzilla.redhat.com/show_bug.cgi?id=606801, with similar behavior.

We have run some tests (both take/give and port offline) cutting out the file system, and the problem is not seen yet. It has been running successfully for 32 hrs.

The tests are: running dt on top of multipathed devices and triggering faults on the NetApp target.

Will let you know how the full 48 hr cycle goes. And this test is on iSCSI.

(In reply to comment #43)
> Ben,
>
> There's this other bug that you own,
> https://bugzilla.redhat.com/show_bug.cgi?id=606801, with similar behavior.
>
> We have run some tests (both take/give and port offline) cutting down the file
> system and the problem is not seen yet. Been running successfully for 32 hrs.
>
> The tests are; running dt on top of multipathed devices and triggering faults
> on the NetApp target.
>
> Will let you know how the full 48hrs cycle goes.

I have a system set up pretty much identical to the specifications in comment 41, and I've been running the tests on it for a while now. I still cannot reproduce this issue at all. Have you been able to try some of the other kernels?

(In reply to comment #43)
> The tests are; running dt on top of multipathed devices and triggering faults
> on the NetApp target.
>
> Will let you know how the full 48hrs cycle goes.
The 48 hr tests completed successfully and there was no delayed IO seen (the setup as described in comment #43).

(In reply to comment #46)
> The 48hrs tests completed successfully and there was no delayed IO seen (the
> setup as described in Comment #43)

Actually, the above 48 hour test which completed successfully was on an iSCSI setup with the fs disabled.

Ben,

We've hit the delayed IO consistently on an FC setup with the ALUA handler enabled and with an fs on the multipathed devices. But we are yet to verify whether this is seen without the handler or without the fs.

We support implicit ALUA alone, with the following valid states:

sd 0:0:1:0: alua: port group 00 state A supports ToUsNA

But the ALUA handler has the following code snippet:

	if (h->tpgs & TPGS_MODE_EXPLICIT) {
		switch (h->state) {
		case TPGS_STATE_TRANSITIONING:
			/* State transition, retry */
			goto retry;
			break;
		case TPGS_STATE_OFFLINE:
			/* Path is offline, fail */
			err = SCSI_DH_DEV_OFFLINED;
			break;
		default:
			break;
		}
	} else {
		/* Only Implicit ALUA support */
		if (h->state == TPGS_STATE_OPTIMIZED ||
		    h->state == TPGS_STATE_NONOPTIMIZED ||
		    h->state == TPGS_STATE_STANDBY)
			/* Useable path if active */
			err = SCSI_DH_OK;
		else
			/* Path unuseable for unavailable/offline */
			err = SCSI_DH_DEV_OFFLINED;
	}

During controller faults, the LUN is in the 'transitioning' state. But from the above code, it seems this is handled for explicit ALUA alone, i.e. the ALUA handler does not treat 'transitioning' properly. Do you think this could have contributed to the delayed IO?

(In reply to comment #48)
> During controller faults, the lun is in 'transitioning' state. But from the
> above code, it seems this is handled for explicit ALUA alone i.e. the ALUA
> handler does not treat 'transitioning' properly.
>
> Do you think this could have contributed to the delayed IO?

It's quite possible that this is related. Have you been able to try this without alua enabled?
But I thought in your multipath.conf you disabled the alua hardware handler, like you mentioned in comment #41. Does the hardware handler code still get called if you don't enable it in multipath.conf?

Created attachment 434375 [details]
Script to start IO and run faults on netapp controllers
Ben,
We haven't hit the issue without alua enabled.
We are hitting this issue with 8G initiators (QLogic and Emulex) with ALUA enabled on both host and controllers and hardware_handler set to "0". Here are the steps to recreate:
1. Map 40 LUNs, 20 from each head.
2. Create a VG on top of all 40 LUNs.
3. Create 3 LVs striped across all 40 LUNs and mount them.
4. Copy the dt tool to /sbin/dt and provide exec permission.
5. Copy the attached script to the host at /root/fault.pl
6. Start the script and enter the details when asked.
7. Verify that the IO is happening on the devices.
The script will log the status and dt logs to the /root/io_log/ directory. You can monitor the script status through the /root/io_log/Status file. The Status file will have logs about IO delays if the issue gets reproduced. Once the issue is seen, please kill the script.
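The controller-fault cycle that the attached script drives (per comment 26: forced takeover, wait, forced giveback, wait, repeat on the partner head) can be sketched as a dry-run loop. The controller hostnames and the idea of dispatching 'cf' commands over ssh are hypothetical here; the attached fault.pl is the real driver:

```shell
# Dry-run sketch of the takeover/giveback loop from comment 26.
# run() only echoes the command; on a live setup it might become
# something like: ssh "$1" "$2" (hypothetical transport).
run() { echo "[$1] $2"; }
WAIT_MIN=10           # comment 26: wait 10 minutes after each step
heads="headA headB"   # hypothetical controller hostnames

fault_cycle() {
  for head in $heads; do
    run "$head" "cf takeover -f"   # this head takes over its partner
    echo "sleep $((WAIT_MIN * 60))"
    run "$head" "cf giveback -f"   # control returned to the partner
    echo "sleep $((WAIT_MIN * 60))"
  done
}
fault_cycle
```

Running fault_cycle in an outer loop while dt pumps IO reproduces the test described in comments 26 and 41.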
The 3040 we have in Westford has 2Gb adapters in it currently - would that be sufficient to test this on the target side? Created attachment 434617 [details]
FC setup details
We have seen this issue with both 4Gb and 8Gb HBAs on the controllers. We have not yet tested with 2Gb, but you can surely give it a try. Attached is a doc with the FC setup details.
The FAS3040 controllers may have on-board 4Gb adapters. Please run the command "sysconfig -av 0" to check if the on-board cards support 4Gb speed.
(In reply to comment #50)
> It's quite possible that this is related. Have you been able to try this
> without alua enabled?

Well, it seems the delay is hit only on ALUA setups, as mentioned by Raj in comment #52.

(In reply to comment #51)
> But I thought in your multipath.conf, you disabled the alua hardware handler,
> like you mentioned in comment #41. Does the hardware handler code still get
> called if you have don't enable it in multipath.conf?

No, I don't think so. Though scsi_dh_alua is always loaded during bootup, multipath -ll still lists the hw_handler as '0' (if disabled in multipath.conf). So it does not seem the handler gets called. However, since this is seen with ALUA alone, I suspect the kernel's ALUA state transition handling itself to be at fault here, which may have triggered the delay.

The following messages are seen in /var/log/messages during the controller faults:

kernel: sd 0:0:1:0: timing out command, waited 360s
kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition

This goes back to bug 559586. During NetApp controller faults on ALUA-enabled setups, the target replies with a NOT READY status and an ASC/ASCQ combination of LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION. This SCSI error code is not handled in scsi_error.c, but in the ALUA handler alone. But the ALUA handler itself has issues handling implicit ALUA port transitioning, as mentioned in comment #48 - that's why we disabled it.
So we retried the tests after patching the kernel as follows (the same patch mentioned in the above bug):

diff -urpN kernel-2.6.18/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c kernel-2.6.18-new/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c
--- kernel-2.6.18/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c	2010-07-27 12:46:52.000000000 +0530
+++ kernel-2.6.18-new/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c	2010-07-27 12:56:46.000000000 +0530
@@ -365,7 +365,8 @@ static int scsi_check_sense(struct scsi_
 	 * if the device is in the process of becoming ready, we
 	 * should retry.
 	 */
-	if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
+	if ((sshdr.asc == 0x04) &&
+	    (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
 		return SCSI_MLQUEUE_DIS_DEV_RETRY;
 	/*
 	 * if the device is not started, we need to wake

And initial tests have looked promising so far - IO delays have reduced considerably.

Adding Mike Christie, who was involved in bug 559586.

Filed bug 619361 for the 'transitioning' issue in the SCSI ALUA handler.

So perhaps to address the delayed IO here, we may have to get the above patch mentioned in comment #57 into the RHEL kernel, or even better, get bug 619361 fixed and then start using the SCSI ALUA handler.

(In reply to comment #56)
> (In reply to comment #51)
> > But I thought in your multipath.conf, you disabled the alua hardware handler,
> > like you mentioned in comment #41. Does the hardware handler code still get
> > called if you have don't enable it in multipath.conf?
>
> No, I don't think so. Though the scsi_dh_alua is always loaded during bootup,
> multipath -ll still lists the hw_handler as '0' (if disabled in the
> multipath.conf). So it does not seem the handler gets called.

scsi_dh_alua will not even get loaded if multipath.conf isn't configured to use alua:

    hardware_handler "1 alua"

That said, it may be autoloading for you via the initrd (for use with the mpath root device).
Certainly a potential pitfall for systems that aren't using mpath root: it seems we can't rely on scsi_dh_alua being available.

(In reply to comment #59)
> Filed bug 619361 for the 'transitioning' issue in the SCSI ALUA handler.
> 
> So perhaps to address the delayed IO here, we may have to get the above patch
> mentioned in comment #57 into the RHEL kernel or even better - get bug 619361
> fixed & then start using the SCSI ALUA handler.

Right, I agree with the "even better" approach of fixing bug#619361. The patch from comment#57 is certainly insightful, but handling an ALUA-specific condition in the SCSI midlayer's error handling seems misplaced. But it's always nice to have a known workaround.

(In reply to comment #57)
> The following messages are seen in the /var/log/messages during the controller
> faults:
> 
> kernel: sd 0:0:1:0: timing out command, waited 360s
> kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
> kernel: Add. Sense: Logical unit not accessible, asymmetric access state
> transition

So with scsi_dh_alua loaded, it looks like we retried the IO for 6 minutes but kept getting told that the dev is transitioning?

> And initial tests have looked promising so far - IO delays have reduced
> considerably.

With just your scsi_check_sense patch and not scsi_dh_alua, do you still see

kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
kernel: Add. Sense: Logical unit not accessible, asymmetric access state transition

(i.e. still see the not ready error but no longer see the timing out messages before it)?

MikeS pointed out that scsi_dh_alua uses SCSI_MLQUEUE_IMM_RETRY and your patch uses SCSI_MLQUEUE_DIS_DEV_RETRY. The difference is that SCSI_MLQUEUE_IMM_RETRY retries for the max command runtime of retries * cmd_timeout (see drivers/scsi/scsi_lib.c:scsi_attempt_requeue_command for that timing out command message). And SCSI_MLQUEUE_DIS_DEV_RETRY retries only 5 times.
So if scsi-ml is handling the retries correctly for IMM_RETRY, your patch should only be failing the IO faster.

Created attachment 435525 [details]
Script to start IO and run faults on netapp controllers
Please use this script for starting the tests. There was an issue with an argument for dt in the previous script.
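The retry-policy difference discussed above can be put in rough numbers. A back-of-the-envelope sketch, where the 60-second command timeout and the retry count of 6 are assumed values chosen only to match the "timing out command, waited 360s" message (the real numbers come from the sd driver and scsi-ml at runtime):

```python
# Rough comparison of the two SCSI midlayer retry policies discussed
# above. The timeout and retry count are ASSUMED values for illustration.

CMD_TIMEOUT_S = 60   # assumed per-command timeout
MAX_RETRIES = 6      # assumed retry budget for the command

# SCSI_MLQUEUE_IMM_RETRY: retried until the max command runtime elapses,
# i.e. retries * cmd_timeout.
imm_retry_window_s = MAX_RETRIES * CMD_TIMEOUT_S
print("IMM_RETRY keeps retrying for up to %ds" % imm_retry_window_s)
# With these assumed values the window is 360s, matching the
# "timing out command, waited 360s" message in the logs.

# SCSI_MLQUEUE_DIS_DEV_RETRY: a fixed number of retries (5, per the
# comment above), so the IO is failed sooner instead of stalling.
DIS_DEV_RETRIES = 5
print("DIS_DEV_RETRY gives up after %d retries" % DIS_DEV_RETRIES)
```

This is consistent with the observation that the scsi_check_sense patch (which returns SCSI_MLQUEUE_DIS_DEV_RETRY) should only make the IO fail faster, not retry longer.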
I'm fairly certain I've reproduced this bug. Interestingly, I did so _without_ any heavy IO from dt (in fact I wasn't doing any IO aside from regularly running multipath -ll).

I started by adding scsi_dh_alua debugging to the latest 5.5.z stream kernel; that kernel is available here:
http://people.redhat.com/msnitzer/RPMS/bz606259/kernel-2.6.18-194.11.1.el5.alua_dbg.x86_64.rpm

I then proceeded to do some basic failure tests while connected to the first NetApp controller:
1) fcp stop
2) fcp start

I issued multipath -ll throughout to collect alua_check_sense data. When all paths were back and the sense data and SCSI errors stopped, I proceeded to fail over the 2nd controller to the 1st with:
1) cf takeover
2) cf giveback

During giveback I noticed that multipath -ll stalled just after having printed various checker down messages:

sdaa: checker msg is "directio checker reports path is down"
sdab: checker msg is "directio checker reports path is down"
sdac: checker msg is "directio checker reports path is down"
...

I then saw a tremendous number of NOT_READY/0x04/0x0a messages (from the debugging I added to scsi_dh_alua's alua_check_sense):

sd 1:0:3:4: alua_check_sense: NOT_READY/0x04/0x0a
sd 1:0:3:8: alua_check_sense: NOT_READY/0x04/0x0a
sd 1:0:3:9: alua_check_sense: NOT_READY/0x04/0x0a
...

Total duration of this NOT_READY/0x04/0x0a flood was ~8 minutes:

Jul 30 23:09:39 pnate kernel: sd 1:0:3:0: alua_check_sense: NOT_READY/0x04/0x0a
...
Jul 30 23:17:08 pnate kernel: sd 1:0:3:6: alua_check_sense: NOT_READY/0x04/0x0a

While this was occurring, the NetApp controller did not appear to be happy (note that the NetApp timestamps are exactly an hour behind the Linux host):

na3040a(takeover)> cf giveback
Fri Jul 30 22:09:36 EST [na3040a (takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
na3040a(takeover)> Fri Jul 30 22:09:36 EST [na3040a: cf.fm.givebackStarted:warning]: Cluster monitor: giveback started
Fri Jul 30 22:09:37 EST [na3040b/na3040a: iscsi.service.shutdown:info]: iSCSI service shutdown
Fri Jul 30 22:09:37 EST [na3040b/na3040a: fcp.service.shutdown:info]: FCP service shutdown
Fri Jul 30 22:09:37 EST [na3040a: netif.linkInfo:info]: Ethernet e0c: Link configured down.
Fri Jul 30 22:09:37 EST [na3040a: cf.rsrc.transitTime:notice]: Top Giveback transit times wafl=400 {sync_clean=156, giveback_sync=125, forget=106, finish=12, vol_refs=1, mark_abort=0, wait_offline=0, wait_create=0, abort_scans=0, drain_msgs=0}, wafl_gb_sync=190, raid=129, ndmpd=91, sanown_replay=32, registry_giveback=24, nfsd=11, java=5, vdisk=3, ems=2
Fri Jul 30 22:09:37 EST [na3040a: asup.msg.giveback.delayed:info]: giveback AutoSupport delayed 5 minutes (until after the giveback process is complete).
Fri Jul 30 22:09:37 EST [na3040a: cf.fm.givebackComplete:warning]: Cluster monitor: giveback completed
Fri Jul 30 22:09:37 EST [na3040a: cf.fm.givebackDuration:warning]: Cluster monitor: giveback duration time is 1 seconds
Fri Jul 30 22:09:37 EST [na3040a: cf.fsm.stateTransit:warning]: Cluster monitor: TAKEOVER --> UP
Fri Jul 30 22:09:38 EST [na3040a: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of na3040a by na3040b disabled (unsynchronized log)
Fri Jul 30 22:09:39 EST [na3040a: cf.fm.timeMasterStatus:info]: Acting as cluster time slave
Fri Jul 30 22:09:40 EST [na3040a: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of na3040b disabled (partner booting)
Fri Jul 30 22:09:50 EST [na3040a: cf.partner.short_uptime:warning]: Partner up for 3 seconds only
Fri Jul 30 22:09:51 EST [na3040a: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of na3040b disabled (unsynchronized log)
Fri Jul 30 22:10:50 EST [na3040a: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #5 SEND_DESC_ERROR 2
Fri Jul 30 22:10:50 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:10:54 EST [na3040a: cf.rv.notConnected:error]: Connection for cfo_rv failed
Fri Jul 30 22:11:00 EST [na3040a: cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
Fri Jul 30 22:12:03 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:13:17 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:14:30 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
na3040a>
na3040a>
na3040a>
na3040a> Fri Jul 30 22:16:23 EST last message repeated 2 times
Fri Jul 30 22:17:04 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:17:45 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:18:23 EST [na3040a: asup.post.sent:notice]: Cluster Notification message posted to NetApp: Cluster Notification from na3040a (CLUSTER GIVEBACK COMPLETE) INFO
Fri Jul 30 22:18:26 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:19:08 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:19:47 EST [na3040a: cf.rv.nicReset:error]: Reset cluster interconnect(s) due to unsynchronized log
Fri Jul 30 22:19:47 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous fm timeout
Fri Jul 30 22:19:47 EST [na3040a: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Fri Jul 30 22:19:56 EST [na3040a: cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
Fri Jul 30 22:19:58 EST [na3040a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of na3040b enabled
Fri Jul 30 22:19:58 EST [na3040a: cf.fsm.takeoverByPartnerEnabled:notice]: Cluster monitor: takeover of na3040a by na3040b enabled
Fri Jul 30 22:20:06 EST [na3040a: monitor.globalStatus.ok:info]: The system's global status is normal.

Created attachment 435721 [details]
log that shows netapp commands and output

Full netapp log from when I triggered the faults on the netapp controller that I discussed in comment#64

Created attachment 435722 [details]
gzip'd messages during fcp stop/start
Created attachment 435723 [details]
gzip'd messages during cf takeover/giveback
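The ~8 minute figure quoted in comment #64 can be checked directly from the first and last syslog timestamps of the NOT_READY/0x04/0x0a flood. A small sketch (syslog lines carry no year, so 2010 is assumed purely so the timestamps parse):

```python
from datetime import datetime

# First and last alua_check_sense flood timestamps quoted in comment #64.
first = "Jul 30 23:09:39"
last = "Jul 30 23:17:08"

fmt = "%b %d %H:%M:%S %Y"
t0 = datetime.strptime(first + " 2010", fmt)  # year assumed for parsing only
t1 = datetime.strptime(last + " 2010", fmt)

delta_s = (t1 - t0).total_seconds()
print("flood lasted %d seconds (~%.1f minutes)" % (delta_s, delta_s / 60.0))
# -> flood lasted 449 seconds (~7.5 minutes)
```

That works out to roughly seven and a half minutes, in line with the "~8 minutes" reported, and in the same range as the 7-8 minute giveback recovery discussed below.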
(In reply to comment #64)
> I'm fairly certain I've reproduced this bug. Interestingly, I did so _without_
> any heavy IO from dt (in fact I wasn't doing any IO aside from regularly
> running multipath -ll).

Ben clarified that multipath -ll will hang waiting for all recovering paths. So I clearly need to have active IO (via dt, etc.) to see whether the active paths are still responsive even though the one controller is taking 7-8 minutes to recover (and the giveback to complete).

> I started by adding scsi_dh_alua debugging to the latest 5.5.z stream kernel,
> that kernel is available here:
> http://people.redhat.com/msnitzer/RPMS/bz606259/kernel-2.6.18-194.11.1.el5.alua_dbg.x86_64.rpm

In parallel to my continued testing, it would be great if NetApp could try reproducing with this "2.6.18-194.11.1.el5.alua_dbg" kernel. There will be a flood of scsi_dh_alua debugging, but it should help shed light on whether the kernel is seeing LUNs in the "NOT_READY/0x04/0x0a" state for extended periods and is properly retrying as required.

> Total duration of this NOT_READY/0x04/0x0a flood was ~8 minutes:
> Jul 30 23:09:39 pnate kernel: sd 1:0:3:0: alua_check_sense: NOT_READY/0x04/0x0a
> ...
> Jul 30 23:17:08 pnate kernel: sd 1:0:3:6: alua_check_sense: NOT_READY/0x04/0x0a
> 
> While this was occurring the NetApp controller did not appear to be happy

We'd like to understand whether this is a known issue with the NetApp controller (and the motivation for waiting 10 minutes between takeover and giveback).

(In reply to comment #64)
> During giveback I noticed that multipath -ll stalled just after having printed
> various checker down messages:
> sdaa: checker msg is "directio checker reports path is down"
> sdab: checker msg is "directio checker reports path is down"
> sdac: checker msg is "directio checker reports path is down"

I think the use of the directio path checker could cause problems. IOs like a READ can cause path transitions with implicit failover.
Does using a TUR for the path checker help?

(In reply to comment #69)
> (In reply to comment #64)
> > During giveback I noticed that multipath -ll stalled just after having printed
> > various checker down messages:
> > sdaa: checker msg is "directio checker reports path is down"
> > sdab: checker msg is "directio checker reports path is down"
> > sdac: checker msg is "directio checker reports path is down"
> 
> I think the use of the directio path checker could cause problems. IOs like a
> READ can cause path transitions with implicit failover. Does using a TUR for
> the path checker help?

I'll keep your suggestion in mind, but I can no longer reproduce the long 7-8 minute giveback that resulted in a NOT_READY/0x04/0x0a flood. So I don't know whether or not directio was inducing the netapp controller failures during giveback.

I have been running the takeover/giveback load test script from comment#63 for nearly 30 hours. (I modified the script slightly to a single dt instance that uses 4 dt processes, each doing 5G of IO.) I have not been able to reproduce this delayed IO issue.

We are currently running with the ALUA debug kernel and will update you with the results.

(In reply to comment #69)
> 
> I think the use of the directio path checker could cause problems. IOs like a
> READ can cause path transitions with implicit failover. Does using a TUR for
> the path checker help?

So are you saying there are issues with directio compared to other checkers like tur? We switched to directio because of bug 247443.

(In reply to comment #73)
> (In reply to comment #69)
> >
> > I think the use of the directio path checker could cause problems. IOs like a
> > READ can cause path transitions with implicit failover. Does using a TUR for
> > the path checker help?
> 
> So are you saying there are issues with directio compared to other checkers
> like tur? We switched to directio because of bug 247443.

directio does READ IO, right?
So with implicit failover, if dm-multipath is trying to use the paths in the secondary path group (it is sending READs/WRITEs) but the path checker is sending IO to the primary paths, could the IO to both groups cause each group to transition? So we end up ping-ponging, with each group repeatedly causing the other group to transition? Or will it take X IOs before a transition occurs?

You are correct that the directio checker sends IO to all the paths to verify that they are active. I have no idea if this could cause a ping-pong, but if you can't reproduce this on earlier releases it seems less likely, since there hasn't been any change to how the directio code works. But just to confirm: with implicit ALUA, will you have a problem with IO going to the non-optimal paths?

What are your results from running with the debug kernel?

(In reply to comment #76)
> What are your results from running with the debug kernel?

Martin - any progress based on the comments?

AFAIK NetApp's testing with their latest firmware and the transitioning patch from Bug# 619361 has proven stable. As such, I believe this BZ has been resolved indirectly.

*** This bug has been marked as a duplicate of bug 619361 ***

Sorry to ask on an already closed bug, but I observed similar behaviour with RHEL 5.6 (2.6.18-238.9.1.el5), root on NetApp and 50 LUNs on the same NetApp. Failover is configured via ALUA. While performing a controller takeover, the machine becomes unresponsive for about 10 minutes. Can anybody confirm that this bug has not been seen since the release of 5.6?

Thanks for your help.
Marc.
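For reference, enabling the ALUA hardware handler from multipath.conf looks roughly like this. This is a sketch only: the vendor/product strings and the choice of path checker are illustrative and version-dependent; only hardware_handler "1 alua" is taken from the comments above (the tur checker is the alternative to directio discussed in this thread):

```
devices {
        device {
                vendor           "NETAPP"
                product          "LUN"
                hardware_handler "1 alua"
                path_checker     tur
        }
}
```

With the handler disabled (hw_handler '0' in multipath -ll output), the 0x04/0x0a sense code falls through to the SCSI midlayer, which is where the delayed IO resumption in this bug originates.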
Created attachment 425573 [details]
log messages

Description of problem:
When we have a number of LUNs (51 LUNs * 4 paths each = 204 devices) mapped to the host and IO is running (IO is running on 3 LVs on a volume group containing the 50 multipath devices), we observe that IO resumption takes more than 4 to 5 minutes when there are controller faults. The host is a SAN-booted host. This is observed with both Emulex and QLogic HBAs.

Version-Release number of selected component (if applicable):
device-mapper-event-1.02.39-1.el5_5.2
device-mapper-multipath-0.4.7-34.el5_5.1
device-mapper-1.02.39-1.el5_5.2

How reproducible:
Always

Steps to Reproduce:
1. Map 50 LUNs (25 from each controller head, with 4 paths each) to a RHEL 5.5 host.
2. Scan the devices and create a volume group on the dm-multipath devices.
3. Create LVs on the volume group and start IO.
4. Start controller faults which make some FC paths go down.

Actual results:
IO takes more than 4 to 5 minutes to switch to the remaining available paths to the LUNs. The host becomes unresponsive at this point.

Expected results:
IO should switch to other available paths within a reasonable amount of time, say 60 seconds, and the host should remain responsive.

Additional info:
Attached is a zip file with:
1. sysrq dumps taken while the host is unresponsive
2. the /var/log/messages file