Bug 606259 - [NetApp 5.6 bug] Delayed DM-Multipath IO resumption on RHEL 5.5
Summary: [NetApp 5.6 bug] Delayed DM-Multipath IO resumption on RHEL 5.5
Keywords:
Status: CLOSED DUPLICATE of bug 619361
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Ben Marzinski
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 557597
 
Reported: 2010-06-21 09:10 UTC by Tanvi
Modified: 2011-05-11 11:24 UTC
CC: 26 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-11-30 19:26:06 UTC
Target Upstream Version:
Embargoed:


Attachments
log messages (966.59 KB, application/octet-stream)
2010-06-21 09:10 UTC, Tanvi
sysrq dump (1.13 MB, text/plain)
2010-06-21 09:12 UTC, Tanvi
multipath -ll o/p (18.74 KB, text/plain)
2010-06-22 07:50 UTC, Tanvi
sosreport (1.46 MB, application/octet-stream)
2010-07-02 09:43 UTC, Martin George
dt v16.19 source tar ball (2.89 MB, application/x-gzip)
2010-07-13 13:45 UTC, Martin George
Script to start IO and run faults on netapp controllers (3.53 KB, application/octet-stream)
2010-07-26 10:17 UTC, Rajashekhar M A
FC setup details (74.49 KB, application/pdf)
2010-07-27 07:52 UTC, Rajashekhar M A
Script to start IO and run faults on netapp controllers (3.55 KB, application/octet-stream)
2010-07-30 10:48 UTC, Rajashekhar M A
log that shows netapp commands and output (9.63 KB, text/plain)
2010-07-31 04:22 UTC, Mike Snitzer
gzip'd messages during fcp stop/start (59.02 KB, application/x-gzip)
2010-07-31 04:24 UTC, Mike Snitzer
gzip'd messages during cf takeover/giveback (50.15 KB, application/x-gzip)
2010-07-31 04:25 UTC, Mike Snitzer

Description Tanvi 2010-06-21 09:10:50 UTC
Created attachment 425573 [details]
log messages

Description of problem:

With a number of LUNs mapped to the host (51 LUNs * 4 paths each = 204 devices) and IO running on 3 LVs in a volume group containing the 50 multipath devices, we observe that IO resumption takes 4 to 5 minutes or more when there are controller faults. The host is SAN-booted. This is observed with both Emulex and QLogic HBAs.


Version-Release number of selected component (if applicable):

device-mapper-event-1.02.39-1.el5_5.2
device-mapper-multipath-0.4.7-34.el5_5.1
device-mapper-1.02.39-1.el5_5.2


How reproducible:
Always


Steps to Reproduce:
1. Map 50 LUNs (25 from each controller head, with 4 paths each) to a RHEL 5.5 host.
2. Scan the devices and create a volume group on the dm-multipath devices.
3. Create LVs on the volume group and start IO.
4. Start controller faults that take some FC paths down.


Actual results:

IO takes 4 to 5 minutes or more to switch to the remaining available paths to the LUNs. The host becomes unresponsive during this period.

Expected results:

IO should switch to other available paths within a reasonable amount of time, say 60 seconds, and the host should remain responsive.

Additional info:

Attached is a zip file with:

1. sysrq dumps taken while the host was unresponsive
2. the /var/log/messages file

Comment 1 Tanvi 2010-06-21 09:12:12 UTC
Created attachment 425574 [details]
sysrq dump

Comment 2 Ben Marzinski 2010-06-21 17:54:15 UTC
Can you also please post your /etc/multipath.conf and the output of

# multipath -ll

Comment 3 Tanvi 2010-06-22 07:49:56 UTC
Following is the multipath.conf file.

defaults {
        user_friendly_names  no
        max_fds              max
        queue_without_daemon no
        #verbosity 4
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd|vd)[a-z]*"
        devnode "^hd[a-z]"
        devnode " ^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode " ^cciss!c[0-9]d[0-9]*"
        #_#_ BEGIN multipath.conf AUTOMATION ##
        #_#_ END multipath.conf AUTOMATION ##
}

devices {
        device {
                #_#_ BEGIN multipath.conf AUTOMATION ##
                vendor "NETAPP"
                product "LUN"
                path_grouping_policy group_by_prio
                features "1 queue_if_no_path"
                prio_callout "/sbin/mpath_prio_alua /dev/%n"
                path_checker directio
                failback immediate
                hardware_handler "1 alua"
                rr_weight uniform
                rr_min_io 128
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                flush_on_last_del yes
                #_#_ END multipath.conf AUTOMATION ##
        }
}

Comment 4 Tanvi 2010-06-22 07:50:40 UTC
Created attachment 425848 [details]
multipath -ll o/p

Comment 6 Ben Marzinski 2010-06-25 15:31:41 UTC
This is quite possibly the same bug as bug 601665. There is a z-stream fix for that already in QA.  I'm actually filing another z-stream errata today, for 607911.  When I have that package built, I'll put it up on my people page, and they can test it to see if they can still reproduce the error.

Comment 7 Ben Marzinski 2010-06-25 15:53:58 UTC
I take that back.  This is clearly a different bug.

From the logs, I can see

Jun 16 13:20:59 IBMx346-200-114 multipathd: checker failed path 66:160 in map 360a98000572d4c736d34577057615458 
Jun 16 13:20:59 IBMx346-200-114 kernel: sd 0:0:0:3: SCSI error: return code = 0x00010000
Jun 16 13:25:42 IBMx346-200-114 multipathd: 360a98000572d4c736d34577057615458: remaining active paths: 3 

The only thing separating these two multipathd messages is an ioctl to tell the kernel to fail the device.

Which is right where we get stuck, according to the sysrq dump:

 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8007675b>] smp_call_function_many+0x38/0x4c
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8007684c>] smp_call_function+0x4e/0x5e
 [<ffffffff80076530>] do_flush_tlb_all+0x0/0x6a
 [<ffffffff8819d4ad>] :dm_mod:dev_wait+0x0/0x83
 [<ffffffff800947db>] on_each_cpu+0x10/0x22
 [<ffffffff800d0f31>] __remove_vm_area+0x2b/0x42
 [<ffffffff800d0f60>] remove_vm_area+0x18/0x25
 [<ffffffff800d0fb4>] __vunmap+0x47/0xed
 [<ffffffff8819deff>] :dm_mod:ctl_ioctl+0x237/0x25b
 [<ffffffff80042175>] do_ioctl+0x55/0x6b
 [<ffffffff8003018e>] vfs_ioctl+0x457/0x4b9
 [<ffffffff8004c870>] sys_ioctl+0x59/0x78
 [<ffffffff8005d116>] system_call+0x7e/0x83


What kernel version are you running?

Comment 8 Tanvi 2010-06-28 07:25:53 UTC
Kernel version is 2.6.18-194.3.1.el5.

Comment 9 Ben Marzinski 2010-06-30 06:29:25 UTC
Actually, the above kernel backtrace is from the waitevent loop, which shouldn't be holding any locks that keep the checkerloop from continuing.

The checkerloop's backtrace is:

multipathd    S ffff8100026cd0c0     0 11239      1         11240 11238 (NOTLB)
 ffff810062893e48 0000000000000082 0000000000000000 0000000000000000
 0000000000000220 0000000000000001 ffff8100657810c0 ffff8100026cd0c0
 000001082878fc2c 0000000000027192 ffff8100657812a8 000000037e693000
Call Trace:
 [<ffffffff8006e1db>] do_gettimeofday+0x40/0x90
 [<ffffffff8005aa13>] getnstimeofday+0x10/0x28
 [<ffffffff800637ea>] io_schedule+0x3f/0x67
 [<ffffffff800efeb7>] sys_io_getevents+0x28c/0x350
 [<ffffffff800ef3c6>] timeout_func+0x0/0x10


Which is odd, since it isn't where I'd expect it to be. How locked up does this machine get when it locks up? And how many CPUs are in this machine?

Comment 10 Martin George 2010-06-30 12:43:49 UTC
(In reply to comment #9)
>  
> Which is odd, since it isn't where I'd expect it to be. How locked up does this
> machine get when it gets locked up? and how many CPUs are in this machine?    

The machine gets locked up for around 4 to 5 minutes, during which it displays soft lockup messages on the console. The cpuinfo is as follows:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.318
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6400.63
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.318
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6399.42
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.318
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6399.39
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
cpu MHz         : 3200.318
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr lahf_lm
bogomips        : 6399.36
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Comment 11 Andrius Benokraitis 2010-07-01 01:43:15 UTC
Update: This issue is blocking NetApp's certification/qualification of RHEL 5.5 and all customer upgrades... correct?

Comment 12 Martin George 2010-07-01 04:40:44 UTC
(In reply to comment #11)
> Update: This issue is blocking NetApp's certification/qualification of RHEL 5.5
> and all customer upgrades... correct?    

Correct.

Comment 13 Ben Marzinski 2010-07-01 05:14:45 UTC
So, is this a regression?  Are you able to run this same test on a RHEL 5.4 system without reproducing the error?  If so, it might be useful to try to isolate the error.  If you have a RHEL 5.4 system that can't reproduce this error, it would be a big help if you were able to install the RHEL 5.5 device-mapper and device-mapper-multipath packages on it, and test to see if it reproduces with the 5.5 userspace code and the 5.4 kernel code.

Assuming that it doesn't, it would be nice to step through various kernel builds to figure out when the problem was introduced.

Are you able to do this?

Comment 14 Martin George 2010-07-01 06:38:23 UTC
(In reply to comment #13)
> So, is this a regression?  Are you able to run this same test on a RHEL 5.4
> system, without reproducing the error. If so, it might be useful to try to
> isolate the error.  If you have a RHEL 5.4 system that can't reproduce this
> error, it would be a big help if you were able to install the RHEL 5.5
> device-mapper and device-mapper-multipath packages on it, and test to see if it
> reproduces with the 5.5 userspace code, and the 5.4 kernel code.
> 
> Assuming that it doesn't, it would be nice to step through some various kernel
> builds to figure out when the problem was introduced.
> 
> Are you able to do this?    

It does look like a regression. Anyway, we'll try this on RHEL 5.4 as you suggested and update you with the results.

Meanwhile, do you want any more logs/dumps for analyzing this?

Comment 15 Ben Marzinski 2010-07-01 07:31:33 UTC
The problem is that from what I can tell, the soft lockup trace you're seeing is just a symptom of the problem.  At the end of the waiteventloop's DM_DEVICE_WAITEVENT ioctl, it tries to free the params. To do this it needs to tell all the CPUs to flush the pages, but apparently at least one of the CPUs is busy. Unfortunately, I don't see any indication of what that CPU is stuck on, and I don't know a good way to find out while the machine is locked up.

Comment 17 Ben Marzinski 2010-07-01 16:19:12 UTC
Actually, what would be helpful is an sosreport from a setup where you can reproduce this.  This looks like a kernel issue, but I can't say where in the kernel the problem is, so it would be helpful to see what modules are installed and being used.

Comment 18 Martin George 2010-07-02 09:43:24 UTC
Created attachment 428791 [details]
sosreport

Comment 19 Ben Marzinski 2010-07-06 19:46:10 UTC
Have you been able to try this with a 5.4 kernel and 5.5 userspace packages?

Comment 20 Martin George 2010-07-07 11:42:05 UTC
Yes, the same IO tests ran successfully on the following configs:

1) RHEL 5.4 with native packages:
   kernel-2.6.18-164.el5
   device-mapper-1.02.32-1.el5
   device-mapper-multipath-0.4.7-30.el5


2) RHEL 5.4 with the RHEL 5.5 device-mapper packages:
   kernel-2.6.18-164.el5
   device-mapper-1.02.39-1.el5
   device-mapper-multipath-0.4.7-34.el5

Comment 21 Ben Marzinski 2010-07-07 17:49:43 UTC
Just to verify that this is really a kernel bug, and not some other package, it would be nice if you could either load a RHEL 5.5 kernel onto that RHEL 5.4 machine and see if you can reproduce it, or load a RHEL 5.4 kernel onto a RHEL 5.5 machine and see if it goes away.

I'm going to try to set up a machine in-house that can reproduce this.  The debugging information I have so far doesn't give me much of a clue as to the root cause of the hang, but if I had to guess, I would guess that it's in the SCSI layer.

If you reproduce this again, can you run sysrq-t and sysrq-w when the machine is unresponsive?  I know that you already gave me this information once, and nothing looked amiss to me, but there seemed to be a number of missing processes in the sysrq-t output, and I'm hoping that the sysrq-w output might show something more useful this time.

Otherwise, would you be willing to try some different kernel packages, to see if we can narrow down where the regression was introduced?

If I can get a working reproducer, I'll be trying that here.
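(For reference - if the console is unusable while the host is hung, the same dumps can usually be triggered from a surviving shell via /proc. A minimal sketch, assuming sysrq is available in the running kernel:)

# echo 1 > /proc/sys/kernel/sysrq   # make sure sysrq is enabled
# echo t > /proc/sysrq-trigger      # sysrq-t: dump all task backtraces
# echo w > /proc/sysrq-trigger      # sysrq-w: dump blocked tasks
# dmesg > /tmp/sysrq.out            # capture the output from the kernel ring buffer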

Comment 22 Ben Marzinski 2010-07-07 17:59:33 UTC
Another possibility that might help would be installing the kernel-debug rpm, and trying to reproduce this with the debug kernel, to see if we get any more useful information.

Comment 23 Barry Donahue 2010-07-07 18:12:09 UTC
(In reply to comment #22)
> Another possibility that might help would be installing the kernel-debug rpm,
> and trying to reproduce this with the debug kernel, to see if we get any more
> useful information.    

   Tom would like me to try to set up a system in-house for you to use. What would be the requirements for the system - number of CPUs, backend storage config, etc.? I should be able to cobble together something in Westford.

Comment 24 Ben Marzinski 2010-07-07 19:17:32 UTC
(In reply to comment #23)
>    Tom would like me to try to set up a system in house for you to use. What
> would be the requirements for the system? #CPU's, backend storage config etc. I
> should be able to cobble together something in Westford.    

It's a 4-CPU x86_64 machine with 2 FC ports (either Emulex or QLogic HBAs should work) connected to a 2-controller NetApp box. The node is set up with multipathed root and boot filesystems, using devices on the NetApp box.  Other than that, I just need a couple of other LUNs on the NetApp box for multipath to work with. It should be installed with the latest RHEL 5.5.z code.

If you want more specifics, you should be able to get all the information you need from the sosreport.

Comment 25 Ben Marzinski 2010-07-08 03:42:47 UTC
How do you perform the controller faults in your tests?  I have a setup that is reasonably similar, and I'm going to try to reproduce this error, so any information on how to reproduce it would be helpful.

Comment 26 Martin George 2010-07-08 14:12:03 UTC
(In reply to comment #25)
> How do you perform the controller faults in your tests.  I have a setup that is
> reasonably similar, and I'm going to try and reproduce this error, so any
> information on how to reproduce it would be helpful.    

We use root on dm-multipath hosts (i.e. SAN-booted) for all our tests. Run heavy IO on NetApp dm-multipathed data LUNs mapped to the RHEL host connected to the clustered NetApp controllers. For controller faults:

1) Run 'cf takeover -f' on one controller head. This takes over the partner head. Wait 10 minutes for the takeover to complete.

2) After those 10 minutes, run 'cf giveback -f' on the same controller head. This relinquishes control back to the partner head. Wait 10 minutes for the giveback to complete.

3) Now repeat the same steps on the partner head.

4) Run the above in a loop (a sketch of such a loop is below).

We also include FC switch port block/unblock for the corresponding initiators/targets as part of our tests.

For all these cases, verify that IO fails over/fails back properly within the stipulated limits - in our case, we set the limit to 120 seconds, i.e. any failover/failback that exceeds 120 seconds is a failed test case for us.
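(A minimal sketch of that loop as a shell script. The head names and the ssh transport are assumptions - substitute the real filer names, or issue the same commands on the filer consoles:)

#!/bin/bash
# Hypothetical NetApp head names - replace with the real ones.
HEAD_A=filer-a
HEAD_B=filer-b

while true; do
    for head in $HEAD_A $HEAD_B; do
        ssh "$head" "cf takeover -f"   # this head takes over its partner
        sleep 600                      # wait 10 minutes for the takeover to complete
        ssh "$head" "cf giveback -f"   # relinquish control back to the partner
        sleep 600                      # wait 10 minutes for the giveback to complete
    done
done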

Comment 27 Martin George 2010-07-08 14:14:56 UTC
(In reply to comment #21)
> Just to verify that this is really a kernel bug, and not some other package, it
> would be nice if you could either load a RHEL 5.5 kernel onto that RHEL 5.4
> machine and see if you can reproduce it, or load a RHEL 5.4 kernel onto a RHEL
> 5.5 machine and see if it goes away.
> 
> If you reproduce this again, can you run sysrq-t and sysrq-w when the machine
> is unresponsive again.  I know that you already gave me this information once,
> and nothing looked amiss to me, but there seemed to be a number of missing
> processes in the sysrq-t output, and I'm hoping that the sysrq-w output might
> show something more useful this time.

Will do.

> 
> Otherwise, would you be willing to try some different kernel packages, to see
> if we can narrow down where the regression was introduced?
> 

Send us the kernel packages and we'll try those as well.

Comment 28 Ben Marzinski 2010-07-09 14:05:34 UTC
I tried to reproduce this all day yesterday and overnight on a similar setup, using the instructions from comment 26, without success.

How long does this usually take to reproduce?

Does it usually reproduce when you're disabling the multipath paths by downing the switch ports, or by performing controller faults?  My test just did the controller takeover and giveback.

Comment 29 Ben Marzinski 2010-07-09 16:42:55 UTC
I uploaded a bunch of kernels to

http://people.redhat.com/bmarzins/kernels/

kernel-2.6.18-164.el5.x86_64.rpm is the RHEL 5.4 kernel

kernel-2.6.18-194.3.1.el5.x86_64.rpm is the RHEL 5.5 kernel that you were using

kernel-debug-2.6.18-194.3.1.el5.x86_64.rpm is the debug version of that kernel.

kernel-2.6.18-171.el5.x86_64.rpm, kernel-2.6.18-179.el5.x86_64.rpm, and kernel-2.6.18-187.el5.x86_64.rpm are some other kernels to try.

First off it would be great if you could try the 5.5 kernel on a 5.4 machine, or a 5.4 kernel on a 5.5 machine, so we can verify that it really is the kernel that makes the difference.

Comment 30 Tanvi 2010-07-12 11:46:51 UTC
(In reply to comment #28)
> I tried to reproduce this all yesterday and overnight on a similar setup, using
> the instructions from comment 26, without success.
> 
> How long does this usually take to reproduce?
> 
> Does it usually reproduce when you're disabling the multipath paths by downing
> the switch ports, or by performing controller faults?  My test just did the
> controller takeover and giveback.    

We generally hit it within 4-5 iterations of the controller fault (takeover/giveback).
The host becomes unresponsive for 4-5 minutes only when it is performing IO.

Comment 31 Ritesh Raj Sarraf 2010-07-12 12:41:39 UTC
While the mix-and-match tests are running, I'd like to add my findings.

Ben,

Do you think this could be a file system issue?
As per our internal tests, the bug is triggered within hours when file systems are involved.

I ran some tests and noticed that during IO a sync can take a very long time. This is seen especially when doing buffered IO; sync stalls for 20-30 minutes. This behavior is seen on the 2.6.35-rc kernels.

Please see DaveC's commit on the same:
http://lkml.org/lkml/2010/4/19/410

Comment 32 Ben Marzinski 2010-07-12 15:26:52 UTC
That may well be. My testing still hasn't hit anything all weekend, but aside from the root filesystem, all my IO was straight to the multipath devices.  I'll go look at the sosreport and set up my devices like that.

Also, what are you using to generate the IO?

Comment 33 Ritesh Raj Sarraf 2010-07-12 17:37:30 UTC
To generate the IO, we use "dt". I believe this tool has been shared with Red Hat. 
While doing IO, we specifically disable fsync in dt with the "disable=fsync" switch. dt has the intelligence to detect IO delays. In our tests, the number of seconds is set to 120 after which dt warns/bails out if IO does not take place.

I also used fio to run some tests. When using buffered IO, if you initiate a sync, it goes into a very long wait as explained in comment #31

The settings used for fio are:
[global]
ioengine=libaio
buffered=1
rw=randwrite
bs=256k
size=2048m
directory=/tmp

[file1]
iodepth=4

[file2]
iodepth=32

[file3]
iodepth=8

[file4]
iodepth=16

I think the bug triggers faster if (size x 4) is greater than the amount of memory on the test machine.


You can use either tool. Hopefully it should help you reproduce the bug.
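(For reference, a job file like the above can be saved and fed straight to fio - the filename is hypothetical - and the sync stall can be timed directly:)

# fio randwrite.fio
# time sync    # with buffered IO in flight, this is where the long stall shows up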

Comment 34 Ben Marzinski 2010-07-12 18:18:57 UTC
Is the dt program you are using the same as the Data Test program available at
http://www.scsifaq.org/RMiller_Tools/dt.html ?

You are running dt on files on multipathed filesystems, correct?

What are the parameters you are running dt with?

Comment 35 Ben Marzinski 2010-07-12 19:00:16 UTC
I can't see any options in the dt program from http://www.scsifaq.org/RMiller_Tools/dt.html to detect IO delays.  Am I missing something, or do you use another program?  If so, can someone please point me at where I can get it?

Comment 36 Barry Donahue 2010-07-12 19:33:04 UTC
I believe the author of dt now works for NetApp. They probably have a newer version of the program.

Comment 38 Martin George 2010-07-13 11:49:39 UTC
(In reply to comment #35)
> I can't see any options to enable the dt program from
> http://www.scsifaq.org/RMiller_Tools/dt.html to detect IO delays.  Am I missing
> something, or do you use another program.  If so, can someone please point me
> at where I can get it?    

The dt utility mentioned on Robin Miller's web page seems quite old - it does not have the options to detect IO delays. I'll send an updated dt version.

Comment 39 Martin George 2010-07-13 12:00:13 UTC
(In reply to comment #32)
> That may well be. My testing still hasn't hit anything all weekend, but aside
> from the root filesystem, all my IO was straight to the multipath devices. 
> I'll go look at the sosreport, and setup my devices like that.
> 

Won't the 'directio' path checker use fs requests to check a path, unlike other path checkers like 'tur'?

Comment 40 Martin George 2010-07-13 13:45:04 UTC
Created attachment 431466 [details]
dt v16.19 source tar ball

Untar the above tarball, and then compile the dt utility as follows:

# pwd
/root/tmp/dt.d

# cd linux2.6-x86/

# make -f ../Makefile.linux VPATH=..
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dt.o ../dt.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtaio.o ../dtaio.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtfifo.o ../dtfifo.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtgen.o ../dtgen.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtinfo.o ../dtinfo.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtmmap.o ../dtmmap.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtprocs.o ../dtprocs.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtread.o ../dtread.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtwrite.o ../dtwrite.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtstats.o ../dtstats.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dttape.o ../dttape.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dttty.o ../dttty.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtutil.o ../dtutil.c
cc -O3 -DAIO -DFIFO -DMMAP -D__linux__ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64   -c -o dtusage.o ../dtusage.c
loading dt ...done

# ./dt version
    --> Date: April 3rd, 2009, Version: 16.19, Author: Robin T. Miller <--

Comment 41 Martin George 2010-07-13 14:20:49 UTC
Controller config: Use the latest Data ONTAP v7.3.3 available on the NOW site on the NetApp controllers.

Host config: In addition to the root LUN mapped from any one of the NetApp controller heads, map 10 data LUNs (5 from each head) of 5G each to the RHEL host. Ensure ALUA is enabled on the corresponding host igroup on each controller head by running "igroup set -f <igroup_name> alua yes".

On the RHEL 5.5 root-on-multipath host, configure dm-multipath on these 10 data LUNs and then create a single striped LV (of say 45G) across all these multipath devices (total size 50G). Mount an ext3 fs on this device (see the sketch below).

Note - I suspect there are limitations in the ALUA hardware handler's handling of port transitioning - so please turn it off in multipath.conf, i.e. use the default setting of 0 for the hardware handler. Remaining settings as per comment #3.

Assuming the striped LV is 45G, run dt on this volume as follows:

./dt of=/mnt/tmp/TestFile iotype=random limit=10g runtime=24h enable=compare oncerr=abort errors=1 procs=4 disable=fsync oflags=trunc incr=var min=2k max=64k dispose=keep pattern=iot enable=debug enable=syslog enable=noprog noprogt=120s noprogtt=120s alarm=3s log=/var/log/dt_log disable=pstats

This ensures 4 processes pump IO (10G each) to the LV device. After the dt IO processes complete, check the dt logs for the results (/var/log/dt_log, as given above).

During IO on the host, run forced takeover/givebacks (in a loop) on the NetApp controller heads.
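(A minimal sketch of the host-side LVM setup described above - the multipath WWIDs, VG/LV names, and stripe size are placeholders:)

# pvcreate /dev/mapper/<wwid_0> ... /dev/mapper/<wwid_9>    # the 10 data LUNs
# vgcreate netapp_vg /dev/mapper/<wwid_0> ... /dev/mapper/<wwid_9>
# lvcreate -i 10 -I 64 -L 45G -n stripe_lv netapp_vg        # stripe across all 10 PVs
# mkfs.ext3 /dev/netapp_vg/stripe_lv
# mkdir -p /mnt/tmp
# mount /dev/netapp_vg/stripe_lv /mnt/tmp                   # dt then writes to /mnt/tmp/TestFile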

Comment 42 Ben Marzinski 2010-07-13 17:49:40 UTC
(In reply to comment #39)
> (In reply to comment #32)
> > That may well be. My testing still hasn't hit anything all weekend, but aside
> > from the root filesystem, all my IO was straight to the multipath devices. 
> > I'll go look at the sosreport, and setup my devices like that.
> > 
> 
> Won't the 'directio' path checker use fs requests to check a path, unlike other
> path checkers like 'tur'?    

Yes, it does regular IO, but it does it to the block device, not to a filesystem on top of the block device.  Also, I was only thinking about the IO I was running to put the machine under load.

However, I haven't had any luck reproducing it with my own IO loads using a filesystem setup similar to your sosreport's.  I'm going to try with dt now.

Comment 43 Ritesh Raj Sarraf 2010-07-15 14:23:48 UTC
Ben,

There's this other bug that you own, https://bugzilla.redhat.com/show_bug.cgi?id=606801, with similar behavior.

We have run some tests (both takeover/giveback and port offline) with the file system taken out of the stack, and the problem has not been seen yet. They have been running successfully for 32 hrs.

The tests are: running dt on top of multipathed devices and triggering faults on the NetApp target.

Will let you know how the full 48-hr cycle goes.

Comment 44 Ritesh Raj Sarraf 2010-07-15 14:24:47 UTC
And this test is on iSCSI.

(In reply to comment #43)
> Ben,
> 
> There's this other bug that you own,
> https://bugzilla.redhat.com/show_bug.cgi?id=606801, with similar behavior.
> 
> We have run some tests (both take/give and port offline) cutting down the file
> system and the problem is not seen yet. Been running successfully for 32 hrs.
> 
> The tests are; running dt on top of multipathed devices and triggering faults
> on the NetApp target.
> 
> Will let you know how the full 48hrs cycle goes.

Comment 45 Ben Marzinski 2010-07-16 01:43:18 UTC
I have a system set up pretty much identical to the specifications in comment 41, and I've been running the tests on it for a while now.  I still cannot reproduce this issue at all.  Have you been able to try some of the other kernels?

Comment 46 Ritesh Raj Sarraf 2010-07-16 15:39:55 UTC
(In reply to comment #43)
> The tests are; running dt on top of multipathed devices and triggering faults
> on the NetApp target.
> 
> Will let you know how the full 48hrs cycle goes.    

The 48-hr test completed successfully and no delayed IO was seen (with the setup as described in comment #43).

Comment 47 Martin George 2010-07-20 10:43:55 UTC
(In reply to comment #46)

> The 48hrs tests completed successfully and there was no delayed IO seen (the
> setup as described in Comment #43)    

Actually, the above 48-hour test that completed successfully was on an iSCSI setup with the fs disabled.

Comment 48 Martin George 2010-07-20 11:15:39 UTC
Ben,

We've hit the delayed IO consistently on an FC setup with the ALUA handler enabled and with an fs on the multipathed devices. But we have yet to verify whether this is seen without the handler or without the fs.

We support implicit ALUA alone with the following valid states:

sd 0:0:1:0: alua: port group 00 state A supports ToUsNA

But the ALUA handler has the following code snippet:

if (h->tpgs & TPGS_MODE_EXPLICIT) {
        switch (h->state) {
        case TPGS_STATE_TRANSITIONING:
                /* State transition, retry */
                goto retry;
                break;
        case TPGS_STATE_OFFLINE:
                /* Path is offline, fail */
                err = SCSI_DH_DEV_OFFLINED;
                break;
        default:
                break;
        }
} else {
        /* Only Implicit ALUA support */
        if (h->state == TPGS_STATE_OPTIMIZED ||
            h->state == TPGS_STATE_NONOPTIMIZED ||
            h->state == TPGS_STATE_STANDBY)
                /* Useable path if active */
                err = SCSI_DH_OK;
        else
                /* Path unuseable for unavailable/offline */
                err = SCSI_DH_DEV_OFFLINED;
}

During controller faults, the LUN is in the 'transitioning' state. But from the above code, it seems this is handled for explicit ALUA alone, i.e. the ALUA handler does not treat 'transitioning' properly for implicit ALUA.

Do you think this could have contributed to the delayed IO?

Comment 50 Ben Marzinski 2010-07-23 19:17:05 UTC
(In reply to comment #48)
> During controller faults, the lun is in 'transitioning' state. But from the
> above code, it seems this is handled for explicit ALUA alone i.e. the ALUA
> handler does not treat 'transitioning' properly.
> 
> Do you think this could have contributed to the delayed IO?    


It's quite possible that this is related. Have you been able to try this without alua enabled?

Comment 51 Ben Marzinski 2010-07-23 20:17:46 UTC
But I thought that in your multipath.conf you disabled the alua hardware handler, like you mentioned in comment #41.  Does the hardware handler code still get called if you don't enable it in multipath.conf?

Comment 52 Rajashekhar M A 2010-07-26 10:17:32 UTC
Created attachment 434375 [details]
Script to start IO and run faults on netapp controllers

Ben,

We haven't hit the issue without ALUA enabled.

We are hitting this issue with 8G initiators (QLogic and Emulex) with ALUA enabled on both host and controllers and hardware_handler set to "0". Here are the steps to recreate it:

1. Map 40 LUNs, 20 from each head.
2. Create a VG on top of all 40 LUNs.
3. Create 3 LVs striped across all 40 LUNs and mount them.
4. Copy the dt tool to /sbin/dt and provide exec permission.
5. Copy the attached script to the host at /root/fault.pl
6. Start the script and enter the details when asked.
7. Verify that the IO is happening on the devices.

The script will log the status and dt logs to the /root/io_log/ directory. You can monitor the script status through the /root/io_log/Status file. The Status file will have logs about IO delays if the issue gets reproduced. Once the issue is seen, please kill the script.

Comment 53 Andrius Benokraitis 2010-07-26 16:11:25 UTC
The 3040 we have in Westford has 2Gb adapters in it currently - would that be sufficient to test this on the target side?

Comment 54 Rajashekhar M A 2010-07-27 07:52:00 UTC
Created attachment 434617 [details]
FC setup details

We have seen this issue with both 4Gb and 8Gb HBAs on the controllers. We have not yet tested with 2Gb, but you can certainly give it a try. Attached is a doc with the FC setup details.

The FAS3040 controllers may have on-board 4Gb adapters. Please run the command "sysconfig -av 0" to check whether the on-board cards support 4Gb speed.

Comment 55 Martin George 2010-07-27 14:46:08 UTC
(In reply to comment #50)
> It's quite possible that this is related. Have you been able to try this
> without alua enabled?    

Well, it seems the delay is hit only in ALUA setups, as mentioned by Raj in comment #52.

Comment 56 Martin George 2010-07-27 14:50:56 UTC
(In reply to comment #51)
> But I thought in your multipath.conf, you disabled the alua hardware handler,
> like you mentioned in comment #41.  Does the hardware handler code still get
> called if you have don't enable it in multipath.conf?    

No, I don't think so. Though scsi_dh_alua is always loaded during bootup, multipath -ll still lists the hw_handler as '0' (if disabled in multipath.conf). So it does not seem the handler gets called.

However, since this is seen with ALUA alone, I suspect the kernel's ALUA state transition handling itself is at fault here and may have triggered the delay.

Comment 57 Martin George 2010-07-28 14:43:49 UTC
The following messages are seen in /var/log/messages during the controller faults:

kernel: sd 0:0:1:0: timing out command, waited 360s
kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
kernel:     Add. Sense: Logical unit not accessible, asymmetric access state transition

This goes back to bug 559586. During NetApp controller faults on ALUA-enabled setups, the target replies with a NOT READY status and an ASC/ASCQ combination of LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION. This SCSI error code is not handled in scsi_error.c, but in the ALUA handler alone. And the ALUA handler itself has issues handling implicit ALUA port transitioning, as mentioned in comment #48 - that's why we disabled it.

So we retried the tests after patching the kernel as follows (same patch mentioned in the above bug):

diff -urpN kernel-2.6.18/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c kernel-2.6.18-new/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c
--- kernel-2.6.18/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c 2010-07-27 12:46:52.000000000 +0530
+++ kernel-2.6.18-new/linux-2.6.18.x86_64/drivers/scsi/scsi_error.c     2010-07-27 12:56:46.000000000 +0530
@@ -365,7 +365,8 @@ static int scsi_check_sense(struct scsi_
                 * if the device is in the process of becoming ready, we
                 * should retry.
                 */
-               if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
+               if ((sshdr.asc == 0x04) &&
+                   (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
                        return SCSI_MLQUEUE_DIS_DEV_RETRY;
                /*
                 * if the device is not started, we need to wake

And initial tests have looked promising so far - IO delays have been reduced considerably.

Comment 58 Martin George 2010-07-29 12:14:23 UTC
Adding Mike Christie who was involved in bug 559586.

Comment 59 Martin George 2010-07-29 12:19:05 UTC
Filed bug 619361 for the 'transitioning' issue in the SCSI ALUA handler. 

So perhaps to address the delayed IO here, we may have to get the patch mentioned in comment #57 into the RHEL kernel - or, even better, get bug 619361 fixed and then start using the SCSI ALUA handler.

Comment 60 Mike Snitzer 2010-07-29 19:03:06 UTC
(In reply to comment #56)
> (In reply to comment #51)
> > But I thought in your multipath.conf, you disabled the alua hardware handler,
> > like you mentioned in comment #41.  Does the hardware handler code still get
> > called if you have don't enable it in multipath.conf?    
> 
> No, I don't think so. Though the scsi_dh_alua is always loaded during bootup,
> multipath -ll still lists the hw_handler as '0' (if disabled in the
> multipath.conf). So it does not seem the handler gets called.

scsi_dh_alua will not even get loaded if multipath.conf isn't configured to use alua: hardware_handler "1 alua"

That said, it may be autoloading for you via the initrd (for use with mpath root device).  Certainly a potential pitfall for systems that aren't using mpath root: seems we can't rely on scsi_dh_alua being available.
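(A quick way to check both halves of that on a given host - whether the module is loaded, and whether multipath actually attached it to the maps:)

# lsmod | grep scsi_dh_alua         # is the handler module loaded?
# multipath -ll | grep hwhandler    # attached or not: 'hwhandler=1 alua' vs 'hwhandler=0'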

Comment 61 Mike Snitzer 2010-07-29 20:38:56 UTC
(In reply to comment #59)
> Filed bug 619361 for the 'transitioning' issue in the SCSI ALUA handler. 
> 
> So perhaps to address the delayed IO here, we may have to get the above patch
> mentioned in comment #57 into the RHEL kernel or even better - get bug 619361
> fixed & then start using the SCSI ALUA handler.    

Right, I agree with the "even better" approach of fixing bug#619361.

The patch from comment #57 is certainly insightful, but handling an ALUA-specific condition in the SCSI midlayer's error handling seems misplaced.

But it's always nice to have a known workaround.

Comment 62 Mike Christie 2010-07-29 22:29:44 UTC
(In reply to comment #57)
> The following messages are seen in the /var/log/messages during the controller
> faults:
> 
> kernel: sd 0:0:1:0: timing out command, waited 360s
> kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
> kernel:     Add. Sense: Logical unit not accessible, asymmetric access state
> transition

So with scsi_dh_alua loaded, it looks like we retried the IO for 6 minutes but kept getting told that the dev is transitioning?

> 
> And initial tests have looked promising so far - IO delays have reduced
> considerably.    

With just your scsi_check_sense patch and not scsi_dh_alua do you still see
kernel: sd 0:0:1:0: Device not ready: <6>: Current: sense key: Not Ready
kernel:     Add. Sense: Logical unit not accessible, asymmetric access state
transition

(still see the not ready error but no longer see the timing out messages before it)


MikeS pointed out that scsi_dh_alua uses SCSI_MLQUEUE_IMM_RETRY and your patch uses SCSI_MLQUEUE_DIS_DEV_RETRY.

The difference is that SCSI_MLQUEUE_IMM_RETRY retries for the max command runtime of retries * cmd_timeout (see drivers/scsi/scsi_lib.c:scsi_attempt_requeue_command for that timing-out-command message), while SCSI_MLQUEUE_DIS_DEV_RETRY retries only 5 times. So if scsi-ml is handling the retries right for IMM_RETRY, your patch should only be failing the IO faster.

Comment 63 Rajashekhar M A 2010-07-30 10:48:17 UTC
Created attachment 435525 [details]
Script to start IO and run faults on netapp controllers

Please use this script for starting the tests. There was an issue with an argument for dt in the previous script.

Comment 64 Mike Snitzer 2010-07-31 03:42:36 UTC
I'm fairly certain I've reproduced this bug.  Interestingly, I did so _without_ any heavy IO from dt (in fact I wasn't doing any IO aside from regularly running multipath -ll).

I started by adding scsi_dh_alua debugging to the latest 5.5.z stream kernel, that kernel is available here:
http://people.redhat.com/msnitzer/RPMS/bz606259/kernel-2.6.18-194.11.1.el5.alua_dbg.x86_64.rpm

I then proceeded to do some basic failure tests while connected to the first NetApp controller:
1) fcp stop
2) fcp start

I issued multipath -ll throughout to collect alua_check_sense data.

When all paths were back and the sense_data and scsi errors stopped I proceeded to fail over the 2nd controller to the 1st with:
1) cf takeover
2) cf giveback

During giveback I noticed that multipath -ll stalled just after having printed various checker down messages:
sdaa: checker msg is "directio checker reports path is down"
sdab: checker msg is "directio checker reports path is down"
sdac: checker msg is "directio checker reports path is down"
...

I then saw a tremendous amount of NOT_READY/0x04/0x0a messages (from the debugging I added to scsi_dh_alua's alua_check_sense):

sd 1:0:3:4: alua_check_sense: NOT_READY/0x04/0x0a
sd 1:0:3:8: alua_check_sense: NOT_READY/0x04/0x0a
sd 1:0:3:9: alua_check_sense: NOT_READY/0x04/0x0a
...

Total duration of this NOT_READY/0x04/0x0a flood was ~8 minutes:
Jul 30 23:09:39 pnate kernel: sd 1:0:3:0: alua_check_sense: NOT_READY/0x04/0x0a
...
Jul 30 23:17:08 pnate kernel: sd 1:0:3:6: alua_check_sense: NOT_READY/0x04/0x0a

While this was occurring, the NetApp controller did not appear to be happy (note that the NetApp timestamps are exactly an hour behind the Linux host):

na3040a(takeover)> cf giveback
Fri Jul 30 22:09:36 EST [na3040a (takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
na3040a(takeover)> Fri Jul 30 22:09:36 EST [na3040a: cf.fm.givebackStarted:warning]: Cluster monitor: giveback started
Fri Jul 30 22:09:37 EST [na3040b/na3040a: iscsi.service.shutdown:info]: iSCSI service shutdown
Fri Jul 30 22:09:37 EST [na3040b/na3040a: fcp.service.shutdown:info]: FCP service shutdown
Fri Jul 30 22:09:37 EST [na3040a: netif.linkInfo:info]: Ethernet e0c: Link configured down.
Fri Jul 30 22:09:37 EST [na3040a: cf.rsrc.transitTime:notice]: Top Giveback transit times wafl=400 {sync_clean=156, giveback_sync=125, forget=106, finish=12, vol_refs=1, mark_abort=0, wait_offline=0, wait_create=0, abort_scans=0, drain_msgs=0}, wafl_gb_sync=190, raid=129, ndmpd=91, sanown_replay=32, registry_giveback=24, nfsd=11, java=5, vdisk=3, ems=2
Fri Jul 30 22:09:37 EST [na3040a: asup.msg.giveback.delayed:info]: giveback AutoSupport delayed 5 minutes (until after the giveback process is complete).
Fri Jul 30 22:09:37 EST [na3040a: cf.fm.givebackComplete:warning]: Cluster monitor: giveback completed
Fri Jul 30 22:09:37 EST [na3040a: cf.fm.givebackDuration:warning]: Cluster monitor: giveback duration time is 1 seconds
Fri Jul 30 22:09:37 EST [na3040a: cf.fsm.stateTransit:warning]: Cluster monitor: TAKEOVER --> UP
Fri Jul 30 22:09:38 EST [na3040a: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of na3040a by na3040b disabled (unsynchronized log)
Fri Jul 30 22:09:39 EST [na3040a: cf.fm.timeMasterStatus:info]: Acting as cluster time slave
Fri Jul 30 22:09:40 EST [na3040a: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of na3040b disabled (partner booting)
Fri Jul 30 22:09:50 EST [na3040a: cf.partner.short_uptime:warning]: Partner up for 3 seconds only
Fri Jul 30 22:09:51 EST [na3040a: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of na3040b disabled (unsynchronized log)
Fri Jul 30 22:10:50 EST [na3040a: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #5 SEND_DESC_ERROR 2
Fri Jul 30 22:10:50 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:10:54 EST [na3040a: cf.rv.notConnected:error]: Connection for cfo_rv failed
Fri Jul 30 22:11:00 EST [na3040a: cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
Fri Jul 30 22:12:03 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:13:17 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:14:30 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout

na3040a>
na3040a>
na3040a>
na3040a> Fri Jul 30 22:16:23 EST last message repeated 2 times
Fri Jul 30 22:17:04 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:17:45 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:18:23 EST [na3040a: asup.post.sent:notice]: Cluster Notification message posted to NetApp: Cluster Notification from na3040a (CLUSTER GIVEBACK COMPLETE) INFO
Fri Jul 30 22:18:26 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
Fri Jul 30 22:19:08 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous connection timeout
    Fri Jul 30 22:19:47 EST [na3040a: cf.rv.nicReset:error]: Reset cluster interconnect(s) due to unsynchronized log
Fri Jul 30 22:19:47 EST [na3040a: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to rendezvous fm timeout
Fri Jul 30 22:19:47 EST [na3040a: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Fri Jul 30 22:19:56 EST [na3040a: cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
Fri Jul 30 22:19:58 EST [na3040a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of na3040b enabled
Fri Jul 30 22:19:58 EST [na3040a: cf.fsm.takeoverByPartnerEnabled:notice]: Cluster monitor: takeover of na3040a by na3040b enabled
Fri Jul 30 22:20:06 EST [na3040a: monitor.globalStatus.ok:info]: The system's global status is normal.

Comment 65 Mike Snitzer 2010-07-31 04:22:17 UTC
Created attachment 435721 [details]
log that shows netapp commands and output

Full NetApp log from when I triggered the faults on the NetApp controller, as discussed in comment #64.

Comment 66 Mike Snitzer 2010-07-31 04:24:40 UTC
Created attachment 435722 [details]
gzip'd messages during fcp stop/start

Comment 67 Mike Snitzer 2010-07-31 04:25:49 UTC
Created attachment 435723 [details]
gzip'd messages during cf takeover/giveback

Comment 68 Mike Snitzer 2010-08-02 16:33:14 UTC
(In reply to comment #64)
> I'm fairly certain I've reproduced this bug.  Interestingly, I did so _without_
> any heavy IO from dt (in fact I wasn't doing any IO aside from regularly
> running multipath -ll).

Ben clarified that multipath -ll will hang waiting for all recovering paths.  So I clearly need to have active IO (via dt, etc.) to see whether the active paths are still responsive even though the one controller is taking 7-8 minutes to recover (and the giveback to complete).

> I started by adding scsi_dh_alua debugging to the latest 5.5.z stream kernel,
> that kernel is available here:
> http://people.redhat.com/msnitzer/RPMS/bz606259/kernel-2.6.18-194.11.1.el5.alua_dbg.x86_64.rpm

In parallel to my continued testing it would be great if NetApp could try reproducing with this "2.6.18-194.11.1.el5.alua_dbg" kernel.  There will be a flood of scsi_dh_alua debugging but it should help shed light on whether the kernel is seeing LUNs in the "NOT_READY/0x04/0x0a" state for extended periods and is properly retrying as required.

> Total duration of this NOT_READY/0x04/0x0a flood was ~8 minutes:
> Jul 30 23:09:39 pnate kernel: sd 1:0:3:0: alua_check_sense: NOT_READY/0x04/0x0a
> ...
> Jul 30 23:17:08 pnate kernel: sd 1:0:3:6: alua_check_sense: NOT_READY/0x04/0x0a
> 
> While this was occurring the NetApp controller did not appear to be happy 

We'd like to understand if this is a known issue with the NetApp controller (and the motivation for waiting 10 minutes between takeover and giveback).

Comment 69 Mike Christie 2010-08-02 20:12:40 UTC
(In reply to comment #64)
> During giveback I noticed that multipath -ll stalled just after having printed
> various checker down messages:
> sdaa: checker msg is "directio checker reports path is down"
> sdab: checker msg is "directio checker reports path is down"
> sdac: checker msg is "directio checker reports path is down"

I think the use of the directio path checker could cause problems. IOs like a READ can cause path transitions with implicit failover. Does using a TUR for the path checker help?
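(For reference, trying that is a one-line change to the device stanza from comment #3, followed by reloading the maps - a sketch, with the unchanged settings elided:)

device {
        vendor "NETAPP"
        product "LUN"
        path_checker tur    # TEST UNIT READY instead of read IO to the paths
        ...
}

# service multipathd reload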

Comment 70 Mike Snitzer 2010-08-02 22:46:29 UTC
(In reply to comment #69)
> (In reply to comment #64)
> > During giveback I noticed that multipath -ll stalled just after having printed
> > various checker down messages:
> > sdaa: checker msg is "directio checker reports path is down"
> > sdab: checker msg is "directio checker reports path is down"
> > sdac: checker msg is "directio checker reports path is down"
> 
> I think the use of the directio path checker could cause problems. IOs like a
> READ can cause path transitions with implicit failover. Does using a TUR for
> the path checker help?    

I'll keep your suggestion in mind, but I can no longer reproduce the long 7-8 minute giveback that resulted in the NOT_READY/0x04/0x0a flood.  So I don't know whether or not directio was inducing the NetApp controller failures during giveback.

Comment 71 Mike Snitzer 2010-08-06 02:13:32 UTC
I have been running the takeover/giveback load test script from comment #63 for nearly 30 hours.  (I modified the script slightly to run a single dt instance that uses 4 dt processes, each doing 5G of IO.)

I have not been able to reproduce this delayed IO issue.

Comment 72 Martin George 2010-08-06 12:31:42 UTC
We are currently running with the ALUA debug kernel and will update you with the results.

Comment 73 Martin George 2010-08-09 19:56:38 UTC
(In reply to comment #69)
> 
> I think the use of the directio path checker could cause problems. IOs like a
> READ can cause path transitions with implicit failover. Does using a TUR for
> the path checker help?    

So are you saying there are issues with directio compared to other checkers like tur? We switched to directio because of bug 247443.

Comment 74 Mike Christie 2010-08-10 06:44:33 UTC
(In reply to comment #73)
> (In reply to comment #69)
> > 
> > I think the use of the directio path checker could cause problems. IOs like a
> > READ can cause path transitions with implicit failover. Does using a TUR for
> > the path checker help?    
> 
> So are you saying there are issues with directio compared to other checkers
> like tur? We switched to directio because of bug 247443.    

directio does READ IO, right? So with implicit failover, if dm-multipath is trying to use the paths in the secondary path group (it is sending READs/WRITEs), but the path checker is sending IO to the primary paths, could the IO to the two groups cause each other to transition? Do we end up ping-ponging, with each group repeatedly causing the other to transition?

Or will it take X IOs before a transition occurs?

Comment 75 Ben Marzinski 2010-09-01 14:43:08 UTC
You are correct that the directio checker sends IO to all the paths to verify that they are active.  I have no idea if this could cause a ping-pong, but if you can't reproduce this on earlier releases it seems less likely, since there haven't been any changes to how the directio code works.  But just to confirm: with implicit ALUA, will IO going to the non-optimal paths be a problem?

Comment 76 Ben Marzinski 2010-09-27 18:04:22 UTC
What are your results from running with the debug kernel?

Comment 77 Andrius Benokraitis 2010-10-07 20:18:14 UTC
(In reply to comment #76)
> What are your results from running with the debug kernel?

Martin - any progress based on the comments?

Comment 79 Mike Snitzer 2010-11-30 19:26:06 UTC
AFAIK NetApp's testing with their latest firmware and the transitioning patch from bug 619361 has proven stable.  As such, I believe this BZ has been resolved indirectly.

*** This bug has been marked as a duplicate of bug 619361 ***

Comment 80 Marc Grimme 2011-05-11 11:24:51 UTC
Sorry to ask on an already closed bug.

But I observed similar behaviour with RHEL 5.6 (2.6.18-238.9.1.el5), root on NetApp and 50 LUNs on the same NetApp. Failover is configured via ALUA.

During a controller takeover, the machine becomes unresponsive for about 10 minutes.

Can anybody confirm that this bug has not been seen since the release of 5.6?

Thanks for your help.
Marc.

