Bug 309891 - Improve support for RDAC storage to dm-multipath - MD3000 failback fails
Summary: Improve support for RDAC storage to dm-multipath - MD3000 failback fails
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath   
Version: 5.1
Hardware: All Linux
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact: Corey Marthaler
Duplicates: 307151
Depends On: 248931
Reported: 2007-09-27 19:48 UTC by Ben Marzinski
Modified: 2010-01-12 02:39 UTC (History)
31 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2007-12-07 23:12:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
serial console output of a Dell PE2950 booting with 20 LUNs attached to an MD3000 with redundant paths (64.46 KB, text/plain)
2007-10-19 21:54 UTC, Vinny Valdez

Comment 1 Ben Marzinski 2007-09-27 19:50:29 UTC
Since code from bz #248931 went into 5.1, this clone should be used for the
continuing issues.

Comment 2 Ben Marzinski 2007-10-17 20:00:36 UTC
Uploading the console output would be useful.

Do you know what programs are trying to access the passive paths? If it's only
LVM, you should try filtering out the devices in /etc/lvm/lvm.conf.  This will
keep lvm from trying to scan them, which is what you want, since they are owned
by the multipath device anyway.

If programs are accessing the passive path directly (for example by reading from
/dev/sdX, where /dev/sdX is a passive path) there is no way that
device-mapper-multipath can stop them. LVM does this, but you can add filters to
avoid it.  If there are other programs that do it, they hopefully have a method
for filtering devices as well.

If the accesses are coming from multipath, it's either a configuration problem,
or a bug in the code.

Comment 3 Vinny Valdez 2007-10-17 22:07:16 UTC
I will do that and upload the output tomorrow.

As far as LVM, this happens even with unconfigured LUNs before creating any
volumes on them.  However, I see what you are getting at, since LVM will scan
all block devices looking for LVMs.  I tried this:

  filter = [ "a/sda/", "r/.*/" ]

But the behavior is the same.

Comment 4 Ben Marzinski 2007-10-18 16:05:36 UTC
Hmm. If it's not LVM, do you have any idea what programs ARE sending IO to the
passive paths?  To check if it's multipathd (which would point to a bad checker
function, either misconfigured or buggy) you can run

# multipathd -k

and then repeatedly run (you can scroll through the command history with the
arrow keys)

> show paths

while watching

# tail -f /var/log/messages

This way, you should be able to see if the error messages from the paths
coincide with the path checker running on them.  Use ctrl-d to exit the
interactive multipathd shell.
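
The manual check above could also be scripted so that each run carries a
syslog-style timestamp to line up against /var/log/messages. This is only a
sketch: poll_cmd is a hypothetical helper, and passing the command directly
after -k (non-interactive mode) is an assumption about the installed multipathd.

```shell
#!/bin/sh
# poll_cmd COUNT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND COUNT times, INTERVAL seconds apart,
# printing a syslog-style timestamp before each run.
poll_cmd() (
    count=$1; interval=$2; shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '=== %s ===\n' "$(date '+%b %e %H:%M:%S')"
        "$@"
        sleep "$interval"
        i=$((i + 1))
    done
)

# Example (assumes multipathd accepts a command after -k):
#   poll_cmd 12 5 multipathd -k"show paths"
```

If the path errors in the log cluster right after each timestamp, the path
checker is the likely source; if they show up at unrelated times, something
else is touching the passive paths.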

Also, instead of removing the device-mapper-multipath module, you can just
blacklist all your devices in /etc/multipath.conf:

blacklist {
        devnode "*"
}

This will let you know if you can see these errors even without any multipathed
devices running.

Also, just to double-check that LVM scanning isn't causing these errors, you
can run

# lvscan -vv

And check to see if it causes any errors, and also make sure that the
passive paths aren't among the paths that it scans.

Comment 6 Vinny Valdez 2007-10-19 21:51:32 UTC
It looks like the LVM filter did in fact solve part of this problem, but not all
of it.  When I tested the filter above with 20 LUNs earlier, I timed the boot
like usual, and when it started spewing the same errors and was 10 minutes
into the boot I wrote it off as the same behavior.

However, on further inspection, it looks like the filter actually did solve a
lot of issues.  Previously, lvscan, pvs, pvcreate, and other commands would
generate these errors, but now they don't.  Boot time with 20 LUNs was 13
minutes instead of 17.

So it seems there are still two other spots where the boot process hangs: 1. udev
and 2. haldaemon.

Udev gives errors during "Starting udev", which adds boot time in proportion to
the number of attached LUNs.

But I think the worst offender is haldaemon.  I turned off this service, and
rebooted my system.  Boot time with 20 LUNs was just under 6 minutes.

According to an article I read online, "HAL is used for discovering storage,
networking <snip>".  So this seems to account for most of the boot time.
After I get a login screen, if I run "service haldaemon restart" it will print
tons of errors and be unavailable until complete.

Do I need to open a bug under haldaemon?  What happens on other types of shared
storage such as EMC or iSCSI and 20 LUNs?

fdisk -l generates the same errors as before, but that is because it is trying
to list the partition table on each device, including the passive ones.

I turned on multipathd and modified the filter to:

  filter = [ "a/sda/", "a|mapper|", "r/.*/" ]

And access to the dm device worked fine.

I will attach a console log of the system booting with 20 LUNs.  I added a
$(date) statement at the top and bottom of rc.sysinit, which gives a good idea
of elapsed time. Also, there are so many paths that devices whose names start
with sda (such as /dev/sdaa) were not filtered out, so I need to find a better
way to accept only the local disk.
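
For comparing boot times across runs, the two date stamps could be taken as
epoch seconds (date +%s) and subtracted. A minimal sketch follows; the helper
name format_elapsed is hypothetical and not part of rc.sysinit.

```shell
#!/bin/sh
# format_elapsed START END
# Hypothetical helper: START and END are epoch seconds, for example from
# "date +%s" printed at the top and bottom of rc.sysinit.
format_elapsed() (
    total=$(( $2 - $1 ))
    printf '%dm%02ds\n' "$(( total / 60 ))" "$(( total % 60 ))"
)
```

Called with the two stamps read from the console log, it prints the elapsed
time as minutes and seconds.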

Comment 7 Vinny Valdez 2007-10-19 21:53:12 UTC
Also, everything above was done with all devices blacklisted in multipath.conf
and multipathd off.

Comment 8 Vinny Valdez 2007-10-19 21:54:37 UTC
Created attachment 233141
serial console output of a Dell PE2950 booting with 20 LUNs attached to an MD3000 with redundant paths

Comment 9 Ben Marzinski 2007-10-26 20:32:49 UTC
*** Bug 307151 has been marked as a duplicate of this bug. ***

Comment 10 Ben Marzinski 2007-11-20 21:54:52 UTC
To filter out your sda[a-z] devices, you should be able to use a filter
line like:

filter = [ "a/sda$/", "a|mapper|", "r/.*/" ]
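
To illustrate why the "$" anchor matters, here is a small shell sketch that
models the filter as a first-match-wins list of unanchored regular expressions.
It is not LVM's code: lvm_filter_verdict is a hypothetical helper, and accepting
a device that matches no pattern is assumed from the behavior described in the
lvm.conf comments.

```shell
#!/bin/sh
# lvm_filter_verdict DEVICE RULE [RULE...]
# Hypothetical model of an lvm.conf filter list. Each RULE looks like
# "a/regex/" or "r|regex|": an action letter (a = accept, r = reject),
# a delimiter, an unanchored regex, and the closing delimiter.
lvm_filter_verdict() (
    dev=$1; shift
    for rule in "$@"; do
        action=${rule%"${rule#?}"}   # first character of the rule
        regex=${rule#??}             # drop action letter and opening delimiter
        regex=${regex%?}             # drop closing delimiter
        if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            # The first matching rule decides the outcome.
            if [ "$action" = a ]; then echo accept; else echo reject; fi
            exit 0
        fi
    done
    echo accept   # assumed default when no rule matches
)

lvm_filter_verdict /dev/sdaa 'a/sda/' 'r/.*/'                 # accept
lvm_filter_verdict /dev/sdaa 'a/sda$/' 'a|mapper|' 'r/.*/'    # reject
lvm_filter_verdict /dev/sda  'a/sda$/' 'a|mapper|' 'r/.*/'    # accept
```

Under this model the anchored pattern would also reject /dev/sda2, so if a
physical volume sits on a partition of the local disk, the pattern would
presumably need to allow a trailing partition number.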

I couldn't find any straightforward way to filter the haldaemon, but that
doesn't mean that there isn't one.  You should probably open a bugzilla against
hal. Either that or you can just change the component of this bug to hal, if you
don't have any more multipath specific issues.

Comment 11 Ben Marzinski 2007-12-07 18:25:54 UTC
Are there any more multipath issues related to this bug?  Otherwise I will close it.

Comment 12 Vinny Valdez 2007-12-07 18:39:43 UTC
Dell decided to go with LSI's MPP driver, so I have not been able to test this
out lately.  I will open a bug with haldaemon about this.  This bug can be closed.

Thank you.
