Bug 676123

Summary: multipath returns before all device nodes exist
Product: Red Hat Enterprise Linux 5
Component: device-mapper-multipath
Version: 5.6
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: James Ralston <ralston>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: agk, bmarzins, bmr, bturner, christophe.varoqui, dwysocha, heinzm, junichi.nomura, kueda, lmb, mbroz, prajnoha, prockai
Last Closed: 2011-02-12 17:42:08 UTC

Attachments:
add a delay to rc.sysinit to work around the bug (see comment 1)

Description James Ralston 2011-02-08 21:29:18 UTC
We have multiple hosts with QLogic ISP4032 iSCSI cards in them. For each host, we configure multiple paths to the SAN, and then use multipath to present a single logical device.  E.g.:

mpath1 (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) dm-9 EQLOGIC,100E-00
[size=8.0T][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:2:0 sdb 8:16  [active][undef]
 \_ 2:0:2:0 sdc 8:32  [active][undef]

We then create a logical volume group on the multipath device (/dev/dm-9, in this case). This has worked fine for (literally) years.

But just recently, our hosts that run multipath have begun to fail to reboot. This happens because the device nodes for the multipath-backed filesystems haven't been created yet when rc.sysinit calls lvm.static:

Setting up Logical Volume Management:   File-based locking initialisation failed.
  Couldn't find device with uuid xxxxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxxxxx.
  Refusing activation of partial LV f0. Use --partial to override.
  4 logical volume(s) in volume group "example0" now active
  6 logical volume(s) in volume group "example1" now active
[FAILED]

This causes fsck to fail, dropping the host to single-user mode.

Through experimentation, we have discovered that if we add a delay to rc.sysinit between the call to /sbin/multipath.static and the call to /sbin/lvm.static, the multipath device nodes are present by the time /sbin/lvm.static runs; all volume groups then activate properly and the boot does not fail.
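
For illustration, the workaround amounts to something like the following in /etc/rc.d/rc.sysinit. This is a rough sketch of the idea rather than the attached patch verbatim; the guard condition and the 5-second value are assumptions:

if [ -x /sbin/multipath.static ]; then        # guard condition is an assumption
        modprobe dm-multipath >/dev/null 2>&1
        /sbin/multipath.static -v 0
        # Workaround: give udev time to create the /dev/dm-* device nodes
        # before rc.sysinit reaches the /sbin/lvm.static call further down.
        sleep 5
fi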

So clearly, something has recently changed in the behavior of multipath, udev, and/or the kernel. It's an unfortunate interaction issue, but since it prevents hosts that have filesystems backed by multipath from successfully rebooting, this problem needs to be corrected immediately.

Versions:

0:device-mapper-1.02.55-2.el5.x86_64
0:device-mapper-event-1.02.55-2.el5.x86_64
0:device-mapper-multipath-0.4.7-42.el5.x86_64
0:kernel-2.6.18-238.1.1.el5.x86_64
0:udev-095-14.24.el5.x86_64

Comment 1 James Ralston 2011-02-08 21:30:33 UTC
Created attachment 477694 [details]
add a delay to rc.sysinit to work around the bug

This is how we've worked around the problem for now.

Comment 2 James Ralston 2011-02-08 21:39:05 UTC
Cross-filed as Red Hat Support Case 00416066.

Comment 4 Alasdair Kergon 2011-02-08 23:46:41 UTC
"We then create a logical volume group on the multipath device (/dev/dm-9, in
this case)."

Is there a reason you're using 'dm-9' (under async control of udev) there rather than /dev/mpath/<something> (under synchronous multipath control)?

Comment 5 Alasdair Kergon 2011-02-08 23:53:58 UTC
Could try 'udevsettle' instead of the sleep, or check for any refs (e.g. in lvm.conf filters) to /dev/dm-* and replace with /dev/mapper or /dev/mpath.
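
As a hedged sketch of that first alternative (the 30-second timeout is an arbitrary choice, not anything taken from this report):

/sbin/multipath.static -v 0
# Instead of a fixed sleep, wait until udev has drained its event queue
# (so the /dev/dm-* nodes and /dev/mpath/* symlinks exist), bounded at 30s.
/sbin/udevsettle --timeout=30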

Comment 6 Bryn M. Reeves 2011-02-09 10:51:44 UTC
The /dev/mpath symlinks are also controlled by udev in RHEL5. Afaik only the /dev/mapper device nodes are created synchronously.

What filters are in use in lvm.conf?

Comment 7 James Ralston 2011-02-09 17:59:38 UTC
This is the typical lvm.conf filter we use on our multipath hosts:

filter = [ "a|/dev/sda|", "a|^/dev/dm|", "r/.*/" ]

The reason why it's so restrictive is because we can't let LVM glom onto the block devices that correspond to the physical paths to the SAN; we need LVM to find only the multipath block device.

But based on your comments, what it sounds like is that udev is responsible for (asynchronously) creating the /dev/dm* devices after multipath (synchronously) creates the /dev/mapper/mpath* devices, which means that we can eliminate the race condition in rc.sysinit by changing our lvm.conf filter specification to:

filter = [ "a|/dev/sda|", "a|^/dev/mapper/|", "r/.*/" ]

Have I understood things correctly?

Comment 8 Bryn M. Reeves 2011-02-09 18:19:14 UTC
Exactly; this is the filter style that I normally recommend for environments like yours.

"Accept the things you explicitly want to be used as PVs and reject everything else."

You can use udevsettle here if you want to sync up with the creation of the dm-N nodes, but I don't personally recommend it; allowing LVM2 to scan /dev/mapper and use those nodes is more robust, because they are created synchronously with respect to the devices.
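
Spelled out in lvm.conf, that recommendation looks roughly like the sketch below; the filter line is the one from comment 7, and the devices section is simply the standard place it lives:

devices {
    # Accept the local boot disk and the synchronously created multipath
    # nodes under /dev/mapper; reject everything else.
    filter = [ "a|/dev/sda|", "a|^/dev/mapper/|", "r/.*/" ]
}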

Comment 9 James Ralston 2011-02-09 20:49:27 UTC
Confirmed; the system boots as expected (without needing sleep/udevsettle) if LVM looks for /dev/mapper/* nodes instead of /dev/dm-* nodes.

As it turns out, udev creates /dev/dm-9 only ~1.2 seconds after multipath creates /dev/mapper/mpath1:

$ stat /dev/mapper/mpath1 /dev/dm-9 | grep Change
Change: 2011-02-09 15:19:25.254970688 -0500
Change: 2011-02-09 15:19:26.489970688 -0500

...but as we've seen, rc.sysinit calls lvm.static so soon after calling multipath.static that it lands inside that window.

I see now that the RHEL5 DM Multipath guide (section 2.1, "Multipath Device Identifiers") explicitly advises preferring the /dev/mapper/mpath* devices over /dev/mpath/mpath* and /dev/dm-*, for precisely this reason. So, this is our fault for not following that advice. Thanks for setting us straight.

Please close as NOTABUG (or as appropriate), as I don't seem to be able to close this ticket myself.