Bug 705430 - [NetApp CQ179680] [RH Support case 00472442] Device Mapper Multipath creates multipath device for internal hard disk
Summary: [NetApp CQ179680] [RH Support case 00472442] Device Mapper Multipath creates multipath device for internal hard disk
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-17 16:23 UTC by Sean Stewart
Modified: 2012-03-19 18:58 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-03-19 18:58:50 UTC
Target Upstream Version:


Attachments
multipath.conf from partner system (3.61 KB, text/plain)
2011-05-17 18:24 UTC, Bryn M. Reeves
multipath -v4 -ll from partner system (91.24 KB, text/plain)
2011-05-17 18:24 UTC, Bryn M. Reeves
initramfs that doesn't allow the system to boot (6.56 MB, application/octet-stream)
2011-05-18 19:56 UTC, Sean Stewart

Description Sean Stewart 2011-05-17 16:23:34 UTC
This bugzilla is for support case 00472442
I am entering this bugzilla to see if I can get some other people to look at this before the OS goes GA.

Details of the problem:
I did a fresh install of RHEL 6.1 on a host with a single internal hard disk, without multipath enabled. After this, I set up multipath to start on the host.  Before activating multipath, the output of "vgs --options vg_name,pv_name" shows that the root LUN is on disk /dev/sda2.  After activating multipath, the internal disk gets virtualized into /dev/dm-0, and the output of the vgs command shows that the root volume group sits on the multipath device.  If I explicitly blacklist the internal disk, a message shows that the device is blacklisted, but the multipath device is created anyway.  If I chkconfig multipathd off and reboot, multipath still activates and virtualizes the internal disk.
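
For illustration, blacklisting the internal disk by WWID looks roughly like this in /etc/multipath.conf (a sketch only; the WWID is the internal disk's as shown in the logs below, and the exact stanzas tried are in the attached multipath.conf):

blacklist {
        wwid 3500000e01453a9c0
}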

Configuration Details:
This has been observed on at least five different servers with differently branded internal hard disks. Every attempt to prevent the internal disk from being virtualized has failed.

Steps to Reproduce:
1. Fresh install RHEL6.1 RC2
2. Set up multipath.conf, activate multipath, reboot (see the sketch below)
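
For step 2, the setup amounts to roughly the following (a sketch assuming the stock RHEL 6 tools; mpathconf ships with device-mapper-multipath):

mpathconf --enable            # create a default /etc/multipath.conf
vi /etc/multipath.conf        # add blacklist and device entries
chkconfig multipathd on
service multipathd start
reboot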

Expected Results:
After activating multipath, it should virtualize the external storage disks with vendor/product ID LSI INF-01-00, and not the internal disk. It should ignore the internal disk based on the blacklist rules.

Actual Results:
A multipath device is created from the internal disk.  If user-friendly names are used, the blacklist prevents a user-friendly name from being assigned to the internal disk, but there is still a mapping for the internal disk.

Logs:
vgs --options vg_name,pv_name

BEFORE activating multipath
  PV        VG
  /dev/sda2  vg_dhcp135157577

AFTER activating multipath
  VG              PV
  vg_dhcp135157577 /dev/mapper/3500000e01453a9c0p2


Output of multipath showing the device being blacklisted:

May 10 14:02:27 | sda: (FUJITSU:MAY2036RC) vendor/product blacklisted

The multipath device:

3500000e01453a9c0 dm-0 FUJITSU,MAY2036RC
size=34G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 1:0:0:0 sda 8:0 active ready running

This multipath device shows up no matter what blacklist/filter settings I use in /etc/lvm/lvm.conf and /etc/multipath.conf.  I never experienced this problem in RHEL 6.0, so this implies there was some pretty big change in multipath between 6.0 and 6.1.
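
For reference, the lvm.conf side of this is the filter setting in the devices section - a sketch of the general shape, not the exact filter used:

filter = [ "r|^/dev/mapper/3500000e01453a9c0|", "a|.*|" ]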

Comment 2 Bryn M. Reeves 2011-05-17 17:57:45 UTC
Can you post the full output of the multipath command that produced the following line:

May 10 14:02:27 | sda: (FUJITSU:MAY2036RC) vendor/product blacklisted

Comment 3 Bryn M. Reeves 2011-05-17 18:24:13 UTC
Created attachment 499418 [details]
multipath.conf from partner system

Comment 4 Bryn M. Reeves 2011-05-17 18:24:59 UTC
Created attachment 499421 [details]
multipath -v4 -ll from partner system

Comment 5 Bryn M. Reeves 2011-05-17 18:39:00 UTC
Never mind, found the files I needed in the ticket.

Was the initramfs re-made after making changes to multipath.conf? 

Also, the data in the sosreports does not seem to match that in comment #0 - the comment in the bug has dm-0 as the internal disk:

3500000e01453a9c0 dm-0 FUJITSU,MAY2036RC
size=34G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 1:0:0:0 sda 8:0 active ready running

That would stand to reason if the device were being created by a stale initramfs configuration.

But looking at the sosreport data it's moved up to dm-1:

3500000e014719ab0 dm-1 FUJITSU,MAY2036RC
size=34G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 4:0:0:0  sdaw 67:0   active ready running

There's also a bunch of syntax errors in multipath.conf:

May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 4, invalid keyword: selector
May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 34, invalid keyword: polling_interval
May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 52, invalid keyword: polling_interval
May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 70, invalid keyword: polling_interval
May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 88, invalid keyword: polling_interval
May 17 09:32:33 kswc-achilles multipathd: multipath.conf line 106, invalid keyword: polling_interval

The polling_interval errors are caused by specifying this keyword in a device block - it is a global value for multipathd and cannot be set on a per-device basis. The selector error is because this keyword is now "path_selector" (it's a bit confusing, as there are still some examples floating around that use "selector"; I think the deprecated keyword "default_selector" is still supported, but "selector" alone will trigger an error).
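
For comparison, the valid placement would be roughly this (a sketch; the values are illustrative, not a recommendation):

defaults {
        polling_interval 10
}
devices {
        device {
                vendor "LSI"
                product "INF-01-00"
                path_selector "round-robin 0"
        }
}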

Below these is a rename:

May 17 09:32:34 kswc-achilles multipathd: 3500000e014719ab0: rename 3500000e014719ab0 to mpathy

This also makes me think there might be a stale initramfs creating the rogue device here. Could you attach the initramfs image for the running kernel, either here or to the support ticket? (Or re-run dracut and see if the problem goes away.)
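
For completeness, regenerating the image for the running kernel is along the lines of:

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)

or simply "dracut -f" to overwrite the default image.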

Comment 6 Sean Stewart 2011-05-17 21:19:38 UTC
The device changed from dm-0 to dm-1 after I changed some of the parameters in multipath.conf. Sorry for the confusion there.

I ran dracut -f and then rebooted, and it caused my system to give this message every time:
"Kernel panic - not syncing: Attempted to kill init!"

A couple of other RHEL 6.1 RC2 hosts in our lab are hitting the same problem.  I reinstalled with RC3 to see whether either issue could still be hit.  I was able to work around the problem in the bug description by blacklisting all devices except the external storage devices, running dracut -f, and rebooting.  So far, I have not hit the kernel panic again.
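
The blacklist-everything workaround looks roughly like this (a sketch of the approach, not the exact file):

blacklist {
        devnode "*"
}
blacklist_exceptions {
        device {
                vendor "LSI"
                product "INF-01-00"
        }
}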

Comment 7 Bryn M. Reeves 2011-05-17 22:50:51 UTC
Can you please include the full output of the commands you ran, complete boot logs leading up to the panic, or the initramfs images themselves? It is not possible to debug a boot failure like this from the limited information in comment #6.

If you'd like to upload the images I'd be happy to take a look - however, the support ticket may be a better location, as the ticketing system has a larger attachment size limit (dracut images generated without -H can be quite large, since they include modules for "generic" configurations).

Comment 8 Sean Stewart 2011-05-17 23:21:41 UTC
I will have to get back to you on that, as my system is gone, but one of my coworkers did save the initramfs for one of the other systems that experienced the same problem.  On it, boot gave a message about not being able to load modules.dep, so it looks like the system was unable to figure out what drivers to load during boot.  Unwrapping the initial ramdisk showed that the modules.dep file was in fact missing from it.  In that case, I know an LSI SAS driver was compiled, depmod -a was run, the initramfs was recreated, and then device-mapper-multipath was activated and dracut -f was run.  In my configuration, the host was up and running a cluster with 24 volumes.  All I did was modify multipath.conf, issue dracut -f, and reboot.  The third case where we hit this was a little more complicated.
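
For reference, "unwrapping" the initramfs to check for modules.dep goes something like this (the image path is illustrative):

mkdir /tmp/initrd && cd /tmp/initrd
zcat /boot/initramfs-$(uname -r).img | cpio -idmv
ls lib/modules/$(uname -r)/modules.dep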

I will see how much info we can gather and submit it first thing tomorrow.  Wouldn't it be more appropriate to submit it as a separate bugzilla?  This one can probably be closed, if what I originally described is the expected behavior of multipath.

Comment 9 Bryn M. Reeves 2011-05-18 00:00:52 UTC
Thanks - that would be helpful. If modules.dep was missing, it really sounds like depmod was not run (but this sounds like a 3rd-party module build, so it's hard to say); it's impossible to know for sure without the dracut output or the resulting image.

I think it would be better to keep it in the existing support ticket for now and we can file new bugs as appropriate.

Comment 10 Sean Stewart 2011-05-18 19:56:18 UTC
Created attachment 499682 [details]
initramfs that doesn't allow the system to boot

Here's an initramfs from a system that hit the panic after issuing dracut -f.  It's less than half the size of the working initramfs.

Comment 11 Ben Marzinski 2011-06-06 17:53:47 UTC
So, have you been able to create a proper initramfs after enabling multipath? If so, does that solve your problem?

Comment 12 Suzanne Logcher 2011-10-06 18:41:38 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker.

Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Comment 13 Sean Stewart 2012-01-17 18:02:08 UTC
Yes, recreating the initramfs seems to solve the problem.  I'd say we can go ahead and close this, and if I find another, more specific problem, I'll open a new bug.
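
For the record, the working sequence boils down to this (as I understand the resolution):

vi /etc/multipath.conf    # blacklist the internal disk by WWID
dracut -f                 # rebuild the initramfs so the boot-time copy of multipath.conf matches
reboot

The dracut step is the critical one - a stale initramfs keeps applying the old multipath.conf before the root filesystem is mounted.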

