Bug 524178

Summary: Got error output when restarting the multipathd service
Product: Red Hat Enterprise Linux 5
Reporter: Yufang Zhang <yuzhang>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: low
Version: 5.4
CC: abaron, agk, apevec, bmarzins, bmr, christophe.varoqui, cpelland, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, llim, lmb, mbroz, mburns, mgoodwin, mkenneth, mnovacek, ovirt-maint, prockai, Rhev-m-bugs, riek, tao, tranlan, vbian, ykaul, zliu
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 532765 (view as bug list)
Environment:
Last Closed: 2010-03-30 08:32:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 521258, 532765, 533951, 545219, 545578, 556823

Description Yufang Zhang 2009-09-18 08:56:39 UTC
Description of problem:
After a fresh install of rhevh with rhevm configured, I restarted the multipathd service and got the following error output:

device-mapper: table: 253:7: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 253:7: multipath: error getting device
device-mapper: ioctl: error adding target to table

Version-Release number of selected component (if applicable):
beta 20.1

How reproducible:
Always.

Steps to Reproduce:
1. Install rhevh on a host with only a local disk (with rhevm configured).
2. service multipathd restart
Stopping multipathd daemon: [  OK  ]
Starting multipathd daemon: SELinux: initialized (dev ramfs, type ramfs), uses genfs_contexts
[  OK  ]
[root@localhost ~]# device-mapper: table: 253:7: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 253:7: multipath: error getting device
device-mapper: ioctl: error adding target to table

  
Actual results:
device-mapper "error getting device" / "error adding target to table" messages are printed when the multipathd service is restarted.

Expected results:
No error output when restarting the multipathd service.

Additional info:

Comment 2 Perry Myers 2009-09-18 11:14:44 UTC
Did you use the multipath configuration screen in the firstboot menu to configure multipath.conf and lvm.conf?  Did you manually edit multipath.conf?

Please attach logs like ovirt.log, /var/log/messages to ALL bug reports.  Also please attach relevant configuration files.  In this case the relevant configuration file would be /etc/multipath.conf.  Without logs and config files we can't diagnose whether this is user error or a legitimate problem.

Comment 3 Alan Pevec 2009-09-18 11:42:24 UTC
Also attach output of multipath -v6

Comment 4 Alan Pevec 2009-09-18 11:48:54 UTC
> 1. Install rhevh on a host with only local disk (with rhevm configured).
> 2. service multipathd restart

BTW, you should not restart the multipathd service manually once the Node is registered with RHEV-M; VDSM restarts the service when needed.

Comment 5 Perry Myers 2009-09-18 13:57:11 UTC
OK, after further discussion with qwan, we determined that the error messages in the log are from multipathd trying to use local storage devices that are not properly blacklisted.  The reason multipathd is not blacklisting local storage is that vdsm overwrites multipath.conf with a less constrained version.  So these error messages are expected at this point.

RHEVH does properly blacklist local storage.  Reassigning to vdsm component.

Comment 7 Ayal Baron 2009-10-19 16:34:29 UTC
We have no way of determining a priori what should and should not be blacklisted.
The oVirt method of blacklisting everything makes sure that nothing works.
For this exact reason we added the option of manually overriding the multipath.conf file with the private flag.

Comment 8 Perry Myers 2009-10-19 18:19:05 UTC
(In reply to comment #7)
> We have no way of determining what should and should not be blacklisted
> apriori.

Selective whitelisting during node installation is one way of determining this.

> the ovirt method of blacklisting everything, makes sure that nothing works.

That's not the method we use.  oVirt uses a method of blacklisting everything and allowing the user to selectively whitelist devices during install.  This makes sure that everything works, and removes all of the spurious error messages.

This works, but as acathrow has pointed out it would be problematic to dynamically add LUNs post-installation.
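
For illustration, the blacklist-everything-then-whitelist approach in multipath.conf looks roughly like the sketch below; the WWID shown is a made-up placeholder, not taken from the RHEV-H generated file:

  blacklist {
          wwid ".*"
  }

  blacklist_exceptions {
          wwid "36006016023d02700f2ab5d9b84e2de11"
  }

Any device whose WWID does not match an exception is then skipped by multipath entirely, which is what suppresses the spurious "error getting device" messages for local disks.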

> For this exact reason we added the option of manually overriding the
> multipath.conf file with the private flag.  

The current vdsm method of just removing the blacklist is not good either, since you end up with lots of meaningless error messages in your logs that make it difficult to determine whether there is a real problem or not.

There are solutions here, but none of them involve vdsm alone.  Possibly we will need changes to device-mapper-multipath (perhaps we can make device-mapper-multipath smarter about determining which devices to scan based on SCSI information, e.g. never scan USB devices or local SATA disks).

Ben M, do you have any thoughts on this?

Comment 9 Ayal Baron 2009-10-20 07:49:24 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > We have no way of determining what should and should not be blacklisted
> > apriori.
> 
> Selective whitelisting during node installation is one way of determining this
> 
> > the ovirt method of blacklisting everything, makes sure that nothing works.
> 
> That's not the method we use.  oVirt uses a method of blacklisting everything
> and allowing the user to selectively whitelist devices during install.  This
> makes sure that everything works, and removes all of the spurious error
> messages.
Blacklisting everything by default and then manually whitelisting devices is very problematic for us.  First of all, it would mean that the system does not work by default, which is (a) a regression and (b) not the defined functionality.
Second, this would require a lot of logic on our side for something that seems to me to be entirely the wrong approach (preventing normal operation until manual configuration is done, because of a few warning messages in the log).  The normal use case is a few local disks which we do not want to multipath, and the rest which we do.  The few error messages can be eliminated later by manual override.
We could of course add heuristics as to which devices should be blacklisted and which should not, but that would still require the user to manually override when our heuristics are wrong, and again it seems like overkill for a minor issue.
> 
> This works, but as acathrow has pointed out it would be problematic to
> dynamically add LUNs post-installation.
> 
> > For this exact reason we added the option of manually overriding the
> > multipath.conf file with the private flag.  
> 
> The current vdsm method of just removing the blacklist is not good either,
> since you end up with lots of meaningless error messages in your logs that
> makes it difficult to determine if there is a real problem or not.
> 
> There are solutions here, but none of them involve vdsm alone.  Possibly we
> will need changes to device-mapper-multipath (perhaps we can make
> device-mapper-multipath smarter about determining which devices to scan based
> on scsi information: i.e. never scan USB devices, never scan local SATA disks)
> 
> Ben M, do you have any thoughts on this?

Comment 10 Perry Myers 2009-10-20 12:55:44 UTC
(In reply to comment #9)
> Blacklisting everything by default and then manually whitelisting devices is
> very problematic for us.  First of all, that would mean that the system will
> not work by default which is a. a regression b. not the defined functionality.

It depends on what you mean by 'by default'.  By default a system administrator always has to install either a base RHEL system or a RHEVH system.  If the administrator configures these systems during installation to identify the attached SAN devices, then 'by default' everything just works.

Not sure where the fear of having system administrators configure storage comes from.  It's required on RHEL systems today to get SAN storage working properly, so sysadmins are already familiar with needing to configure lvm.conf and multipath.conf.

> Second, this would require a lot of logic on our side for something that seems
> to me the totally wrong approach (preventing normal operation until manual
> configuration is done because of a few warning messages in the log).  The
> normal use case is for there to be a few local disks which we do not want to
> multipath and the rest we do.  The few error messages can be eliminated later
> by manual override.
> We could ofcourse add heuristics as to which devices should be blacklisted and
> which should not but that would still require the user to manually override
> when our heuristics are wrong and again seems like an overkill for a minor
> issue.

Flooding the logs with spurious error messages is not a minor issue.  It makes the system difficult to support, and customers looking at the logs will constantly be asking questions about them.

In any case, as I stated at the end of Comment #8, I think the right solution here is to make some changes to device-mapper-multipath to handle this in a more graceful manner, so that you could safely remove the blacklist "*" without generating error messages for any device that is not actually a SAN device.

Comment 14 Ben Marzinski 2009-10-23 15:25:04 UTC
It definitely seems possible to improve multipath's blacklisting ability.  I'm not sure how much work is involved.  I either need access to some SATA/USB storage, or I need some instructions on the best way to differentiate it from SAS/FC/iSCSI.  Also, doesn't SATA hardware support multipath now?

Comment 15 Perry Myers 2009-10-23 15:33:53 UTC
(In reply to comment #14)
> It definitely seems possible to increase multipaths blacklisting ability.  I'm
> not sure how much work is involved. I either need access to some SATA/USB
> storage, or

Can you just borrow a USB thumb drive from someone?  And you don't have a box with any local disks in it?  That would be a good starting point...

> I need some instructions on the best way to differentiate it from SAS/FC/iSCSI.
>  Also, doesn't SATA hardware support multipath now?  

If it does, then it doesn't need to be blacklisted :)

Comment 16 Ayal Baron 2009-10-27 12:52:40 UTC
(In reply to comment #14)
> It definitely seems possible to increase multipaths blacklisting ability.  I'm
> not sure how much work is involved. I either need access to some SATA/USB
> storage, or
> I need some instructions on the best way to differentiate it from SAS/FC/iSCSI.
>  Also, doesn't SATA hardware support multipath now?  

Hi Ben,
Since this bug is not in the vdsm domain, should I reassign this to you?
What component should it be moved to?

Comment 17 Ben Marzinski 2009-10-28 01:16:52 UTC
Sure.

Comment 18 Alan Pevec 2009-11-03 17:02:39 UTC
(adding edited comments from Perry and Ben, from an off-line thread)

There are two cases to handle then:

1. If there is no WWID for a specific device, note this in the log and
   never look at that device again.

2. If there is a WWID for the device, but the device is already in use
   (local filesystems/boot devices/LVM), print a friendlier message to
   the log stating that the device is in use, providing a bit more detail
   so that customers are not scared into thinking that their storage is
   failing.  Maybe add a line that says
   "Block device /dev/foo appears to be in use, not adding to multipath"
   or something.

> You shouldn't get repeat errors for a device that fails because it is in use.
> You get one when multipathd starts, and another one if you do
> # service multipathd reload
> since that forces a rescan of every device.
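
As a rough way to check case 1 by hand on a RHEL 5 box, you can ask scsi_id for a device's WWID directly (sda is just an example device; a device falling under case 1 prints nothing):

  scsi_id -g -u -s /block/sda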

Comment 20 Bryn M. Reeves 2009-11-06 15:25:44 UTC
I can probably find some boxes with SATA, USB, and SAS storage locally attached if that helps. I can also post sample output from these systems for reference.

We can differentiate between these types of storage using sysfs attributes. Right now, iSCSI and FC are the main transports for multipath systems, although certain CCISS controller configurations are also multipath capable.

I think it's technically possible with SATA II to support multiple point-to-point SATA connections, and there are a few NAS-type boxes on the market that feature dual eSATA ports (I assume this is primarily for link aggregation), but I don't think we would support any of these with the current multipath-tools.
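
As a sketch of the kind of sysfs check described above (exact sysfs layouts vary by kernel version, so treat the paths as illustrative rather than definitive):

  # The resolved device path hints at the transport: FC paths go through
  # rport-* directories and iSCSI paths through session* directories,
  # while local ATA disks do not.
  readlink -f /sys/block/sda/device

  # The low-level driver name is another hint, e.g. ata_piix/ahci for
  # local SATA versus qla2xxx/lpfc for FC HBAs.
  cat /sys/class/scsi_host/host0/proc_name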

Comment 21 Ben Marzinski 2009-12-03 22:20:45 UTC
I can easily add code to print a message if a device cannot be multipathed because it has no WWID.  Fixing the case where the device is already in use is a lot harder.  First, there's no way to know that that is why device-mapper failed; it doesn't provide any handy error codes.  But the bigger issue is that this happens all the time: every time you have your non-multipathed root filesystem on a SCSI device and don't bother to blacklist it, this happens.  That case might describe the majority of multipath users.  If I add a nice warning message, then suddenly a large number of multipath users (maybe most) will start seeing a warning message whenever they run multipath.  For multipathd, this will appear in the logs, right next to the kernel message, and hopefully make people less worried.  For multipath, it will appear whenever you run multipath without the -l option, and it will appear in their shell, where previously no error messages were reported.  Adding a new warning message that a large number of customers are likely to see seems like it will do more harm than good.

I can add a nice warning message that only appears when you bump the verbosity to -v3 (the default verbosity is -v2).  However, -v3 prints a whole lot of output, so the warning message will likely get lost in the mix; but at least people who are worried can bump up the verbosity and see it, if they dig through the debugging output.

So unless anyone has a serious disagreement, that's what I'm going to do.

Comment 22 Ben Marzinski 2009-12-04 21:37:44 UTC
Fixed. multipath now prints a helpful message if it fails to multipath a device because the device has no WWID.  If multipath is run with -v3, it will also print a message if it can't multipath a device because it is in use.
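
For anyone verifying, the behaviour described here can be exercised roughly as follows (a sketch of the commands only, not output from an actual run):

  # default verbosity (-v2): the "no WWID" message is printed
  multipath
  # higher verbosity: the "device in use" message is printed as well,
  # mixed in with the rest of the -v3 debugging output
  multipath -v3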

Comment 24 michal novacek 2009-12-23 14:31:43 UTC
old: 0.4.7-30.el5
new: 0.4.7-31.el5

The new version now correctly shows the message only at the higher verbosity level (-v3).

Comment 29 Ben Marzinski 2010-02-03 15:09:02 UTC
Each device is sent to the kernel separately, so comment #27 can't be what's at fault.  If devices stop getting built after an error, then it must be userspace's fault.  Also, the changes related to this bug are simply message printing.  We should probably open a new bug for this issue instead of using this one.

Comment 30 Issue Tracker 2010-02-15 23:32:29 UTC
Event posted on 02-16-2010 10:32am EST by mgoodwin


Hi Eduardo, it would also be a very interesting data point if, on a system where /var is a submount, you change the "locking_dir" setting in lvm.conf to /etc/lvm/locks (or somewhere else on the rootfs).  Knowing this will help isolate the root cause to lvm or multipath, one way or the other.

Thanks
-- Mark

This event sent from IssueTracker by mgoodwin, issue 430463
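
For reference, the lvm.conf change being suggested is, as a sketch, a one-line edit to the global section (the path shown is just an example of a rootfs location):

  global {
      locking_dir = "/etc/lvm/locks"
  }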

Comment 31 errata-xmlrpc 2010-03-30 08:32:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0255.html