Red Hat Bugzilla – Bug 227645
[NetApp-S 4.7 bug] DM-MP fails to configure devices due to stale sd entries in the sysfs
Last modified: 2010-01-11 21:28:19 EST
Description of problem:
While configuring dm-mp devices on a RHEL4 U3 host, there have been cases where
the dm-mp driver fails to create the appropriate device maps if stale sd entries
are present in sysfs, i.e. configuring dm-mp devices fails on the host. As a
result, no dm-mp entries show up in the /dev/mapper/ directory or in the
"multipath -l/-ll" output.
In such cases, the scsi_id command fails for the affected sd entry. For example,
suppose sdc is one such device. The "scsi_id -gus /block/sdc" command then gives
the following output:
"3:0:0:0: page 0 not available"
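The check described above can be scripted to find every sd entry for which scsi_id fails. What follows is only a sketch: it assumes that scsi_id exits with a non-zero status when it prints an error such as "page 0 not available", and the function name find_stale_sd and the SCSI_ID and SYSFS_BLOCK variables are made up for illustration.

```sh
# Command and sysfs directory are overridable so the loop can be dry-run
# against a fake tree; both variable names are illustrative only.
SCSI_ID=${SCSI_ID:-scsi_id}
SYSFS_BLOCK=${SYSFS_BLOCK:-/sys/block}

# Print the name of every sd entry for which scsi_id fails.
find_stale_sd() {
    for dev in "$SYSFS_BLOCK"/sd*; do
        # Skip the unexpanded glob when no sd entries exist.
        [ -e "$dev" ] || continue
        name=${dev##*/}
        # Same invocation as in the report; assumes a failure yields a
        # non-zero exit status.
        if ! "$SCSI_ID" -gus "/block/$name" >/dev/null 2>&1; then
            echo "$name"
        fi
    done
}
```

Calling find_stale_sd as root on an affected host would then be expected to print names such as sdc, one per line.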
A workaround for this would be to blacklist the corresponding sd entry in the
multipath.conf file. This would help in properly configuring dm-mp devices on
Version-Release number of selected component (if applicable):
How reproducible:
Not always. But regularly.
Steps to Reproduce:
1. Configure dm-mp devices on any host where the "scsi_id -gus /block/<sd>"
command fails on an sd entry in sysfs.
Actual results:
dm-mp fails to configure devices in the above scenario. Correspondingly, no
entries are seen in /dev/mapper/ or in the "multipath -l/-ll" output.
Expected results:
dm-mp should have properly configured devices for the above scenario.
Can you run
# multipath -v6
# multipath -ll -v6
and copy the results into this bug? I'm not sure exactly where this is
failing. Also, do you know of any way to reliably create a stale sysfs entry?
This issue occurs intermittently. Right now, I don't have a host which exhibits
this behavior, so I am unable to provide you with the multipath output as
And by stale sysfs entry, I meant an sd entry that does not respond to the
"scsi_id -gus /block/<sd>" command. I am not sure how this entry came into being
in the first place. But this sd entry name kept shifting across reboots.
But what's evident here is that dm-mp does not configure any devices if the
scsi_id command fails on a sysfs sd entry (if it's not blacklisted). Does this
mean that dm-mp always expects scsi_id to pass for all corresponding sd entries?
No. Failing the getuid callout (usually scsi_id) will not cause multipath to
fail in this way. However, multipath relies on sysfs for multiple pieces of
information. Obviously, the stale sd entry is messing with one of these checks,
and multipath isn't handling the failure correctly. I was hoping that the
multipath -v6 output would point to where the failure was happening.
There aren't that many sysfs interactions in multipath. Even without any hints
from the debugging output, I should be able to track this down fairly easily.
However, if you do see this again, please run those commands and put the output
in the bugzilla.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Setting to NEEDINFO on NetApp to report debuginfo if and when it can be
reproduced. This is ongoing.
Created attachment 158195 [details]
multipath -ll -v6 output as requested
Created attachment 158197 [details]
multipath -ll -v6 & multipath -v6 outputs as requested
We were able to reproduce the issue on a RHEL 4.4 host. Attaching the logs as
In this case, the "scsi_id -gus /block/sdb" command failed with the following error:
"4:0:0:0: page 0 not available"
This eventually caused dm-mp to fail configuring devices (multipath -ll gave a
blank output). Once sdb was blacklisted using the devnode method in the
multipath.conf file, things came back to normal with the successful
configuration of dm-mp devices.
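For reference, a devnode blacklist entry of the kind described above could be generated as follows. This is a sketch under assumptions: the section keyword devnode_blacklist is what I would expect for the 0.4.5 packages on RHEL 4 (other releases use blacklist), and the function name print_devnode_blacklist and the SECTION variable are invented for this example. Check the multipath.conf shipped with your package before using the output.

```sh
# Print a multipath.conf blacklist stanza for the given sd names.
# The section keyword defaults to "devnode_blacklist" (an assumption about
# this multipath version); set SECTION=blacklist if your release expects that.
print_devnode_blacklist() {
    section=${SECTION:-devnode_blacklist}
    echo "$section {"
    for name in "$@"; do
        # Anchor the regex so that "sdb" does not also match e.g. "sdba".
        printf '        devnode "^%s$"\n' "$name"
    done
    echo "}"
}
```

For example, "print_devnode_blacklist sdb" prints a stanza that can be merged into multipath.conf by hand. Since the name of the failing sd entry was reported to shift across reboots, such an entry may need to be revisited after each reboot.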
Thanks. That should be all I need.
Looking at the output from these two commands, I'm confused. Both outputs
seem correct on their own. The only issue is that they don't agree with each
other. The multipath -v6 -ll output looks exactly like what you would expect if
you were trying to list the multipath maps, and you had none configured. The
multipath -v6 output looks exactly like what you would expect if you ran this
command, but you already had the maps configured. If these commands were run
one right after the other (in either order), I cannot see how you would get this
Looking at the output for the multipath -v6 command, right after the
# all paths :
section, it lists the parameters of the multipath maps that are already known to
device-mapper. The code paths for the two commands do not diverge until after
this point; however, this listing never appears in the multipath -v6 -ll command
output (which is exactly what should happen if there are no multipath maps known
to device-mapper). Do you know if these commands were run back to back?
Further, it seems from the multipath -v6 output, that the device already was
created, according to device mapper. Is it possible that the device is getting
created, but the device node is not? Of course, if the multipath -v6 -ll
command was in fact run immediately after, I cannot account for why it did not
list the device. The only answer that seems possible (but not at all likely) is
that for some reason, multipath -v6 -ll failed when talking to device mapper.
This is very odd, since the calls to device-mapper were exactly the same as with
the multipath -v6 command.
By the way, since you created this on RHEL 4.4, I looked at the
device-mapper-multipath-0.4.5-16.RHEL4 package (which is the same as the
device-mapper-multipath-0.4.5-16.1.RHEL4 package, minus some minor changes to
some EMC-specific code). If you are not using one of these two packages, please
upgrade multipath to 0.4.5-16.1.RHEL4, as this is the latest RHEL 4.4 package.
I can stick some error messages in where the device-mapper code could fail. But,
if this is where it is failing, there is no way for multipath to recover. There
may be a bug I can't see here, or it may be in device-mapper itself, but until I
can find out exactly what's failing, I can't really debug it.
If you see this again, can you check whether the device was actually
created by running:
dmsetup table --target multipath
If it is, and you still can't list with multipath -v6 -ll, try running that
command under gdb, and see if it is crashing. If the command is not crashing,
and the paths get listed in the debug output, but maps are not being listed,
then it must be silently failing while trying to communicate with device-mapper.
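The comparison requested above (does device-mapper know about multipath maps that multipath -ll fails to list?) could be scripted roughly as follows. This is a sketch, not a tested diagnostic: the function name check_multipath_maps and the DMSETUP and MULTIPATH variables are made up, and it relies on dmsetup table printing one "name: start length multipath ..." line per map.

```sh
# Commands are overridable so the check can be exercised with stand-ins;
# both variable names are illustrative only.
DMSETUP=${DMSETUP:-dmsetup}
MULTIPATH=${MULTIPATH:-multipath}

# Compare device-mapper's view of multipath maps with what multipath -ll lists.
check_multipath_maps() {
    # Count table lines whose target type is "multipath".
    dm_maps=$("$DMSETUP" table --target multipath 2>/dev/null | grep -c ' multipath ' || true)
    # Count non-empty lines printed by multipath -ll.
    listed=$("$MULTIPATH" -ll 2>/dev/null | grep -c . || true)
    if [ "$dm_maps" -gt 0 ] && [ "$listed" -eq 0 ]; then
        echo "mismatch: device-mapper has $dm_maps multipath map(s), multipath -ll lists none"
        return 1
    fi
    echo "device-mapper maps: $dm_maps, multipath -ll lines: $listed"
}
```

A "mismatch" result would correspond to the case discussed above, where the maps exist in device-mapper but the listing comes back blank; that is the point at which running the listing command under gdb becomes worthwhile.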
I'll get back to you on this.
There are a bunch of new printouts going into 4.6 to help locate this problem,
but the fix will not make 4.6.
Moving to RHEL 4.7 per Comment #15.
Please let me know when you recreate this problem.
(In reply to comment #15)
> There are a bunch of new printouts going into 4.6 to help locate this problem,
> but the fix will not make 4.6.
NetApp has not been able to reproduce this so far. They will test 4.7 beta. If
the problem is not seen there, this BZ will be closed.
NETAPP: Has this been tested on RHEL 4.7? This needs to be tested ASAP.
We'll test this on RHEL 4.7 and update the bugzilla accordingly. Thanks.
I've not been able to reproduce this issue. So closing this for now.