Bug 227645

Summary: [NetApp-S 4.7 bug] DM-MP fails to configure devices due to stale sd entries in the sysfs
Product: Red Hat Enterprise Linux 4
Component: device-mapper-multipath
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Version: 4.7
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Reporter: Martin George <marting>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Corey Marthaler <cmarthal>
CC: agk, andriusb, atodorov, bmarzins, christophe.varoqui, coughlan, dwysocha, egoggin, junichi.nomura, kueda, lmb, mbroz, prockai, tranlan, xdl-redhat-bugzilla
Keywords: OtherQA
Doc Type: Bug Fix
Last Closed: 2008-06-24 14:59:27 UTC
Bug Blocks: 246627, 252336, 367631
Attachments:
Description                                              Flags
multipath -ll -v6 output as requested                    none
multipath -ll -v6 & multipath -v6 outputs as requested   none

Description Martin George 2007-02-07 11:09:52 UTC
Description of problem:
While configuring dm-mp devices on a RHEL4 U3 host, there have been cases where
dm-mp fails to create the appropriate device maps when stale sd entries are
present in sysfs. As a result, no dm-mp entries show up in the /dev/mapper/
directory or in the "multipath -l/-ll" output.

In such cases, the scsi_id command fails for the stale sd entry. For example,
suppose sdc is one such device. The "scsi_id -gus /block/sdc" command then gives
the following output:

"3:0:0:0: page 0 not available"

A workaround is to blacklist the corresponding sd entry in the multipath.conf
file. With that entry blacklisted, dm-mp devices are configured properly on the
host.
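
The check described above can be sketched as a small shell loop. This is only a sketch under stated assumptions: it reuses the "scsi_id -gus /block/<sd>" invocation quoted in this report and assumes that a non-zero exit status or empty output marks a failing entry. SYSFS_ROOT, SCSI_ID and find_failing_sd are hypothetical names introduced for this example, not part of any multipath tool.

```shell
#!/bin/sh
# Sketch: list sd entries in sysfs for which the scsi_id callout fails.
# SYSFS_ROOT and SCSI_ID are hypothetical overrides introduced for this example.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
SCSI_ID="${SCSI_ID:-scsi_id}"

find_failing_sd() {
    for path in "$SYSFS_ROOT"/block/sd*; do
        [ -e "$path" ] || continue      # the glob matched nothing
        dev="${path##*/}"
        # Same invocation as in the report: scsi_id -gus /block/<sd>
        # Assumption: non-zero exit or empty output means the entry is failing.
        if ! out=$("$SCSI_ID" -g -u -s "/block/$dev" 2>&1) || [ -z "$out" ]; then
            echo "$dev: ${out:-no output}"
        fi
    done
}
```

Calling find_failing_sd as root prints one line per sd entry for which the callout failed, followed by whatever scsi_id reported for it. Those entries are the candidates for blacklisting.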

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.5-12.0.RHEL4

How reproducible:
Not always, but regularly.

Steps to Reproduce:
1. Configure dm-mp devices on any host where the "scsi_id -gus /block/<sd>"
command fails for an sd entry in sysfs.

Actual results:
dm-mp fails to configure devices in the above scenario. Consequently, no
entries are seen in /dev/mapper/ or in the "multipath -l/-ll" output.

Expected results:
dm-mp should have properly configured devices for the above scenario.

Additional info:

Comment 1 Ben Marzinski 2007-03-29 23:16:10 UTC
Can you run

# multipath -v6

and

# multipath -ll -v6

and copy the results into this bug? I'm not sure exactly where this is
failing.  Also, do you know of any way to reliably create a stale sysfs entry?

Comment 2 Martin George 2007-04-04 14:41:41 UTC
This issue occurs intermittently. Right now, I don't have a host which exhibits
this behavior, so I am unable to provide you with the multipath output as
requested.

And by stale sysfs entry, I meant an sd entry that does not respond to the
"scsi_id -gus /block/<sd>" command. I am not sure how this entry came into being
in the first place. But this sd entry name kept shifting across reboots.

But what's evident here is that dm-mp does not configure any devices if the
scsi_id command fails on a sysfs sd entry (if it's not blacklisted). Does this
mean that dm-mp always expects scsi_id to pass for all corresponding sd entries?

Comment 3 Ben Marzinski 2007-04-05 18:13:46 UTC
No. Failing the getuid callout (usually scsi_id) will not cause multipath to
fail in this way.  However, multipath relies on sysfs for multiple pieces of
information. Obviously, the stale sd entry is messing with one of these checks,
and multipath isn't handling the failure correctly.  I was hoping that the
multipath -v6 output would point to where the failure was happening.

There aren't that many sysfs interactions in multipath. Even without any hints
from the debugging output, I should be able to track this down fairly easily.
However, if you do see this again, please run those commands and put the output
in the bugzilla.

Comment 4 RHEL Program Management 2007-05-09 07:53:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Andrius Benokraitis 2007-06-28 13:54:42 UTC
Setting to NEEDINFO on NetApp to report debuginfo if and when it can be
reproduced. This is ongoing.

Comment 8 Martin George 2007-06-29 11:23:20 UTC
Created attachment 158195 [details]
multipath -ll -v6 output as requested

Comment 9 Martin George 2007-06-29 11:27:09 UTC
Created attachment 158197 [details]
multipath -ll -v6 & multipath -v6 outputs as requested

Comment 10 Martin George 2007-06-29 11:37:05 UTC
Ben,

We were able to reproduce the issue on a RHEL 4.4 host. Attaching the logs as
requested.

In this case, the "scsi_id -gus /block/sdb" command failed with the following error:
"4:0:0:0: page 0 not available"

This eventually caused dm-mp to fail configuring devices (multipath -ll gave a
blank output). Once sdb was blacklisted using the devnode method in the
multipath.conf file, things came back to normal with the successful
configuration of dm-mp devices.
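
For reference, a devnode blacklist entry of the kind described above might look like the following multipath.conf fragment. This is a sketch, not the configuration actually used on the affected host. The device name sdb is taken from this comment and the regular expression is an assumption. The section name also depends on the multipath-tools version (devnode_blacklist in older 0.4.5-based packages, blacklist in later ones), so check the multipath.conf shipped with the installed package.

```
# /etc/multipath.conf (fragment)
# Assumption: this release names the section devnode_blacklist;
# newer multipath-tools versions name it blacklist.
devnode_blacklist {
        devnode "^sdb$"
}
```

Because the failing sd entry's name kept shifting across reboots (see Comment 2), a fixed devnode pattern like this one may need to be updated after a reboot.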

Comment 12 Ben Marzinski 2007-06-29 15:02:09 UTC
Thanks. That should be all I need.

Comment 13 Ben Marzinski 2007-07-19 17:48:56 UTC
Looking at the output from these two commands, I'm confused. Both outputs
seem correct on their own.  The only issue is that they don't agree with each
other.  The multipath -v6 -ll output looks exactly like what you would expect if
you were trying to list the multipath maps, and you had none configured.  The
multipath -v6 output looks exactly like what you would expect if you ran this
command, but you already had the maps configured.  If these commands were run
one right after the other (in either order), I cannot see how you would get this
output.

Looking at the output for the multipath -v6 command, right after the

#
# all paths :
#

section, it lists the parameters of the multipath maps that are already known to
device-mapper.  The code paths for the two commands do not diverge until after
this point. However, this listing never appears in the multipath -v6 -ll output
(which is exactly what should happen if there are no multipath maps known to
device-mapper). Do you know if these commands were run back to back?
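
One way to remove the ordering question raised above is to capture both outputs in a single run. The following is a minimal sketch, assuming only the two commands already named in this bug. MULTIPATH, OUTDIR and capture_back_to_back are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Sketch: run both debug commands back to back and keep timestamped logs.
# MULTIPATH and OUTDIR are hypothetical overrides introduced for this example.
MULTIPATH="${MULTIPATH:-multipath}"
OUTDIR="${OUTDIR:-/tmp/multipath-debug}"

capture_back_to_back() {
    mkdir -p "$OUTDIR" || return 1
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    # Record each command's start time so the ordering is unambiguous.
    {
        echo "# started: $(date -u)"
        "$MULTIPATH" -v6 2>&1
    } > "$OUTDIR/multipath-v6.$stamp.log"
    {
        echo "# started: $(date -u)"
        "$MULTIPATH" -ll -v6 2>&1
    } > "$OUTDIR/multipath-ll-v6.$stamp.log"
    # Print the two log paths, in the order the commands were run.
    echo "$OUTDIR/multipath-v6.$stamp.log"
    echo "$OUTDIR/multipath-ll-v6.$stamp.log"
}
```

Calling capture_back_to_back as root writes two log files and prints their paths. Both logs carry a start timestamp, so they can be attached together.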


Further, it seems from the multipath -v6 output that the device was already
created, according to device-mapper.  Is it possible that the device is getting
created, but the device node is not?  Of course, if the multipath -v6 -ll
command was in fact run immediately after, I cannot account for why it did not
list the device. The only answer that seems possible (but not at all likely) is
that for some reason, multipath -v6 -ll failed when talking to device-mapper.
This is very odd, since the calls to device-mapper were exactly the same as with
the multipath -v6 command.

By the way, since you reproduced this on RHEL 4.4, I looked at the
device-mapper-multipath-0.4.5-16.RHEL4 package (which is the same as the
device-mapper-multipath-0.4.5-16.1.RHEL4 package, minus some minor changes to
some EMC specific code). If you are not using one of these two packages, please
upgrade multipath to 0.4.5-16.1.RHEL4, as this is the latest RHEL 4.4 package.

I can stick some error messages in where the device-mapper code could fail. But,
if this is where it is failing, there is no way for multipath to recover.  There
may be a bug I can't see here, or it may be in device-mapper itself, but until I
can find out exactly what's failing, I can't really debug it.

If you see this again, please check whether the device was actually created by
running:

dmsetup table --target multipath

If it was, and you still can't list it with multipath -v6 -ll, try running that
command under gdb to see if it is crashing.  If the command is not crashing,
and the paths get listed in the debug output but the maps are not, then it must
be silently failing while trying to communicate with device-mapper.
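
The "device created but device node missing" possibility can be checked with a short script along these lines. This is a sketch: it assumes that each line of "dmsetup table" output begins with the map name followed by a colon, and that nodes are expected under /dev/mapper. DMSETUP, MAPPER_DIR and check_map_nodes are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Sketch: compare multipath maps known to device-mapper with /dev/mapper nodes.
# DMSETUP and MAPPER_DIR are hypothetical overrides introduced for this example.
DMSETUP="${DMSETUP:-dmsetup}"
MAPPER_DIR="${MAPPER_DIR:-/dev/mapper}"

check_map_nodes() {
    # Each table line starts with "<map name>: "; keep only the name.
    "$DMSETUP" table --target multipath | sed -n 's/^\([^:]*\): .*/\1/p' |
    while read -r name; do
        if [ -e "$MAPPER_DIR/$name" ]; then
            echo "$name: node present"
        else
            echo "$name: map exists but node missing"
        fi
    done
}
```

If check_map_nodes reports a map whose node is missing, the map was created in device-mapper but its device node was not. If it prints nothing, device-mapper reported no multipath maps at all.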

Comment 14 Martin George 2007-07-23 09:41:34 UTC
Ben,

I'll get back to you on this.

Comment 15 Ben Marzinski 2007-08-03 02:08:24 UTC
There are a bunch of new printouts going into 4.6 to help locate this problem,
but the fix will not make 4.6.

Comment 16 Andrius Benokraitis 2007-08-15 13:53:02 UTC
Moving to RHEL 4.7 per Comment #15.

Comment 17 Ben Marzinski 2007-10-10 19:14:11 UTC
Please let me know when you recreate this problem.

Comment 18 Martin George 2007-10-10 19:29:20 UTC
Will do.

Comment 21 Tom Coughlan 2008-01-28 14:33:15 UTC
(In reply to comment #15)
> There are a bunch of new printouts going into 4.6 to help locate this problem,
> but the fix will not make 4.6.

NetApp has not been able to reproduce this so far. They will test 4.7 beta. If
the problem is not seen there, this BZ will be closed.


Comment 24 Andrius Benokraitis 2008-06-03 03:12:15 UTC
NETAPP: Has this been tested on RHEL 4.7? This needs to be tested ASAP.

Comment 25 Martin George 2008-06-05 07:34:08 UTC
We'll test this on RHEL 4.7 and update the bugzilla accordingly. Thanks.

Comment 26 Martin George 2008-06-24 14:59:27 UTC
I've not been able to reproduce this issue, so closing this for now.