Bug 227645
Summary: | [NetApp-S 4.7 bug] DM-MP fails to configure devices due to stale sd entries in the sysfs
---|---
Product: | Red Hat Enterprise Linux 4
Component: | device-mapper-multipath
Version: | 4.7
Status: | CLOSED INSUFFICIENT_DATA
Severity: | high
Priority: | high
Hardware: | All
OS: | Linux
Reporter: | Martin George <marting>
Assignee: | Ben Marzinski <bmarzins>
QA Contact: | Corey Marthaler <cmarthal>
CC: | agk, andriusb, atodorov, bmarzins, christophe.varoqui, coughlan, dwysocha, egoggin, junichi.nomura, kueda, lmb, mbroz, prockai, tranlan, xdl-redhat-bugzilla
Keywords: | OtherQA
Doc Type: | Bug Fix
Last Closed: | 2008-06-24 14:59:27 UTC
Bug Blocks: | 246627, 252336, 367631
Description
Martin George
2007-02-07 11:09:52 UTC
Can you run # multipath -v6 and # multipath -ll -v6 and copy the results into this bug? I'm not sure exactly where this is failing. Also, do you know of any way to reliably create a stale sysfs entry?

This issue occurs intermittently. Right now, I don't have a host which exhibits this behavior, so I am unable to provide you with the multipath output as requested. By stale sysfs entry, I meant an sd entry that does not respond to the "scsi_id -gus /block/<sd>" command. I am not sure how this entry came into being in the first place, but the sd entry name kept shifting across reboots. What is evident here is that dm-mp does not configure any devices if the scsi_id command fails on a sysfs sd entry (if it's not blacklisted). Does this mean that dm-mp always expects scsi_id to pass for all corresponding sd entries?

No. Failing the getuid callout (usually scsi_id) will not cause multipath to fail in this way. However, multipath relies on sysfs for multiple pieces of information. Obviously, the stale sd entry is messing with one of these checks, and multipath isn't handling the failure correctly. I was hoping that the multipath -v6 output would point to where the failure was happening. There aren't that many sysfs interactions in multipath. Even without any hints from the debugging output, I should be able to track this down fairly easily. However, if you do see this again, please run those commands and put the output in the bugzilla.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Setting to NEEDINFO on NetApp to report debuginfo if and when it can be reproduced.

This is ongoing.

Created attachment 158195
multipath -ll -v6 output as requested

Created attachment 158197
multipath -ll -v6 & multipath -v6 outputs as requested
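For anyone trying to spot such an entry before multipath runs, the stale-entry definition used in this bug (an sd entry that does not respond to scsi_id) can be sketched as a small scan. This is only a rough illustration: the helper names `list_sd_entries`, `run_scsi_id` and `find_unresponsive` are hypothetical and not part of multipath-tools, and treating a non-zero exit status or empty output as "unresponsive" is an assumption. The scsi_id command line is the one quoted in this report; newer scsi_id releases take different options.

```python
import os
import subprocess


def list_sd_entries(sysfs_block="/sys/block"):
    """Return the sorted names of sd* entries under the sysfs block directory."""
    try:
        names = os.listdir(sysfs_block)
    except OSError:
        # sysfs not mounted or path missing: nothing to scan
        return []
    return sorted(n for n in names if n.startswith("sd"))


def run_scsi_id(name):
    """Run the callout quoted in this bug and return (exit_code, output)."""
    try:
        proc = subprocess.run(
            ["scsi_id", "-gus", "/block/" + name],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # scsi_id is not installed or not executable on this host
        return 127, str(exc)
    return proc.returncode, proc.stdout.strip()


def find_unresponsive(names, callout=run_scsi_id):
    """Return {name: output} for entries whose callout fails or prints nothing.

    Assumption: a non-zero exit code or empty output marks the entry
    as a candidate stale entry.
    """
    stale = {}
    for name in names:
        code, output = callout(name)
        if code != 0 or not output:
            stale[name] = output
    return stale
```

Calling `find_unresponsive(list_sd_entries())` on an affected host would return the candidate entries together with whatever scsi_id printed for them. The `callout` parameter is there so the logic can be exercised without real hardware.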
Ben, we were able to reproduce the issue on a RHEL 4.4 host. Attaching the logs as requested. In this case, the "scsi_id -gus /block/sdb" command failed with the following error: "4:0:0:0: page 0 not available". This eventually caused dm-mp to fail to configure devices (multipath -ll gave blank output). Once sdb was blacklisted using the devnode method in the multipath.conf file, things came back to normal, with the dm-mp devices configured successfully.

Thanks. That should be all I need.

Looking at the output from these two commands, I'm confused. Both outputs seem correct on their own. The only issue is that they don't agree with each other. The multipath -v6 -ll output looks exactly like what you would expect if you were trying to list the multipath maps and you had none configured. The multipath -v6 output looks exactly like what you would expect if you ran this command but already had the maps configured. If these commands were run one right after the other (in either order), I cannot see how you would get this output.

Looking at the output for the multipath -v6 command, right after the "# all paths :" section, it lists the parameters of the multipath maps that are already known to device-mapper. The code paths for the two commands do not diverge until after this point. However, this listing never appears in the multipath -v6 -ll command output (which is exactly what should happen if there are no multipath maps known to device-mapper). Do you know if these commands were run back to back?

Further, it seems from the multipath -v6 output that the device was already created, according to device-mapper. Is it possible that the device is getting created, but the device node is not? Of course, if the multipath -v6 -ll command was in fact run immediately after, I cannot account for why it did not list the device. The only answer that seems possible (but not at all likely) is that for some reason, multipath -v6 -ll failed when talking to device-mapper. This is very odd, since the calls to device-mapper were exactly the same as with the multipath -v6 command.

By the way, since you created this on RHEL 4.4, I looked at the device-mapper-multipath-0.4.5-16.RHEL4 package (which is the same as the device-mapper-multipath-0.4.5-16.1.RHEL4 package, minus some minor changes to some EMC-specific code). If you are not using one of these two packages, please upgrade multipath to 0.4.5-16.1.RHEL4, as this is the latest RHEL 4.4 package.

I can stick some error messages in where the device-mapper code could fail. But if this is where it is failing, there is no way for multipath to recover. There may be a bug I can't see here, or it may be in device-mapper itself, but until I can find out exactly what's failing, I can't really debug it. If you see this again, can you check whether the device was actually created by running:

dmsetup table --target multipath

If it was, and you still can't list it with multipath -v6 -ll, try running that command under gdb and see if it is crashing. If the command is not crashing, and the paths get listed in the debug output but maps are not being listed, then it must be silently failing while trying to communicate with device-mapper.

Ben, I'll get back to you on this.

There are a bunch of new printouts going into 4.6 to help locate this problem, but the fix will not make 4.6.

Moving to RHEL 4.7 per Comment #15.

Please let me know when you recreate this problem.

Will do.

(In reply to comment #15)
> There are a bunch of new printouts going into 4.6 to help locate this problem,
> but the fix will not make 4.6.

NetApp has not been able to reproduce this so far. They will test 4.7 beta. If the problem is not seen there, this BZ will be closed.

NETAPP: Has this been tested on RHEL 4.7? This needs to be tested ASAP.

We'll test this on RHEL 4.7 and update the bugzilla accordingly. Thanks.

I've not been able to reproduce this issue. So closing this for now.
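The dmsetup check requested earlier in this thread (is the map known to device-mapper even though multipath -ll lists nothing?) could be scripted roughly as below. This is a sketch under assumptions, not code from multipath-tools: the function names `parse_dmsetup_table` and `multipath_maps` are hypothetical, and the parser assumes the usual "name: start length target params" line layout that dmsetup prints when listing several devices.

```python
import subprocess


def parse_dmsetup_table(text):
    """Map each multipath device name to its table parameters.

    Assumes lines of the form "name: start length target params...".
    Lines without that shape (for example "No devices found") are skipped.
    """
    maps = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        fields = rest.split()
        if sep and len(fields) >= 3 and fields[2] == "multipath":
            maps[name] = " ".join(fields[3:])
    return maps


def multipath_maps(run=subprocess.check_output):
    """Run the command from this thread and parse its output.

    Needs root on a real host; `run` can be replaced for testing.
    """
    output = run(
        ["dmsetup", "table", "--target", "multipath"],
        universal_newlines=True,
    )
    return parse_dmsetup_table(output)
```

If `multipath_maps()` returns a non-empty dictionary while multipath -ll prints nothing, that would match the case described above where the map exists in device-mapper but multipath fails to list it, and running multipath under gdb would be the next step.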