There are two bugs in multipathd that keep it from switching into recovery mode when the last path has failed, and then disabling queue_if_no_path. First, during device deletion: if a multipath device tries to reload when its last path fails, but the path is deleted while the reload is in progress, the multipath device can lose track of its device type. This keeps it from using its device-specific configuration, so if the no_path_retry value in the defaults section does not match the value in the devices section for that device, the device may fail to disable queueing. Second, when calculating the number of active paths, multipath usually counts both active and ghost paths; however, there is one place in the code where only active paths are counted. If a ghost path fails after the count was taken using only active paths, the active-path count will be wrong. Since multipath only enters recovery mode when there are exactly 0 active paths, this can keep multipath from entering recovery mode at the right time.
Test packages are available here: http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/i386/ and http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/x86_64/
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: If a device's last path was deleted while the multipathd daemon was trying to reload the device map, or if a ghost path failed, multipathd did not always switch into recovery mode. As a result, multipath devices could not recover I/O operations in setups that were supposed to temporarily queue I/O if all paths were down. Multipath now correctly recovers I/O operations as configured.
This is a RHEL 5 bug, and the fix is built into the RHEL 5 tarball. However, now that I look at RHEL 6, part of this fix is missing there; the rest of it is already in the RHEL 6 tarball. I'm going to open a bug for the missing piece.
The queue_if_no_path issue has been fixed by device-mapper-multipath-0.4.7-46.el5. Tested with the emc_clariion checker, with no_path_retry set to 5 in the devices section of multipath.conf, using dd to generate I/O and this command to bring the disks offline:
===
for X in sdg sdi sdw sdy; do
    echo offline > "/sys/block/$X/device/state"
done
===
Before the patch, we got an incorrect no_path_retry value:
===
mpath5: Entering recovery mode: max_retries=60
===
After:
===
mpath5: Entering recovery mode: max_retries=5
===
For the PATH_GHOST issue, I was not able to reproduce it, lacking the specified path checkers (rdac, tur, and hp_sw). I also tried changing the path_checker for DGC, but still could not reproduce it. Can we get a partner to test the PATH_GHOST issue?
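For reference, a minimal multipath.conf sketch of the mismatch being exercised above (values illustrative, taken from the log lines: 60 from the defaults section, 5 from the devices-section entry for the EMC CLARiiON / "DGC" array). When the first bug makes the daemon lose the device type, it falls back to the defaults value, matching the incorrect "max_retries=60" seen before the patch:
===
defaults {
        no_path_retry 60
}
devices {
        device {
                vendor        "DGC"
                path_checker  emc_clariion
                no_path_retry 5
        }
}
===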
I opened a ticket, Case 00484908, about a similar issue detected on a RHEL 5.6 x86_64 platform. Please see the QLogic forum: http://solutions.qlogic.com/KanisaSupportSite/browse.do?BROWSE_forum.NodeType=leaf&WidgetName=BROWSE_forum&BROWSE_forum.thisPageUrl=%2Fforum%2Fforumshome.do&NodeType=leaf&NodeName=Fibre+Channel+Linux&TaxoName=FB_ForumBrowse&BROWSE_forum.NodeId=FB_HBA_LINUX_1_2&BROWSE_forum.IsForum=true&NodeId=FB_HBA_LINUX_1_2&id=m3&BROWSE_forum.TaxoName=FB_ForumBrowse&AppContext=AC_ForumCategoryPage
(In reply to comment #17)
> I opened a ticket Case 00484908 about a similar issue detected on RHEL 5.6
> X86_64 platform.
>
> please See Qlogic forum :
> http://solutions.qlogic.com/KanisaSupportSite/browse.do?BROWSE_forum.NodeType=leaf&WidgetName=BROWSE_forum&BROWSE_forum.thisPageUrl=%2Fforum%2Fforumshome.do&NodeType=leaf&NodeName=Fibre+Channel+Linux&TaxoName=FB_ForumBrowse&BROWSE_forum.NodeId=FB_HBA_LINUX_1_2&BROWSE_forum.IsForum=true&NodeId=FB_HBA_LINUX_1_2&id=m3&BROWSE_forum.TaxoName=FB_ForumBrowse&AppContext=AC_ForumCategoryPage

Does the rpm in comment #1 solve your problem?
Tried 100 times on EMC CX (active/passive setup) using the tur path_checker. Could not hit the "active paths: <negative_number>" issue. The basic function test will be noted in the errata. Sanity Only.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1032.html