Bug 677821

Summary:	multipathd occasionally doesn't stop queuing after no_path_retry times out.
Product:	Red Hat Enterprise Linux 5	Reporter:	Ben Marzinski <bmarzins>
Component:	device-mapper-multipath	Assignee:	Ben Marzinski <bmarzins>
Status:	CLOSED ERRATA	QA Contact:	Gris Ge <fge>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	5.6	CC:	agk, bdonahue, bmarzins, bmr, dwa, dwysocha, fge, guy.legac, heinzm, mbroz, prajnoha, prockai, qcai, zkabelac
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	If a device's last path was deleted while the multipathd daemon was trying to reload the device map, or if a ghost path failed, multipathd did not always switch into the recovery mode. As a result, multipath devices could not recover I/O operations in setups that were supposed to temporarily queue I/O if all paths were down. Multipath now correctly recovers I/O operations as configured.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-07-21 08:23:12 UTC	Type:	---
Bug Blocks:	683447

Description Ben Marzinski 2011-02-15 23:27:03 UTC

There are two bugs in multipathd that can keep it from switching into recovery mode when the last path has failed, and therefore from disabling queue_if_no_path once the retries run out.

First, during device deletion, if a multipath device tries to reload its map when the last path fails, and that path is deleted while the reload is in progress, the multipath device can lose track of its device type.  Without the device type, it cannot use the device-specific configuration.  If the no_path_retry value in the defaults section doesn't match the value in the devices section for the multipath device, the device may fail to disable queueing.
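
As a rough illustration of the kind of mismatch described above, a multipath.conf might look like the fragment below. The vendor/product match and both numeric values are assumptions chosen for the example and are not taken from an affected system. If the device type is lost, a device matching this entry could be handled without the devices-section value of 5.

```
defaults {
	# Illustrative value only
	no_path_retry	60
}
devices {
	device {
		# Hypothetical match for the array under test
		vendor		"DGC"
		product		".*"
		# Intended setting for this device type (illustrative)
		no_path_retry	5
	}
}
```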

Second, when calculating the number of active paths, multipath usually counts both active and ghost paths. However, there is one place in the code where only active paths are counted.  If a ghost path fails after the count has been taken using only active paths, the number of active paths will be wrong.  Since multipath only goes into recovery mode when there are 0 active paths, this keeps multipath from going into recovery mode at the right time.
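
The counting mismatch can be sketched with a toy shell model. This is not the multipathd code: the function names and the string path states are hypothetical, and the model assumes the counter is decremented by one each time a path fails.

```shell
#!/bin/sh
# Toy model only: path states are plain strings, not multipathd structures.

# Usual way of counting: active and ghost paths both count as usable.
count_usable() {
    n=0
    for state in "$@"; do
        case "$state" in
            active|ghost) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Inconsistent way of counting: ghost paths are ignored.
count_active_only() {
    n=0
    for state in "$@"; do
        case "$state" in
            active) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Scenario: the active path has already failed, one ghost path remains.
consistent=$(count_usable failed ghost)
inconsistent=$(count_active_only failed ghost)

# The ghost path now fails, so each counter is decremented once.
consistent=$((consistent - 1))
inconsistent=$((inconsistent - 1))

echo "consistent count:   $consistent"
echo "inconsistent count: $inconsistent"
```

In this model the consistent counter reaches 0, which is the condition for entering recovery mode, while the inconsistent counter skips past 0 to -1 and so never satisfies it.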

Comment 1 Ben Marzinski 2011-02-16 18:15:58 UTC

Test packages are available here:

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/i386/

and

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/x86_64/

Comment 9 Ben Marzinski 2011-04-08 02:51:01 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
If a device's last path was deleted while the multipathd daemon was trying to reload the device map, or if a ghost path failed, multipathd did not always switch into the recovery mode. As a result, multipath devices could not recover I/O operations in setups that were supposed to temporarily queue I/O if all paths were down. Multipath now correctly recovers I/O operations as configured.

Comment 14 Ben Marzinski 2011-04-27 16:00:44 UTC

This is a rhel5 bug, and this fix is built into the rhel5 tarball. However, now that I look at RHEL6, there's a part of this fix that is missing.  The rest of it is also already in the rhel6 tarball. I'm going to open a bug for the missing piece.

Comment 16 Gris Ge 2011-05-30 05:33:10 UTC

The queue_if_no_path issue has been fixed in device-mapper-multipath-0.4.7-46.el5.

Tested with emc_clariion_checker, with no_path_retry set to 5 in the devices section of multipath.conf.

I used dd to generate I/O, and these commands to bring the disks offline:
===
for X in sdg sdi sdw sdy; do
    echo offline > "/sys/block/$X/device/state"
done
===
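
For repeated runs, the loop above could be wrapped in a small helper that also brings the paths back online by writing "running" to the same sysfs file. This is only a sketch: set_path_state and the SYSFS_ROOT override are hypothetical names introduced here so the helper can be tried against a scratch directory.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Usage: set_path_state <offline|running> <dev>...
# Writes the requested state to each SCSI device's sysfs state file.
set_path_state() {
    state=$1
    shift
    for dev in "$@"; do
        echo "$state" > "$SYSFS_ROOT/block/$dev/device/state" || return 1
    done
}
```

With the devices from this test, that would be "set_path_state offline sdg sdi sdw sdy" to fail the paths and "set_path_state running sdg sdi sdw sdy" to restore them.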
Before the patch, we got an incorrect no_path_retry_nr:
===
mpath5: Entering recovery mode: max_retries=60
===
After the patch:
mpath5: Entering recovery mode: max_retries=5
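
To compare the value across runs, the max_retries number can be pulled out of such log lines with a small filter. This is a sketch only: recovery_retries is a hypothetical helper name, and the sample input is just the two lines quoted above.

```shell
#!/bin/sh
# Print the max_retries value from each "Entering recovery mode" line on stdin.
recovery_retries() {
    sed -n 's/.*Entering recovery mode: max_retries=\([0-9][0-9]*\).*/\1/p'
}

# Sample input: the before-patch and after-patch lines from this test.
printf '%s\n' \
    'mpath5: Entering recovery mode: max_retries=60' \
    'mpath5: Entering recovery mode: max_retries=5' | recovery_retries
```

On a live system the filter would read from wherever multipathd logs, for example a syslog file; that location is an assumption and depends on the setup.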


For the PATH_GHOST issue, I was not able to reproduce it, as I lack a setup that uses one of the specified path checkers: "rdac, tur, and hp_sw path checkers".

I also tried changing the path_checker for DGC, but was still not able to reproduce it.

Can we get a partner to test the PATH_GHOST issue?

Comment 17 guy le gac 2011-06-21 13:14:31 UTC

I opened a ticket (Case 00484908) about a similar issue detected on a RHEL 5.6 x86_64 platform.

Please see the QLogic forum: http://solutions.qlogic.com/KanisaSupportSite/browse.do?BROWSE_forum.NodeType=leaf&WidgetName=BROWSE_forum&BROWSE_forum.thisPageUrl=%2Fforum%2Fforumshome.do&NodeType=leaf&NodeName=Fibre+Channel+Linux&TaxoName=FB_ForumBrowse&BROWSE_forum.NodeId=FB_HBA_LINUX_1_2&BROWSE_forum.IsForum=true&NodeId=FB_HBA_LINUX_1_2&id=m3&BROWSE_forum.TaxoName=FB_ForumBrowse&AppContext=AC_ForumCategoryPage

Comment 18 Gris Ge 2011-06-22 07:47:14 UTC

(In reply to comment #17)
> I opened a ticket Case 00484908 about a similar issue detected on RHEL 5.6
> X86_64 platform.
> 
> please See  Qlogic forum :
> http://solutions.qlogic.com/KanisaSupportSite/browse.do?BROWSE_forum.NodeType=leaf&WidgetName=BROWSE_forum&BROWSE_forum.thisPageUrl=%2Fforum%2Fforumshome.do&NodeType=leaf&NodeName=Fibre+Channel+Linux&TaxoName=FB_ForumBrowse&BROWSE_forum.NodeId=FB_HBA_LINUX_1_2&BROWSE_forum.IsForum=true&NodeId=FB_HBA_LINUX_1_2&id=m3&BROWSE_forum.TaxoName=FB_ForumBrowse&AppContext=AC_ForumCategoryPage

Does the rpm in comment #1 solve your problem?

Comment 22 Gris Ge 2011-06-23 09:21:26 UTC

Tried 100 times on EMC CX (active/passive setup) using the tur path_checker.

Could not hit the "active paths: <negative_number>" issue.

Basic function test will be noted in errata.

Sanity Only.

Comment 23 errata-xmlrpc 2011-07-21 08:23:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1032.html