Bug 994277

Summary: multipath: fix handling of transport-offline states
Product: Red Hat Enterprise Linux 6 Reporter: mchristie
Component: device-mapper-multipath Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA QA Contact: yanfu,wang <yanwang>
Severity: low Docs Contact:
Priority: unspecified    
Version: 6.5 CC: abisogia, acathrow, agk, bdonahue, bmarzins, dwysocha, heinzm, jraju, loberman, msnitzer, prajnoha, prockai, sauchter, xiaoli, yanwang, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: device-mapper-multipath-0.4.9-69.el6 Doc Type: Bug Fix
Doc Text:
Cause: Multipath was not reserving enough space to hold the "transport-offline" value when it checked a path's sysfs state. It was also running the checker on paths in the "quiesce" state.
Consequence: Multipath issued a warning message that it could not read the sysfs file for paths in the "transport-offline" state, and unnecessarily failed paths in the "quiesce" state.
Fix: Multipath now allocates enough space for the "transport-offline" state and sets paths in the "quiesce" state to pending.
Result: Multipath no longer issues warning messages for paths in the "transport-offline" state, and no longer fails paths in the "quiesce" state.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-21 07:51:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description mchristie 2013-08-06 23:39:17 UTC
Description of problem:

The iSCSI layer uses a long device state value, "transport-offline", and the multipath-tools code that reads the sysfs state does not reserve a buffer large enough to hold it. This is a request to bring in this patch:

https://www.redhat.com/archives/dm-devel/2013-February/msg00058.html

from upstream.

Without this patch, the logs fill up with messages about not being able to read the sysfs state file while the path is down.
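
For illustration only, here is a minimal sketch of the kind of change the upstream patch makes; this is not the actual multipath-tools source, and the function and constant names below are hypothetical. The point is that the buffer receiving the sysfs "state" attribute must be large enough for the longest value the kernel can report, "transport-offline"; a buffer sized only for shorter states such as "running" or "offline" triggers the overflow warning.

/* Hedged sketch, not the real libmultipath code: read a path's sysfs
 * "state" attribute into a buffer sized for the longest possible value. */
#include <stdio.h>
#include <string.h>

/* "transport-offline" is 17 characters; leave room for '\n' and '\0'. */
#define SYSFS_STATE_SIZE 20

static int read_sysfs_state(const char *attr_path, char *buf, size_t len)
{
        FILE *f = fopen(attr_path, "r");

        if (!f)
                return -1;
        if (!fgets(buf, len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
        return 0;
}

int main(void)
{
        char state[SYSFS_STATE_SIZE];
        const char *attr =
            "/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/state";

        if (read_sysfs_state(attr, state, sizeof(state)) == 0)
                printf("path state: %s\n", state);  /* e.g. "transport-offline" */
        return 0;
}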

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Ben Marzinski 2013-08-13 16:26:02 UTC
Patch applied. Thanks.

Comment 3 mchristie 2013-08-13 16:59:26 UTC
QA,

To test this, just log in to an iSCSI target, create a multipath device using the iSCSI paths, then pull a cable for longer than the iSCSI replacement/recovery timeout setting (the default is 2 minutes, but it can be changed in /etc/iscsi/iscsid.conf, or with the iscsiadm -m node -o update command for existing targets).

When the iSCSI replacement/recovery timeout has expired, you should see

session recovery timed out after %d secs

in /var/log/messages, and if you cat

/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/state

it will say transport-offline.


In /var/log/messages you will then see these messages start to appear:

Jul 17 10:00:52 IONr8RED2950 multipathd: overflow in attribute '/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/state'

With the fix those messages should not appear.
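
As a side note on the other half of the fix described in the Doc Text (paths in the "quiesce" state are reported as pending instead of being failed), a rough sketch of the intended mapping is below. The enum and function names are assumptions for illustration, not the actual multipath-tools identifiers.

#include <stdio.h>
#include <string.h>

enum path_status { STATUS_UP, STATUS_DOWN, STATUS_PENDING };

/* Hedged sketch: map a sysfs "state" string to a checker-style result. */
static enum path_status status_from_sysfs_state(const char *state)
{
        if (!strcmp(state, "quiesce"))
                return STATUS_PENDING;   /* report pending; do not fail the path */
        if (!strcmp(state, "offline") || !strcmp(state, "transport-offline"))
                return STATUS_DOWN;
        return STATUS_UP;                /* e.g. "running" */
}

int main(void)
{
        printf("running=%d quiesce=%d transport-offline=%d\n",
               status_from_sysfs_state("running"),
               status_from_sysfs_state("quiesce"),
               status_from_sysfs_state("transport-offline"));
        return 0;
}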

Comment 5 yanfu,wang 2013-10-14 07:23:33 UTC
Reproduced on device-mapper-multipath-0.4.9-64.el6:
Set up a multipath device on top of an iSCSI device:
[root@storageqe-17 ~]# multipath -l
mpathc (1IET     00010001) dm-6 IET,VIRTUAL-DISK
size=500M features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  `- 10:0:0:1 sdd 8:48 active undef running
...
[root@storageqe-17 ~]# cat /sys/devices/platform/host10/session3/target10\:0\:0/10\:0\:0\:1/state 
running

Update the iSCSI replacement/recovery timeout setting:
[root@storageqe-17 ~]# iscsiadm -m node -T iqn.2013-09.com.redhat:target1 |grep timeout
node.session.timeo.replacement_timeout = 120
[root@storageqe-17 ~]# iscsiadm -m node -T iqn.2013-09.com.redhat:target1  -o update -n node.session.timeo.replacement_timeout -v 180
[root@storageqe-17 ~]# iscsiadm -m node -T iqn.2013-09.com.redhat:target1 |grep timeout
node.session.timeo.replacement_timeout = 180

Take the network down on the iSCSI target:
[root@storageqe-19 ~]# /etc/init.d/network stop
Shutting down interface eth0:  [  OK  ]
Shutting down loopback interface:  [  OK  ]

When the iSCSI replacement/recovery timeout has expired, the messages below appear as expected:
Oct 14 03:09:04 storageqe-17 kernel: session3: session recovery timed out after 180 secs
Oct 14 03:09:05 storageqe-17 iscsid: connect to 10.16.67.51:3260 failed (No route to host)
Oct 14 03:09:05 storageqe-17 kernel: sd 10:0:0:1: rejecting I/O to offline device
Oct 14 03:09:05 storageqe-17 kernel: device-mapper: multipath: Failing path 8:48.
Oct 14 03:09:05 storageqe-17 multipathd: overflow in attribute '/sys/devices/platform/host10/session3/target10:0:0/10:0:0:1/state'
Oct 14 03:09:05 storageqe-17 multipathd: mpathc: sdd - directio checker reports path is down
Oct 14 03:09:05 storageqe-17 multipathd: checker failed path 8:48 in map mpathc
Oct 14 03:09:05 storageqe-17 multipathd: mpathc: remaining active paths: 0
Oct 14 03:09:10 storageqe-17 kernel: sd 10:0:0:1: rejecting I/O to offline device
Oct 14 03:09:10 storageqe-17 multipathd: overflow in attribute '/sys/devices/platform/host10/session3/target10:0:0/10:0:0:1/state'

[root@storageqe-17 ~]# cat /sys/devices/platform/host10/session3/target10\:0\:0/10\:0\:0\:1/state 
transport-offline

Verified on the fixed version; the problem above no longer occurs.

Comment 9 errata-xmlrpc 2013-11-21 07:51:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1574.html