Bug 996628 - device-mapper path mark as failed
Status: ASSIGNED
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: device-mapper-multipath
Version: 4.7
Hardware: x86_64 Linux
Severity: high
Target Milestone: rc
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Reported: 2013-08-13 10:39 EDT by tchek
Modified: 2014-01-08 13:31 EST
CC: 9 users

Doc Type: Bug Fix
Type: Bug

Attachments: None
Description tchek 2013-08-13 10:39:50 EDT
Description of problem:

Every minute I get I/O errors because a path fails, like this:

kernel: SCSI error : <2 0 1 0> return code = 0x6000000
Aug 13 15:22:03 PX1920PRD001 kernel: end_request: I/O error, dev sdh, sector 147415768
Aug 13 15:22:03 PX1920PRD001 kernel: device-mapper: dm-multipath: Failing path 8:112.
Aug 13 15:22:03 PX1920PRD001 kernel: end_request: I/O error, dev sdh, sector 147415776
Aug 13 15:22:03 PX1920PRD001 multipathd: 8:112: mark as failed
Aug 13 15:22:03 PX1920PRD001 multipathd: mpath1: remaining active paths: 3
Aug 13 15:22:33 PX1920PRD001 multipathd: 8:112: tur checker reports path is up
Aug 13 15:22:33 PX1920PRD001 multipathd: 8:112: reinstated
Aug 13 15:22:33 PX1920PRD001 multipathd: mpath1: remaining active paths: 4

The whole SAN fabric is OK; there are no errors on the switches or the storage array.

The application is slow because of these I/O errors.

/etc/multipath.conf

devnode_blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd|vd)[a-z]*"
        wwid 3600508e000000000715541cadb3ff004

}

## Use user friendly names, instead of using WWIDs as names.

defaults {
      polling_interval       30
      failback               immediate
      no_path_retry          10
      rr_min_io              100
      path_checker           tur
      user_friendly_names    yes
}


devices {
        device {
                vendor                  "IBM"
                product                 "2145"
                path_grouping_policy    group_by_prio
                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                features                "0"
        }
}


Version-Release number of selected component (if applicable):

RHEL 4.7

device-mapper-1.02.28-2.el4
device-mapper-1.02.28-2.el4
device-mapper-multipath-0.4.5-42.el4

Emulex LightPulse Fibre Channel SCSI driver 8.0.16.46
Firmware 2.01A9 (U2F2.01A9)

Storage array: IBM Storwize V7000, SVC 6.4




How reproducible:

Start the server and watch /var/log/messages.


Actual results:

kernel: SCSI error : <2 0 1 0> return code = 0x6000000
Aug 13 15:22:03 PX1920PRD001 kernel: end_request: I/O error, dev sdh, sector 147415768
Aug 13 15:22:03 PX1920PRD001 kernel: device-mapper: dm-multipath: Failing path 8:112.
Aug 13 15:22:03 PX1920PRD001 kernel: end_request: I/O error, dev sdh, sector 147415776
Aug 13 15:22:03 PX1920PRD001 multipathd: 8:112: mark as failed
Aug 13 15:22:03 PX1920PRD001 multipathd: mpath1: remaining active paths: 3
Aug 13 15:22:33 PX1920PRD001 multipathd: 8:112: tur checker reports path is up
Aug 13 15:22:33 PX1920PRD001 multipathd: 8:112: reinstated
Aug 13 15:22:33 PX1920PRD001 multipathd: mpath1: remaining active paths: 4

Expected results:

Paths stay up; no spurious path failures or I/O errors.
Additional info:

I tried setting features "0" in /etc/multipath.conf, but it has no effect (the map still shows [features="1 queue_if_no_path"]).
I also tried to fix it with dmsetup message mpath1 0 "fail_if_no_path", which sets [features="0"], but after 5 minutes it rolls back to [features="1 queue_if_no_path"].
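The two workarounds above can be sketched as a shell sequence. This is a hypothetical sketch, not a confirmed fix: the map name "mpath1" is taken from this report, root access is assumed, and the sample table line stands in for real `dmsetup table` output.

```shell
# Hedged sketch: disable queue_if_no_path on one map and watch whether
# multipathd rolls the change back (map name "mpath1" is from this report).
dmsetup message mpath1 0 fail_if_no_path    # turn off I/O queuing for the map

# Helper: does a dmsetup table line still show queuing enabled?
has_queueing() {
    echo "$1" | grep -q 'queue_if_no_path'
}

# After the next polling interval, capture the table again, e.g.:
#   table=$(dmsetup table mpath1)
# In this report the features field reverts after ~5 minutes:
table='0 524288000 multipath 1 queue_if_no_path 0 2 1 ...'
has_queueing "$table" && echo "queuing re-enabled by multipathd"
```

The rollback is plausibly expected behavior: when no_path_retry is set, multipathd manages the queuing feature itself and reapplies it when it reloads the map, overwriting manual dmsetup changes.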
Comment 1 Ben Marzinski 2013-10-22 12:24:39 EDT
Stopping queue_if_no_path shouldn't fix this issue, since you are never running out of paths. Can you show me the output of

# multipath -ll

The issue is that the TUR checker is reporting your path as working, but the IO is failing.  There are a number of reasons why this could happen.  It's possible that your storage hardware is active/passive, and it needs to be initialized before it can receive IO. This would require a hardware handler. 

It's also possible that the IO is failing for some reason unrelated to multipath. Unfortunately, in RHEL4, multipath has no way to determine whether an IO error should be retried on another path or simply passed up. This is not something that will be backported to RHEL4. To fix that, you need to upgrade, preferably to RHEL6.

Looking at the output of

# multipath -ll

will give me a better idea what could be wrong.
Comment 2 tchek 2013-10-25 04:14:27 EDT
Hi,

My problem is that paths sdg and sdh fail and reactivate all the time. We have tried changing the HBA card, but the problem persists.

 multipath -ll
mpath2 (36005076802810b23400000000000002d)
[size=700 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:1:1 sde 8:64  [active][ready]
 \_ 2:0:0:1 sdg 8:96  [failed][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:0:1 sdc 8:32  [active][ready]
 \_ 2:0:1:1 sdi 8:128 [active][ready]

mpath1 (36005076802810b23400000000000002c)
[size=250 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:0:0 sdb 8:16  [active][ready]
 \_ 2:0:1:0 sdh 8:112 [failed][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:1:0 sdd 8:48  [active][ready]
 \_ 2:0:0:0 sdf 8:80  [active][ready]

I can't upgrade to RHEL6 because my application is not compatible.
Comment 3 Ben Marzinski 2013-10-25 11:31:09 EDT
Can you try disabling multipath and running your application on one of the paths? If you get an IO error there, then the problem is that multipath is not correctly passing the error back up.

Is it just one of the paths that fails? Or do all the paths fail sometimes?

What kind of storage hardware are you using?

Is it possible to remove the multipath devices with

# multipath -F

and then re-add them with more verbosity

# multipath -v3

and then post the output so I can see how they get configured?
Comment 4 tchek 2014-01-08 07:46:08 EST
Hi,

I can't disable multipath, but I have run tests on each path:

The errors are on devices sdg, sdi, sdh, and sdf; these devices are on a port of HBA 2. I have changed the HBA, but I get the same error message.


The storage is an IBM V7000, version 6.4.


multipath -v3:


[root@PX1920PRD001 ~]# multipath -v3
load path identifiers cache
#
# all paths in cache :
#
36005076802810b23400000000000002c  1:0:0:0 sdb 8:16 50 [active] IBM     /2145
36005076802810b23400000000000002d  1:0:0:1 sdc 8:32 10 [active] IBM     /2145
36005076802810b23400000000000002c  1:0:1:0 sdd 8:48 10 [active] IBM     /2145
36005076802810b23400000000000002d  1:0:1:1 sde 8:64 50 [active] IBM     /2145
dm-0 blacklisted
dm-1 blacklisted
dm-2 blacklisted
dm-3 blacklisted
dm-4 blacklisted
dm-5 blacklisted
dm-6 blacklisted
dm-7 blacklisted
dm-8 blacklisted
dm-9 blacklisted
loop0 blacklisted
loop1 blacklisted
loop2 blacklisted
loop3 blacklisted
loop4 blacklisted
loop5 blacklisted
loop6 blacklisted
loop7 blacklisted
md0 blacklisted
ram0 blacklisted
ram10 blacklisted
ram11 blacklisted
ram12 blacklisted
ram13 blacklisted
ram14 blacklisted
ram15 blacklisted
ram1 blacklisted
ram2 blacklisted
ram3 blacklisted
ram4 blacklisted
ram5 blacklisted
ram6 blacklisted
ram7 blacklisted
ram8 blacklisted
ram9 blacklisted
path sda not found in pathvec

===== path info sda (mask 0x1f) =====
bus = 1
dev_t = 8:0
size = 285155328
vendor = LSILOGIC
product = Logical Volume
rev = 3000
h:b:t:l = 0:0:1:0
tgt_node_name =
serial =
path checker = tur (internal default)
state = 2
getprio = (null) (internal default)
prio = 1
getuid = /sbin/scsi_id -g -u -s /block/%n (internal default)
uid = 3600508e000000000715541cadb3ff004 (callout)
===== path info sdb (mask 0x1f) =====
bus = 1
dev_t = 8:16
size = 524288000
vendor = IBM
product = 2145
rev = 0000
h:b:t:l = 1:0:0:0
tgt_node_name = 0x500507680205cf9b

serial = 0200a042c8d0XX00
path checker = tur (internal default)
state = 2
getprio = /sbin/mpath_prio_alua /dev/%n (controler setting)
prio = 50
uid = 36005076802810b23400000000000002c (cache)
===== path info sdc (mask 0x1f) =====
bus = 1
dev_t = 8:32
size = 1468006400
vendor = IBM
product = 2145
rev = 0000
h:b:t:l = 1:0:0:1
tgt_node_name = 0x500507680205cf9b

serial = 0200a042c8d0XX00
path checker = tur (internal default)
state = 2
getprio = /sbin/mpath_prio_alua /dev/%n (controler setting)
prio = 10
uid = 36005076802810b23400000000000002d (cache)
===== path info sdd (mask 0x1f) =====
bus = 1
dev_t = 8:48
size = 524288000
vendor = IBM
product = 2145
rev = 0000
h:b:t:l = 1:0:1:0
tgt_node_name = 0x500507680205cf9c

serial = 0200a042c8d0XX00
path checker = tur (internal default)
state = 2
getprio = /sbin/mpath_prio_alua /dev/%n (controler setting)
prio = 10
uid = 36005076802810b23400000000000002c (cache)
===== path info sde (mask 0x1f) =====
bus = 1
dev_t = 8:64
size = 1468006400
vendor = IBM
product = 2145
rev = 0000
h:b:t:l = 1:0:1:1
tgt_node_name = 0x500507680205cf9c

serial = 0200a042c8d0XX00
path checker = tur (internal default)
state = 2
getprio = /sbin/mpath_prio_alua /dev/%n (controler setting)
prio = 50
uid = 36005076802810b23400000000000002d (cache)
sdf blacklisted
sdg blacklisted
sdh blacklisted
sdi blacklisted
sr0 blacklisted
#
# all paths :
#
36005076802810b23400000000000002c  1:0:0:0 sdb 8:16 50 [active][ready] IBM
36005076802810b23400000000000002d  1:0:0:1 sdc 8:32 10 [active][ready] IBM
36005076802810b23400000000000002c  1:0:1:0 sdd 8:48 10 [active][ready] IBM
36005076802810b23400000000000002d  1:0:1:1 sde 8:64 50 [active][ready] IBM
3600508e000000000715541cadb3ff004  0:0:1:0 sda 8:0 1 [ready] LSILOGIC/Logical
params = 1 queue_if_no_path 0 2 1 round-robin 0 1 1 8:64 100 round-robin 0 1 1 8:32 100
status = 2 0 0 0 2 1 A 0 1 0 8:64 A 0 E 0 1 0 8:32 A 0
params = 1 queue_if_no_path 0 2 1 round-robin 0 1 1 8:16 100 round-robin 0 1 1 8:48 100
status = 2 0 0 0 2 1 A 0 1 0 8:16 A 0 E 0 1 0 8:48 A 0
Found matching wwid [36005076802810b23400000000000002c] in bindings file.
Setting alias to mpath1
pgpolicy = group_by_prio (controler setting)
selector = round-robin 0 (internal default)
features = 0 (controler setting)
hwhandler = 0 (internal default)
rr_weight = 1 (internal default)
rr_min_io = 100 (config file default)
no_path_retry = 5 (config file default)
pg_timeout = NONE (internal default)
0 524288000 multipath 0 0 2 1 round-robin 0 1 1 8:16 100 round-robin 0 1 1 8:48 100
set ACT_NOTHING: map unchanged
Found matching wwid [36005076802810b23400000000000002d] in bindings file.
Setting alias to mpath2
pgpolicy = group_by_prio (controler setting)
selector = round-robin 0 (internal default)
features = 0 (controler setting)
hwhandler = 0 (internal default)
rr_weight = 1 (internal default)
rr_min_io = 100 (config file default)
no_path_retry = 5 (config file default)
pg_timeout = NONE (internal default)
0 1468006400 multipath 0 0 2 1 round-robin 0 1 1 8:64 100 round-robin 0 1 1 8:32 100
set ACT_NOTHING: map unchanged
3600508e000000000715541cadb3ff004 blacklisted
Comment 5 Ben Marzinski 2014-01-08 13:31:48 EST
So, your configuration looks fine.  If you hadn't told me that you already changed your HBA, that's what I would have advised you to try.

When you said

> I can't disable multipath but I have done test on each path :

> and error is on device sdg, sdi, sdh and sdf. these device is on port of HBA 2.
> I have change HBA but same error message.

What do you mean? Did you try IO directly to these paths, by doing something like

# dd if=/dev/sdd of=/dev/null iflag=direct bs=1M count=1

where you try to read directly from the block device? If you did something like this and it's failing, then your issue is that, for some reason, your HBA is able to talk to the controller and get a Test-Unit-Ready response, but it isn't able to handle IO.

Ideally, you would try this with a LUN that isn't in use, so that you could try both reading and writing to it.

If direct reads are failing but manually issuing a test-unit-ready returns success

# sg_turs /dev/sdd; echo $?

then you could add the following to the devices section of your /etc/multipath.conf to work around the issue:

devices {
    device {
        vendor "IBM"
        product "^2145"
        path_checker directio
    }
}

This would tell multipath to check the devices by doing a directio read from them, instead of using the test-unit-ready command.
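To confirm the diagnosis before switching checkers, the two views can be compared directly for one path. A hedged sketch, assuming root access and the sg3_utils package; the device name /dev/sdh and the sg_turs and dd commands are taken from this report:

```shell
dev=/dev/sdh    # one of the failing paths from this report

# TUR view: the SCSI Test-Unit-Ready command, as the tur checker uses
if sg_turs "$dev"; then
    echo "TUR: controller reports the path ready"
fi

# directio view: a real read through the full I/O path,
# roughly what the directio checker does
if dd if="$dev" of=/dev/null iflag=direct bs=4k count=1 2>/dev/null; then
    echo "directio: read succeeded"
else
    echo "directio: read failed"   # matches the dm-multipath path failures
fi
```

If TUR succeeds while the direct read fails, the directio checker should keep the path marked failed instead of letting it flap between failed and reinstated every polling interval.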
