Bug 1554516 - multipathd should show per-disk path_faults
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Lin Li
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1627884
 
Reported: 2018-03-12 20:18 UTC by Tony Hutter
Modified: 2018-10-30 11:28 UTC
CC: 8 users

New `%0` wildcard added for the "multipathd show paths format" command to show path failures

The "multipathd show paths format" command now supports the `%0` wildcard to display path failures. Support for this wildcard makes it easier for users to track which paths have been failing in a multipath device.
Clone Of:
Clones: 1627884
Last Closed: 2018-10-30 11:27:28 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3236 None None None 2018-10-30 11:28 UTC

Description Tony Hutter 2018-03-12 20:18:26 UTC
Description of problem:

Is there a way to view path_faults on a per-drive basis?  For example, I see my mpath device has 4 path faults:

[root@jet21:~]# multipathd list multipaths stats
name              path_faults switch_grp map_loads total_q_time q_timeouts
35000c50084fc07f7 4           0          2         0            0         
   
... but I'd like to see which of these two drives was faulting:

[root@jet21:~]# multipath -ll        
35000c50084fc07f7 dm-138 SEAGATE ,STxxxxxxxx   
size=7.3T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:144:0  sdem 128:224 active ready running
  `- 11:0:144:0 sdkq 66:480  active ready running


This would be very useful to help us diagnose which drives are failing.

Version-Release number of selected component (if applicable):
RHEL 7.4


Comment 2 Ben Marzinski 2018-03-13 20:11:21 UTC
Currently, the only way to get this information is to run "dmsetup status", and grab the information from there.

# dmsetup status mpathc
0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1

This is a multipath device with two paths, 8:32 and 8:64. The letter immediately following the path major:minor tells the path state (either A for active or F for failed). Both paths are currently active. The number after that is the number of times the path has failed. So, 8:32 has never failed, and 8:64 has failed one time.
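Those per-path failure counts can be scraped out of that status line with a short script. The sketch below is illustrative only: it assumes the field layout described above (feature args, handler args, group count, then per-group and per-path fields), and the selector-argument counts can differ between path selectors, so it trusts the counts embedded in the line rather than hardcoding them.

```python
def path_failures(status_line):
    """Parse a 'dmsetup status <map>' multipath line and return a list of
    (major:minor, path_state, fail_count) tuples.

    Assumed layout after the 'multipath' token:
      <#feat> <feats...> <#handler> <handler...> <#groups> <current_group>
      per group: <state A|D|E> <#selector_args> <selector_args...>
                 <#paths> <#args_per_path>
      per path:  <major:minor> <A|F> <fail_count> <selector_args...>
    """
    tok = status_line.split()
    i = tok.index("multipath") + 1
    n_feat = int(tok[i]); i += 1 + n_feat        # skip feature args
    n_hand = int(tok[i]); i += 1 + n_hand        # skip hardware-handler args
    n_groups = int(tok[i]); i += 2               # skip group count + current group
    paths = []
    for _ in range(n_groups):
        i += 1                                   # group state (A/D/E)
        n_ps = int(tok[i]); i += 1 + n_ps        # skip per-group selector args
        n_paths = int(tok[i]); i += 1
        n_path_args = int(tok[i]); i += 1        # selector args trailing each path
        for _ in range(n_paths):
            dev, state, fails = tok[i], tok[i + 1], int(tok[i + 2])
            paths.append((dev, state, fails))
            i += 3 + n_path_args
    return paths


# Sample line from above: 8:32 has never failed, 8:64 has failed once.
line = "0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1"
print(path_failures(line))  # [('8:32', 'A', 0), ('8:64', 'A', 1)]
```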

I can add this information to the path format wildcards (probably as %x, to match the multipath wildcard), so that you could display it with multipathd's formatted output, using something like:

# multipathd show paths format "%d %x"

Comment 3 Tony Hutter 2018-03-13 20:42:12 UTC
> I can add this information to the path format wildcards (probably as %x, to
> match the multipath wildcard, so that you could display it with multipathd's
> formatted output, using something like 

That would be amazing!  I see that 'multipath' has both:

%x  failures
%0  path_faults

Could you tell me the difference between the two?  My guess is that we'd be interested in both.

Background:  We regularly monitor multipath stats with Splunk, and this would make it easy to see which of the SAS links to our disks is bad.  I'd also like to integrate it into the ZFS commands so that we could view mpath stats inline with the disk status, similar to what we've done with other stats:

https://github.com/zfsonlinux/zfs/pull/7245
https://github.com/zfsonlinux/zfs/pull/7178

Comment 4 Ben Marzinski 2018-03-14 18:37:32 UTC
failures is the number of times a multipath device has lost all of its paths and stopped queueing IO. These are cases where IO going to the multipath device could get failed up to a higher layer, such as the filesystem.

path_faults is the number of times that any of a multipath device's paths has switched from the active to the failed state. If you think %0 would make more sense for the path's failure wildcard, I'm fine with using either.

Comment 5 Tony Hutter 2018-03-14 18:48:07 UTC
Yeah, it sounds like path_faults (%0) would be better for 'paths', since it is tracked per slave path.

Comment 6 Ben Marzinski 2018-06-13 21:42:15 UTC
Path failures are now viewable using the %0 wildcard.

Comment 8 Lin Li 2018-08-09 07:04:54 UTC
Verified on device-mapper-multipath-0.4.9-122.el7
1, [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-122.el7.x86_64
device-mapper-multipath-0.4.9-122.el7.x86_64

2,[root@storageqe-06 ~]# multipath -ll
360a98000324669436c2b45666c56786d dm-2 NETAPP  ,LUN             
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:0 sdl 8:176 active ready running
| `- 4:0:1:0 sdg 8:96  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:0 sdq 65:0  active ready running
  `- 4:0:0:0 sdb 8:16  active ready running
360a98000324669436c2b45666c567875 dm-0 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:4 sdp 8:240 active ready running
| `- 4:0:1:4 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:4 sdu 65:64 active ready running
  `- 4:0:0:4 sdf 8:80  active ready running
360a98000324669436c2b45666c567873 dm-1 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:3 sdo 8:224 active ready running
| `- 4:0:1:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:3 sdt 65:48 active ready running
  `- 4:0:0:3 sde 8:64  active ready running
360a98000324669436c2b45666c567871 dm-3 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:2 sdn 8:208 active ready running
| `- 4:0:1:2 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:2 sds 65:32 active ready running
  `- 4:0:0:2 sdd 8:48  active ready running
360a98000324669436c2b45666c56786f dm-6 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdm 8:192 active ready running
| `- 4:0:1:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:1 sdr 65:16 active ready running
  `- 4:0:0:1 sdc 8:32  active ready running

3, [root@storageqe-06 ~]# multipathd list multipaths stats
name                              path_faults switch_grp map_loads total_q_time q_timeouts
360a98000324669436c2b45666c56786d 0           0          1         0            0         
360a98000324669436c2b45666c56786f 0           0          1         0            0         
360a98000324669436c2b45666c567871 0           0          1         0            0         
360a98000324669436c2b45666c567873 0           0          1         0            0         
360a98000324669436c2b45666c567875 0           0          1         0            0   

4,[root@storageqe-06 ~]# multipathd show paths format "%0"
failures
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0

Comment 13 errata-xmlrpc 2018-10-30 11:27:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3236

