Bug 1554516

Summary: multipathd should show per-disk path_faults
Product: Red Hat Enterprise Linux 7
Reporter: Tony Hutter <hutter2>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: Lin Li <lilin>
Severity: low
Docs Contact: Steven J. Levine <slevine>
Priority: unspecified
Version: 7.4
CC: agk, bmarzins, heinzm, jbrassow, lilin, msnitzer, prajnoha, rhandlin
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: device-mapper-multipath-0.4.9-120.el7
Doc Type: Release Note
Doc Text:
New `%0` wildcard added for the "multipathd show paths format" command to show path failures

The "multipathd show paths format" command now supports the `%0` wildcard to display path failures. Support for this wildcard makes it easier for users to track which paths have been failing in a multipath device.
Story Points: ---
Clone Of: ---
Environment:
Last Closed: 2018-10-30 11:27:28 UTC
Type: Bug
Bug Depends On:
Bug Blocks: 1627884 (view as bug list)

Description Tony Hutter 2018-03-12 20:18:26 UTC
Description of problem:

Is there a way to view path_faults on a per-drive basis?  For example, I see my mpath has 4 path faults:

[root@jet21:~]# multipathd list multipaths stats
name              path_faults switch_grp map_loads total_q_time q_timeouts
35000c50084fc07f7 4           0          2         0            0         
   
... but I'd like to see which of these two drives was faulting:

[root@jet21:~]# multipath -ll        
35000c50084fc07f7 dm-138 SEAGATE ,STxxxxxxxx   
size=7.3T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:144:0  sdem 128:224 active ready running
  `- 11:0:144:0 sdkq 66:480  active ready running


This would be very useful to help us diagnose which drives are failing.

Version-Release number of selected component (if applicable):
RHEL 7.4

Comment 2 Ben Marzinski 2018-03-13 20:11:21 UTC
Currently, the only way to get this information is to run "dmsetup status" and pull it out of the output.

# dmsetup status mpathc
0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1

This is a multipath device with two paths, 8:32 and 8:64. The letter immediately following the path major:minor tells the path state (either A for active or F for failed). Both paths are currently active. The number after that is the number of times the path has failed. So, 8:32 has never failed, and 8:64 has failed one time.
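
For scripting, those per-path counters can be pulled out of the status line with a small filter. A rough sketch, assuming the path major:minor pairs are the only colon-separated tokens in the line (which holds for the output above):

# dmsetup status mpathc | awk '{for (i = 4; i <= NF; i++) if ($i ~ /^[0-9]+:[0-9]+$/) print $i, $(i+1), $(i+2)}'
8:32 A 0
8:64 A 1

Each output line gives the path's major:minor, its current state (A or F), and its failure count.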

I can add this information to the path format wildcards (probably as %x, to match the multipath wildcard), so that you could display it with multipathd's formatted output using something like

# multipathd show paths format "%d %x"

Comment 3 Tony Hutter 2018-03-13 20:42:12 UTC
> I can add this information to the path format wildcards (probably as %x, to
> match the multipath wildcard, so that you could display it with multipathd's
> formatted output, using something like 

That would be amazing!  I see that 'multipath' has both:

%x  failures
%0  path_faults

Could you tell me the difference between the two?  My guess is that we'd be interested in both.

Background:  We regularly monitor multipath stats with Splunk, and this would make it easy to see which of the SAS links to our disks is bad.  I'd also like to integrate it into the ZFS commands so that we could view mpath stats inline with the disk status, similar to what we've done with other stats:

https://github.com/zfsonlinux/zfs/pull/7245
https://github.com/zfsonlinux/zfs/pull/7178

Comment 4 Ben Marzinski 2018-03-14 18:37:32 UTC
failures is the number of times a multipath device has lost all of its paths and stopped queueing IO.  These are cases where IO going to the multipath device could get failed up to a higher layer, such as the filesystem.

path_faults is the number of times that any of a multipath device's paths has switched from the active to the failed state. If you think %0 would make more sense for the path's failure wildcard, I'm fine with using either.
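
For reference, the two map-level counters can be put side by side with multipathd's formatted output (a sketch using the map wildcards Tony quoted above, plus %n for the map name, which comes from the same multipath format list):

# multipathd show maps format "%n %x %0"

Here %x is the number of times the whole map lost all usable paths, and %0 is the number of individual path failovers the map has seen.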

Comment 5 Tony Hutter 2018-03-14 18:48:07 UTC
Yeah, it sounds like path_faults (%0) would be better for 'paths', since it is tracked per slave path.

Comment 6 Ben Marzinski 2018-06-13 21:42:15 UTC
Path failures are now viewable using the %0 wildcard.
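
For example, to pair each path's device name with its individual failure count (using the %d device wildcard mentioned in comment 2 together with the new %0):

# multipathd show paths format "%d %0"

The verification in comment 8 below exercises the %0 wildcard on its own.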

Comment 8 Lin Li 2018-08-09 07:04:54 UTC
Verified on device-mapper-multipath-0.4.9-122.el7
1. [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-122.el7.x86_64
device-mapper-multipath-0.4.9-122.el7.x86_64

2. [root@storageqe-06 ~]# multipath -ll
360a98000324669436c2b45666c56786d dm-2 NETAPP  ,LUN             
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:0 sdl 8:176 active ready running
| `- 4:0:1:0 sdg 8:96  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:0 sdq 65:0  active ready running
  `- 4:0:0:0 sdb 8:16  active ready running
360a98000324669436c2b45666c567875 dm-0 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:4 sdp 8:240 active ready running
| `- 4:0:1:4 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:4 sdu 65:64 active ready running
  `- 4:0:0:4 sdf 8:80  active ready running
360a98000324669436c2b45666c567873 dm-1 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:3 sdo 8:224 active ready running
| `- 4:0:1:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:3 sdt 65:48 active ready running
  `- 4:0:0:3 sde 8:64  active ready running
360a98000324669436c2b45666c567871 dm-3 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:2 sdn 8:208 active ready running
| `- 4:0:1:2 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:2 sds 65:32 active ready running
  `- 4:0:0:2 sdd 8:48  active ready running
360a98000324669436c2b45666c56786f dm-6 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdm 8:192 active ready running
| `- 4:0:1:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:1 sdr 65:16 active ready running
  `- 4:0:0:1 sdc 8:32  active ready running

3. [root@storageqe-06 ~]# multipathd list multipaths stats
name                              path_faults switch_grp map_loads total_q_time q_timeouts
360a98000324669436c2b45666c56786d 0           0          1         0            0         
360a98000324669436c2b45666c56786f 0           0          1         0            0         
360a98000324669436c2b45666c567871 0           0          1         0            0         
360a98000324669436c2b45666c567873 0           0          1         0            0         
360a98000324669436c2b45666c567875 0           0          1         0            0   

4. [root@storageqe-06 ~]# multipathd show paths format "%0"
failures
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0

Comment 13 errata-xmlrpc 2018-10-30 11:27:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3236