Bug 1554516 - multipathd should show per-disk path_faults
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: device-mapper-multipath
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
Assignee: Ben Marzinski
QA Contact: Lin Li
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1627884
 
Reported: 2018-03-12 20:18 UTC by Tony Hutter
Modified: 2018-10-30 11:28 UTC
CC: 8 users

New `%0` wildcard added for the "multipathd show paths format" command to show path failures

The "multipathd show paths format" command now supports the `%0` wildcard to display path failures. Support for this wildcard makes it easier for users to track which paths have been failing in a multipath device.
Clone Of:
Clones: 1627884
Last Closed: 2018-10-30 11:27:28 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3236 None None None 2018-10-30 11:28 UTC

Description Tony Hutter 2018-03-12 20:18:26 UTC
Description of problem:

Is there a way to view path_faults on a per-drive basis?  For example, I see my mpath device has 4 path faults:

[root@jet21:~]# multipathd list multipaths stats
name              path_faults switch_grp map_loads total_q_time q_timeouts
35000c50084fc07f7 4           0          2         0            0         
   
... but I'd like to see which of these two drives was faulting:

[root@jet21:~]# multipath -ll        
35000c50084fc07f7 dm-138 SEAGATE ,STxxxxxxxx   
size=7.3T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:144:0  sdem 128:224 active ready running
  `- 11:0:144:0 sdkq 66:480  active ready running


This would be very useful to help us diagnose which drives are failing.

Version-Release number of selected component (if applicable):
RHEL 7.4


Comment 2 Ben Marzinski 2018-03-13 20:11:21 UTC
Currently, the only way to get this information is to run "dmsetup status", and grab the information from there.

# dmsetup status mpathc
0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1

This is a multipath device with two paths, 8:32 and 8:64. The letter immediately following the path major:minor tells the path state (either A for active or F for failed). Both paths are currently active. The number after that is the number of times the path has failed. So, 8:32 has never failed, and 8:64 has failed one time.
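Those per-path failure counts can be scraped out of that status line with a short script. The sketch below is illustrative only: it assumes the field layout described above (feature args, handler args, group count, then per-group and per-path fields), and the selector-argument counts can differ between path selectors, so it trusts the counts embedded in the line rather than hardcoding them.

```python
def path_failures(status_line):
    """Parse a 'dmsetup status <map>' multipath line and return a list of
    (major:minor, path_state, fail_count) tuples.

    Assumed layout after the 'multipath' token:
      <#feat> <feats...> <#handler> <handler...> <#groups> <current_group>
      per group: <state A|D|E> <#selector_args> <selector_args...>
                 <#paths> <#args_per_path>
      per path:  <major:minor> <A|F> <fail_count> <selector_args...>
    """
    tok = status_line.split()
    i = tok.index("multipath") + 1
    n_feat = int(tok[i]); i += 1 + n_feat        # skip feature args
    n_hand = int(tok[i]); i += 1 + n_hand        # skip hardware-handler args
    n_groups = int(tok[i]); i += 2               # skip group count + current group
    paths = []
    for _ in range(n_groups):
        i += 1                                   # group state (A/D/E)
        n_ps = int(tok[i]); i += 1 + n_ps        # skip per-group selector args
        n_paths = int(tok[i]); i += 1
        n_path_args = int(tok[i]); i += 1        # selector args trailing each path
        for _ in range(n_paths):
            dev, state, fails = tok[i], tok[i + 1], int(tok[i + 2])
            paths.append((dev, state, fails))
            i += 3 + n_path_args
    return paths


# Sample line from above: 8:32 has never failed, 8:64 has failed once.
line = "0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1"
print(path_failures(line))  # [('8:32', 'A', 0), ('8:64', 'A', 1)]
```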

I can add this information to the path format wildcards (probably as %x, to match the multipath wildcard), so that you could display it with multipathd's formatted output, using something like:

# multipathd show paths format "%d %x"

Comment 3 Tony Hutter 2018-03-13 20:42:12 UTC
> I can add this information to the path format wildcards (probably as %x, to
> match the multipath wildcard, so that you could display it with multipathd's
> formatted output, using something like 

That would be amazing!  I see that 'multipath' has both:

%x  failures
%0  path_faults

Could you tell me the difference between the two?  My guess is that we'd be interested in both.

Background:  We regularly monitor multipath stats with Splunk, and this would make it easy to see which of the SAS links to our disks is bad.  I'd also like to integrate it into the ZFS commands so that we could view mpath stats inline with the disk status, similar to what we've done with other stats:

https://github.com/zfsonlinux/zfs/pull/7245
https://github.com/zfsonlinux/zfs/pull/7178

Comment 4 Ben Marzinski 2018-03-14 18:37:32 UTC
failures is the number of times a multipath device has lost all of its paths and stopped queueing IO. These are cases where IO going to the multipath device could get failed up to a higher layer, such as the filesystem.

path_faults is the number of times that any of a multipath device's paths has switched from the active to the failed state. If you think %0 would make more sense for the path's failure wildcard, I'm fine with using either.

Comment 5 Tony Hutter 2018-03-14 18:48:07 UTC
Yeah, it sounds like path_faults (%0) would be better for 'paths', since it is tracked per slave path.

Comment 6 Ben Marzinski 2018-06-13 21:42:15 UTC
Path failures are now viewable using the %0 wildcard.

Comment 8 Lin Li 2018-08-09 07:04:54 UTC
Verified on device-mapper-multipath-0.4.9-122.el7
1, [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-122.el7.x86_64
device-mapper-multipath-0.4.9-122.el7.x86_64

2,[root@storageqe-06 ~]# multipath -ll
360a98000324669436c2b45666c56786d dm-2 NETAPP  ,LUN             
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:0 sdl 8:176 active ready running
| `- 4:0:1:0 sdg 8:96  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:0 sdq 65:0  active ready running
  `- 4:0:0:0 sdb 8:16  active ready running
360a98000324669436c2b45666c567875 dm-0 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:4 sdp 8:240 active ready running
| `- 4:0:1:4 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:4 sdu 65:64 active ready running
  `- 4:0:0:4 sdf 8:80  active ready running
360a98000324669436c2b45666c567873 dm-1 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:3 sdo 8:224 active ready running
| `- 4:0:1:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:3 sdt 65:48 active ready running
  `- 4:0:0:3 sde 8:64  active ready running
360a98000324669436c2b45666c567871 dm-3 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:2 sdn 8:208 active ready running
| `- 4:0:1:2 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:2 sds 65:32 active ready running
  `- 4:0:0:2 sdd 8:48  active ready running
360a98000324669436c2b45666c56786f dm-6 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdm 8:192 active ready running
| `- 4:0:1:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:1 sdr 65:16 active ready running
  `- 4:0:0:1 sdc 8:32  active ready running

3, [root@storageqe-06 ~]# multipathd list multipaths stats
name                              path_faults switch_grp map_loads total_q_time q_timeouts
360a98000324669436c2b45666c56786d 0           0          1         0            0         
360a98000324669436c2b45666c56786f 0           0          1         0            0         
360a98000324669436c2b45666c567871 0           0          1         0            0         
360a98000324669436c2b45666c567873 0           0          1         0            0         
360a98000324669436c2b45666c567875 0           0          1         0            0   

4,[root@storageqe-06 ~]# multipathd show paths format "%0"
failures
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0       
0

Comment 13 errata-xmlrpc 2018-10-30 11:27:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3236

