Description of problem:

Is there a way to view path_faults on a per-drive basis? For example, I see my mpath has 4 path faults:

[root@jet21:~]# multipathd list multipaths stats
name              path_faults switch_grp map_loads total_q_time q_timeouts
35000c50084fc07f7 4           0          2         0            0
...

but I'd like to see which of these two drives was faulting:

[root@jet21:~]# multipath -ll
35000c50084fc07f7 dm-138 SEAGATE ,STxxxxxxxx
size=7.3T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 0:0:144:0  sdem 128:224 active ready running
  `- 11:0:144:0 sdkq 66:480  active ready running

This would be very useful to help us diagnose which drives are failing.

Version-Release number of selected component (if applicable):
RHEL 7.4

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Currently, the only way to get this information is to run "dmsetup status" and grab the information from there:

# dmsetup status mpathc
0 488120320 multipath 2 0 0 0 2 1 A 0 1 2 8:32 A 0 0 1 E 0 1 2 8:64 A 1 0 1

This is a multipath device with two paths, 8:32 and 8:64. The letter immediately following the path major:minor tells the path state (either A for active or F for failed); both paths are currently active. The number after that is the number of times the path has failed, so 8:32 has never failed and 8:64 has failed one time.

I can add this information to the path format wildcards (probably as %x, to match the multipath wildcard), so that you could display it with multipathd's formatted output, using something like

# multipathd show paths format "%d %x"
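In the meantime, one rough way to pull the per-path fail counts out of that status line is to scan it for major:minor tokens and print the two fields that follow each one (state and fail count). This is only a sketch against the mpathc example above, relying on the field positions described here rather than the full multipath status grammar:

dmsetup status mpathc | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9]+:[0-9]+$/)
            printf "%s state=%s fail_count=%s\n", $i, $(i+1), $(i+2)
}'

which, for the example above, would print something like:

8:32 state=A fail_count=0
8:64 state=A fail_count=1

Mapping the major:minor numbers back to device nodes can then be done with something like "lsblk -o NAME,MAJ:MIN".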
> I can add this information to the path format wildcards (probably as %x, to
> match the multipath wildcard, so that you could display it with multipathd's
> formatted output, using something like

That would be amazing! I see that 'multipath' has both:

%x  failures
%0  path_faults

Could you tell me the difference between the two? My guess is that we'd be interested in both.

Background: we regularly monitor multipath stats with splunk, and this would make it easy to see which of the SAS links to our disks is bad. I'd also like to integrate it into the ZFS commands so that we could view mpath stats inline with the disk status, similar to what we've done with other stats:

https://github.com/zfsonlinux/zfs/pull/7245
https://github.com/zfsonlinux/zfs/pull/7178
"failures" tells the number of times a multipath device has lost all of its paths and stopped queueing IO; these are the cases where IO to the multipath device could get failed up to a higher layer, such as the filesystem. "path_faults" is the number of times that any of a multipath device's paths has switched from the active to the failed state.

If you think %0 would make more sense for the path's failure wildcard, I'm fine with using either.
Yeah, it sounds like path_faults (%0) would be the better fit for 'paths', since it is tracked per slave path.
Path failures are now viewable using the %0 wildcard.
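For example, combining it with the %d device wildcard mentioned earlier gives one line per path:

multipathd show paths format "%d %0"

and, to report only the paths that have actually faulted (skipping the header line), something like the following sketch; it assumes the two-column layout above, so adjust the awk filter if the output differs:

multipathd show paths format "%d %0" | awk 'NR > 1 && $2 > 0'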
Verified on device-mapper-multipath-0.4.9-122.el7

1. [root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-122.el7.x86_64
device-mapper-multipath-0.4.9-122.el7.x86_64

2. [root@storageqe-06 ~]# multipath -ll
360a98000324669436c2b45666c56786d dm-2 NETAPP ,LUN
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:0 sdl 8:176 active ready running
| `- 4:0:1:0 sdg 8:96  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:0 sdq 65:0  active ready running
  `- 4:0:0:0 sdb 8:16  active ready running
360a98000324669436c2b45666c567875 dm-0 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:4 sdp 8:240 active ready running
| `- 4:0:1:4 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:4 sdu 65:64 active ready running
  `- 4:0:0:4 sdf 8:80  active ready running
360a98000324669436c2b45666c567873 dm-1 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:3 sdo 8:224 active ready running
| `- 4:0:1:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:3 sdt 65:48 active ready running
  `- 4:0:0:3 sde 8:64  active ready running
360a98000324669436c2b45666c567871 dm-3 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:2 sdn 8:208 active ready running
| `- 4:0:1:2 sdi 8:128 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:2 sds 65:32 active ready running
  `- 4:0:0:2 sdd 8:48  active ready running
360a98000324669436c2b45666c56786f dm-6 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdm 8:192 active ready running
| `- 4:0:1:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:1 sdr 65:16 active ready running
  `- 4:0:0:1 sdc 8:32  active ready running

3. [root@storageqe-06 ~]# multipathd list multipaths stats
name                              path_faults switch_grp map_loads total_q_time q_timeouts
360a98000324669436c2b45666c56786d 0           0          1         0            0
360a98000324669436c2b45666c56786f 0           0          1         0            0
360a98000324669436c2b45666c567871 0           0          1         0            0
360a98000324669436c2b45666c567873 0           0          1         0            0
360a98000324669436c2b45666c567875 0           0          1         0            0

4. [root@storageqe-06 ~]# multipathd show paths format "%0"
failures
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3236