Bug 1699486 - Multipath is not grouping paths correctly by prio on controller failover
| Field | Value |
|---|---|
| Summary | Multipath is not grouping paths correctly by prio on controller failover |
| Product | Red Hat Enterprise Linux 7 |
| Reporter | shiva merla <shivakrishna.merla> |
| Component | device-mapper-multipath |
| Assignee | Ben Marzinski <bmarzins> |
| Status | CLOSED ERRATA |
| QA Contact | Lin Li <lilin> |
| Severity | urgent |
| Priority | urgent |
| Version | 7.5 |
| CC | agk, bmarzins, heinzm, jkachuck, lilin, mknutson, msnitzer, phinchman, prajnoha, rhandlin |
| Target Milestone | rc |
| Target Release | 7.8 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | device-mapper-multipath-0.4.9-128.el7 |
| Doc Type | Bug Fix |
| Doc Text | Cause: multipath was not reloading the device table when no usable paths existed. Consequence: multipath did not regroup paths correctly in response to changes that occurred while all paths were down. Fix: multipath now correctly reloads the device table for devices with no usable paths. Result: multipath groups paths correctly, even when all paths fail temporarily. |
| Story Points | --- |
| Last Closed | 2020-03-31 19:47:09 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1689420, 1715931, 1754591 |
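For anyone checking whether a given RHEL 7 host already carries this fix, the "Fixed In Version" field above is the build to compare against. The following illustrative commands assume the package changelog cites the Bugzilla ID, as RHEL changelogs usually do:

```
# Compare the installed build against the fixed-in version (0.4.9-128.el7 or later)
rpm -q device-mapper-multipath

# RHEL package changelogs usually reference the Bugzilla number for each fix
rpm -q --changelog device-mapper-multipath | grep -i 1699486
```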
copied sosreport to ftp(dropbox.ftp.com)/incoming as sosreport-hiqa-rhel2-20190412113342.tar.xz

Created attachment 1554879 [details]
multipath_ll_v3
Does this work correctly if you only use 2 controllers? It does seem odd that multipathd isn't reloading the pathgroups, but having 3 controllers' worth of paths in one pathgroup doesn't seem right either (even though that's how group_by_prio will group them). The problem is that when you go to activate that pathgroup, the kernel will try to run the alua activation function on all the devices in the pathgroup, but they don't all belong to the same controller. Unfortunately, there really isn't a path grouping policy designed to separate the paths by controller. If I made some test packages that printed debugging information when multipathd tried to reload the pathgroups, would you be able to try them out? Also, could you post the results of running:

# multipathd show maps topology
# multipathd show paths format "%d %D %n %r"

when the device is in the good initial state.

Yes, this works correctly if only 2 controllers are in use. But when one of the controllers changes to active, shouldn't prio group the active paths into a separate path group again and load it? Our array is designed this way to handle certain (replicated) volumes: the upstream array is active/stand-by and the downstream array is stand-by/stand-by for synchronously replicated volumes. When a volume failover occurs, the paths for this volume transition accordingly between the upstream and downstream arrays. Yes, we can retry the test with debug packages; please provide them.

[root@hiqa-rhel2 ~]# multipathd show maps topology
mpathb (2cfce8657e8b1c1b26c9ce9009a22694a) dm-4 Nimble ,Server
size=50G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 7:0:2:1 sdaj 66:48 active ready running
| |- 7:0:6:1 sdcx 70:80 active ready running
| |- 8:0:0:1 sdb 8:16 active ready running
| `- 8:0:3:1 sdaw 67:0 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 8:0:5:1 sdcc 69:0 active ghost running
|- 8:0:4:1 sdbm 68:0 active ghost running
|- 7:0:5:1 sdcg 69:64 active ghost running
|- 7:0:3:1 sdaz 67:48 active ghost running
|- 7:0:0:1 sdc 8:32 active ghost running
|- 7:0:1:1 sdt 65:48 active ghost running
|- 7:0:4:1 sdbp 68:48 active ghost running
|- 8:0:1:1 sdr 65:16 active ghost running
|- 8:0:2:1 sdag 66:0 active ghost running
|- 8:0:6:1 sdcr 69:240 active ghost running
|- 7:0:7:1 sddj 71:16 active ghost running
`- 8:0:7:1 sddk 71:32 active ghost running
mpathc (2711dbb0e839ad57a6c9ce9009a22694a) dm-5 Nimble ,Server
size=110G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 7:0:2:2 sdal 66:80 active ready running
| |- 7:0:6:2 sdcz 70:112 active ready running
| |- 8:0:0:2 sdd 8:48 active ready running
| `- 8:0:3:2 sday 67:32 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 8:0:5:2 sdcd 69:16 active ghost running
|- 7:0:5:2 sdci 69:96 active ghost running
|- 8:0:4:2 sdbo 68:32 active ghost running
|- 7:0:3:2 sdbb 67:80 active ghost running
|- 7:0:0:2 sde 8:64 active ghost running
|- 7:0:1:2 sdv 65:80 active ghost running
|- 7:0:4:2 sdbr 68:80 active ghost running
|- 8:0:1:2 sds 65:32 active ghost running
|- 8:0:2:2 sdai 66:32 active ghost running
|- 8:0:6:2 sdct 70:16 active ghost running
|- 7:0:7:2 sddl 71:48 active ghost running
`- 8:0:7:2 sddm 71:64 active ghost running
mpathd (213f4788a67729bfa6c9ce9009a22694a) dm-6 Nimble ,Server
size=120G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 7:0:2:3 sdan 66:112 active ready running
|- 7:0:6:3 sddb 70:144 active ready running | |- 8:0:0:3 sdf 8:80 active ready running | `- 8:0:3:3 sdba 67:64 active ready running `-+- policy='service-time 0' prio=1 status=enabled |- 8:0:5:3 sdcf 69:48 active ghost running |- 8:0:4:3 sdbq 68:64 active ghost running |- 7:0:5:3 sdcl 69:144 active ghost running |- 7:0:3:3 sdbd 67:112 active ghost running |- 7:0:0:3 sdg 8:96 active ghost running |- 7:0:1:3 sdx 65:112 active ghost running |- 7:0:4:3 sdbu 68:128 active ghost running |- 8:0:1:3 sdu 65:64 active ghost running |- 8:0:2:3 sdak 66:64 active ghost running |- 8:0:6:3 sdcu 70:32 active ghost running |- 8:0:7:3 sddo 71:96 active ghost running `- 7:0:7:3 sddn 71:80 active ghost running [root@hiqa-rhel2 ~]# multipathd show paths format "%d %D %n %r" dev dev_t target WWNN target WWPN sdc 8:32 0x56c9ce9081128300 0x56c9ce9081128305 sde 8:64 0x56c9ce9081128300 0x56c9ce9081128305 sdg 8:96 0x56c9ce9081128300 0x56c9ce9081128305 sdi 8:128 0x56c9ce9081128300 0x56c9ce9081128305 sdk 8:160 0x56c9ce9081128300 0x56c9ce9081128305 sdm 8:192 0x56c9ce9081128300 0x56c9ce9081128305 sdo 8:224 0x56c9ce9081128300 0x56c9ce9081128305 sdq 65:0 0x56c9ce9081128300 0x56c9ce9081128305 sdt 65:48 0x56c9ce9081128300 0x56c9ce9081128306 sdv 65:80 0x56c9ce9081128300 0x56c9ce9081128306 sdx 65:112 0x56c9ce9081128300 0x56c9ce9081128306 sdz 65:144 0x56c9ce9081128300 0x56c9ce9081128306 sdab 65:176 0x56c9ce9081128300 0x56c9ce9081128306 sdad 65:208 0x56c9ce9081128300 0x56c9ce9081128306 sdaf 65:240 0x56c9ce9081128300 0x56c9ce9081128306 sdah 66:16 0x56c9ce9081128300 0x56c9ce9081128306 sdaj 66:48 0x56c9ce9081128300 0x56c9ce9081128301 sdal 66:80 0x56c9ce9081128300 0x56c9ce9081128301 sdan 66:112 0x56c9ce9081128300 0x56c9ce9081128301 sdap 66:144 0x56c9ce9081128300 0x56c9ce9081128301 sdar 66:176 0x56c9ce9081128300 0x56c9ce9081128301 sdat 66:208 0x56c9ce9081128300 0x56c9ce9081128301 sdav 66:240 0x56c9ce9081128300 0x56c9ce9081128301 sdax 67:16 0x56c9ce9081128300 0x56c9ce9081128301 sdaz 67:48 0x56c9ce9081128300 0x56c9ce908112830e sdbb 67:80 0x56c9ce9081128300 0x56c9ce908112830e sdbd 67:112 0x56c9ce9081128300 0x56c9ce908112830e sdbf 67:144 0x56c9ce9081128300 0x56c9ce908112830e sdbg 67:160 0x56c9ce9081128300 0x56c9ce908112830e sdbi 67:192 0x56c9ce9081128300 0x56c9ce908112830e sdbk 67:224 0x56c9ce9081128300 0x56c9ce908112830e sdbn 68:16 0x56c9ce9081128300 0x56c9ce908112830e sdbp 68:48 0x56c9ce9081128300 0x56c9ce9081128309 sdbr 68:80 0x56c9ce9081128300 0x56c9ce9081128309 sdbu 68:128 0x56c9ce9081128300 0x56c9ce9081128309 sdbw 68:160 0x56c9ce9081128300 0x56c9ce9081128309 sdby 68:192 0x56c9ce9081128300 0x56c9ce9081128309 sdca 68:224 0x56c9ce9081128300 0x56c9ce9081128309 sdcb 68:240 0x56c9ce9081128300 0x56c9ce9081128309 sdce 69:32 0x56c9ce9081128300 0x56c9ce9081128309 sdcg 69:64 0x56c9ce9081128300 0x56c9ce908112830d sdci 69:96 0x56c9ce9081128300 0x56c9ce908112830d sdcl 69:144 0x56c9ce9081128300 0x56c9ce908112830d sdcn 69:176 0x56c9ce9081128300 0x56c9ce908112830d sdco 69:192 0x56c9ce9081128300 0x56c9ce908112830d sdcq 69:224 0x56c9ce9081128300 0x56c9ce908112830d sdcs 70:0 0x56c9ce9081128300 0x56c9ce908112830d sdcv 70:48 0x56c9ce9081128300 0x56c9ce908112830d sdcx 70:80 0x56c9ce9081128300 0x56c9ce9081128302 sdcz 70:112 0x56c9ce9081128300 0x56c9ce9081128302 sddb 70:144 0x56c9ce9081128300 0x56c9ce9081128302 sddd 70:176 0x56c9ce9081128300 0x56c9ce9081128302 sddf 70:208 0x56c9ce9081128300 0x56c9ce9081128302 sddg 70:224 0x56c9ce9081128300 0x56c9ce9081128302 sddh 70:240 0x56c9ce9081128300 0x56c9ce9081128302 sddi 71:0 0x56c9ce9081128300 0x56c9ce9081128302 
sdb 8:16 0x56c9ce9081128300 0x56c9ce9081128302 sdd 8:48 0x56c9ce9081128300 0x56c9ce9081128302 sdf 8:80 0x56c9ce9081128300 0x56c9ce9081128302 sdh 8:112 0x56c9ce9081128300 0x56c9ce9081128302 sdj 8:144 0x56c9ce9081128300 0x56c9ce9081128302 sdl 8:176 0x56c9ce9081128300 0x56c9ce9081128302 sdn 8:208 0x56c9ce9081128300 0x56c9ce9081128302 sdp 8:240 0x56c9ce9081128300 0x56c9ce9081128302 sdr 65:16 0x56c9ce9081128300 0x56c9ce9081128305 sds 65:32 0x56c9ce9081128300 0x56c9ce9081128305 sdu 65:64 0x56c9ce9081128300 0x56c9ce9081128305 sdw 65:96 0x56c9ce9081128300 0x56c9ce9081128305 sdy 65:128 0x56c9ce9081128300 0x56c9ce9081128305 sdaa 65:160 0x56c9ce9081128300 0x56c9ce9081128305 sdac 65:192 0x56c9ce9081128300 0x56c9ce9081128305 sdae 65:224 0x56c9ce9081128300 0x56c9ce9081128305 sdag 66:0 0x56c9ce9081128300 0x56c9ce9081128306 sdai 66:32 0x56c9ce9081128300 0x56c9ce9081128306 sdak 66:64 0x56c9ce9081128300 0x56c9ce9081128306 sdam 66:96 0x56c9ce9081128300 0x56c9ce9081128306 sdao 66:128 0x56c9ce9081128300 0x56c9ce9081128306 sdaq 66:160 0x56c9ce9081128300 0x56c9ce9081128306 sdas 66:192 0x56c9ce9081128300 0x56c9ce9081128306 sdau 66:224 0x56c9ce9081128300 0x56c9ce9081128306 sdaw 67:0 0x56c9ce9081128300 0x56c9ce9081128301 sday 67:32 0x56c9ce9081128300 0x56c9ce9081128301 sdba 67:64 0x56c9ce9081128300 0x56c9ce9081128301 sdbc 67:96 0x56c9ce9081128300 0x56c9ce9081128301 sdbe 67:128 0x56c9ce9081128300 0x56c9ce9081128301 sdbh 67:176 0x56c9ce9081128300 0x56c9ce9081128301 sdbj 67:208 0x56c9ce9081128300 0x56c9ce9081128301 sdbl 67:240 0x56c9ce9081128300 0x56c9ce9081128301 sdbm 68:0 0x56c9ce9081128300 0x56c9ce908112830e sdbo 68:32 0x56c9ce9081128300 0x56c9ce908112830e sdbq 68:64 0x56c9ce9081128300 0x56c9ce908112830e sdbs 68:96 0x56c9ce9081128300 0x56c9ce908112830e sdbt 68:112 0x56c9ce9081128300 0x56c9ce908112830e sdbv 68:144 0x56c9ce9081128300 0x56c9ce908112830e sdbx 68:176 0x56c9ce9081128300 0x56c9ce908112830e sdbz 68:208 0x56c9ce9081128300 0x56c9ce908112830e sdcc 69:0 0x56c9ce9081128300 0x56c9ce908112830d sdcd 69:16 0x56c9ce9081128300 0x56c9ce908112830d sdcf 69:48 0x56c9ce9081128300 0x56c9ce908112830d sdch 69:80 0x56c9ce9081128300 0x56c9ce908112830d sdcj 69:112 0x56c9ce9081128300 0x56c9ce908112830d sdck 69:128 0x56c9ce9081128300 0x56c9ce908112830d sdcm 69:160 0x56c9ce9081128300 0x56c9ce908112830d sdcp 69:208 0x56c9ce9081128300 0x56c9ce908112830d sdcr 69:240 0x56c9ce9081128300 0x56c9ce9081128309 sdct 70:16 0x56c9ce9081128300 0x56c9ce9081128309 sdcu 70:32 0x56c9ce9081128300 0x56c9ce9081128309 sdcw 70:64 0x56c9ce9081128300 0x56c9ce9081128309 sdcy 70:96 0x56c9ce9081128300 0x56c9ce9081128309 sdda 70:128 0x56c9ce9081128300 0x56c9ce9081128309 sddc 70:160 0x56c9ce9081128300 0x56c9ce9081128309 sdde 70:192 0x56c9ce9081128300 0x56c9ce9081128309 sddj 71:16 0x56c9ce9081128300 0x56c9ce908112830a sddk 71:32 0x56c9ce9081128300 0x56c9ce908112830a sddl 71:48 0x56c9ce9081128300 0x56c9ce908112830a sddm 71:64 0x56c9ce9081128300 0x56c9ce908112830a sddo 71:96 0x56c9ce9081128300 0x56c9ce908112830a sddn 71:80 0x56c9ce9081128300 0x56c9ce908112830a sddp 71:112 0x56c9ce9081128300 0x56c9ce908112830a sddq 71:128 0x56c9ce9081128300 0x56c9ce908112830a sdds 71:160 0x56c9ce9081128300 0x56c9ce908112830a sddr 71:144 0x56c9ce9081128300 0x56c9ce908112830a sddt 71:176 0x56c9ce9081128300 0x56c9ce908112830a sddu 71:192 0x56c9ce9081128300 0x56c9ce908112830a sddv 71:208 0x56c9ce9081128300 0x56c9ce908112830a sddw 71:224 0x56c9ce9081128300 0x56c9ce908112830a sddx 71:240 0x56c9ce9081128300 0x56c9ce908112830a sddy 128:0 0x56c9ce9081128300 0x56c9ce908112830a sddz 128:16 
0x56c9ce9081128300 0x56c9ce9081128305
sdea 128:32 0x56c9ce9081128300 0x56c9ce9081128306
sdeb 128:48 0x56c9ce9081128300 0x56c9ce9081128301
sdec 128:64 0x56c9ce9081128300 0x56c9ce9081128302
sded 128:80 0x56c9ce9081128300 0x56c9ce9081128302
sdee 128:96 0x56c9ce9081128300 0x56c9ce9081128305
sdef 128:112 0x56c9ce9081128300 0x56c9ce9081128306
sdeg 128:128 0x56c9ce9081128300 0x56c9ce9081128301
sdeh 128:144 0x56c9ce9081128300 0x56c9ce9081128302
sdei 128:160 0x56c9ce9081128300 0x56c9ce9081128302
sdej 128:176 0x56c9ce9081128300 0x56c9ce9081128305
sdek 128:192 0x56c9ce9081128300 0x56c9ce9081128306
sdel 128:208 0x56c9ce9081128300 0x56c9ce9081128301
sdem 128:224 0x56c9ce9081128300 0x56c9ce9081128305
sden 128:240 0x56c9ce9081128300 0x56c9ce9081128306
sdeo 129:0 0x56c9ce9081128300 0x56c9ce9081128301

Test rpms are available here:

http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL7/bz1699486/

These will simply add some debugging messages to help figure out why the pathgroups aren't getting reloaded.

Hi, sorry for the delay on this. Please find attached the multipath and journal logs captured with the debug packages you provided.

Created attachment 1568260 [details]
multipath output after failover tests with replicated volumes
Created attachment 1568262 [details]
multipath output before failover test
Created attachment 1568263 [details]
dmsetup table output after test
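The target WWPN column (`%r`) in the `multipathd show paths format "%d %D %n %r"` output posted above is what ties each path to a controller port. A quick way to summarize paths per controller port when reviewing such output (a hypothetical one-liner, not part of the test packages or procedure):

```
# Count paths per target WWPN (i.e., per controller port); NR>1 skips the header line
multipathd show paths format "%d %D %n %r" | awk 'NR>1 { c[$4]++ } END { for (p in c) print p, c[p] " paths" }'
```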
journal logs copied to dropbox.redhat.com/incoming: journalctl_xb_after_1699486.tar.gz

Unfortunately, with the multipath verbosity turned up, it looks like some messages got lost:

May 13 23:05:05 hiqa-rhel2 rsyslogd[7485]: imjournal: 345680 messages lost due to rate-limiting

so I can't follow exactly what happened, but I'm pretty sure I understand why the multipath device isn't reloading when it should, and I've made some new test patches which will hopefully fix that. They're available here:

http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL7/bz1699486/

If you try to recreate the issue with these patches and the verbosity set back to 2, that would be helpful. Even if the multipath devices still end up in the wrong state, these patches still have the debugging messages, and with the lower verbosity, hopefully no messages will be lost.

Hi Ben, sorry for the delay on this. We were able to test with the RPMs you provided, and path group updates are working as expected. Before and after controller failover, we see the active/ghost paths updated correctly. Thanks for the fix. Can you let us know when these will be available?

Before failover of replicated groups:

mpathf (257d7aae7dc32b5e86c9ce9009a22694a) dm-8 Nimble ,Server
size=120G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 8:0:2:5 sdap 66:144 active ready running
| |- 8:0:1:5 sdy 65:128 active ready running
| |- 1:0:0:5 sdk 8:160 active ready running
| `- 1:0:1:5 sdaa 65:160 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 8:0:6:5 sddd 70:176 active ghost running
|- 1:0:3:5 sdbf 67:144 active ghost running
|- 1:0:5:5 sdcm 69:160 active ghost running
|- 8:0:4:5 sdbw 68:160 active ghost running
|- 8:0:3:5 sdbg 67:160 active ghost running
|- 8:0:0:5 sdj 8:144 active ghost running
|- 1:0:7:5 sddq 71:128 active ghost running
|- 1:0:2:5 sdaq 66:160 active ghost running
|- 1:0:4:5 sdbv 68:144 active ghost running
|- 1:0:6:5 sddb 70:144 active ghost running
|- 8:0:5:5 sdcl 69:144 active ghost running
`- 8:0:7:5 sdds 71:160 active ghost running

After failover:

mpathf (257d7aae7dc32b5e86c9ce9009a22694a) dm-8 Nimble ,Server
size=120G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 8:0:6:5 sddd 70:176 active ready running
| |- 1:0:3:5 sdbf 67:144 active ready running
| |- 1:0:5:5 sdcm 69:160 active ready running
| `- 8:0:4:5 sdbw 68:160 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 8:0:2:5 sdap 66:144 active ghost running
|- 8:0:1:5 sdy 65:128 active ghost running
|- 1:0:0:5 sdk 8:160 active ghost running
|- 1:0:1:5 sdaa 65:160 active ghost running
|- 8:0:3:5 sdbg 67:160 active ghost running
|- 8:0:0:5 sdj 8:144 active ghost running
|- 1:0:7:5 sddq 71:128 active ghost running
|- 1:0:2:5 sdaq 66:160 active ghost running
|- 1:0:4:5 sdbv 68:144 active ghost running
|- 1:0:6:5 sddb 70:144 active ghost running
|- 8:0:5:5 sdcl 69:144 active ghost running
`- 8:0:7:5 sdds 71:160 active ghost running

This didn't make rhel-7.7, so it will be in rhel-7.8.

Hello shiva, there is no Nimble storage in our lab; could you provide a test result once the fix is in rhel-7.8? Thanks in advance!

This can kinda be tested without any special setup.
The easiest way is to run multipathd with the verbosity increased to 3 in /etc/multipath.conf, fail all the paths to a multipath device, wait for the multipathd checker to realize that they have failed, and then run

# multipathd reload <multipath_device>

Without this fix, you will see something like this in the logs:

multipathd: <device>: set ACT_NOTHING (no usable path)

With this fix, you will see something like this:

multipathd: <device>: set ACT_RELOAD (forced by user)
multipathd: <device>: load table [<new_dm_table>]

which means that the map was reloaded. This doesn't trigger the reload the same way as the bug, but it does verify that multipathd will do reloads when all paths are down.

Thanks a lot for the fix!

Reproduced on device-mapper-multipath-0.4.9-119.el7
1,[root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-119.el7.x86_64
device-mapper-multipath-0.4.9-119.el7.x86_64
2,[root@storageqe-06 ~]# multipath -ll 360a98000324669436c2b45666c56786f
360a98000324669436c2b45666c56786f dm-3 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdm 8:192 active ready running
| `- 4:0:1:1 sdh 8:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 1:0:1:1 sdr 65:16 active ready running
`- 4:0:0:1 sdc 8:32 active ready running
3,set verbosity to 3 in /etc/multipath.conf
[root@storageqe-06 ~]# cat /etc/multipath.conf
defaults {
find_multipaths yes
user_friendly_names yes
verbosity 3
}
4,[root@storageqe-06 ~]# service multipathd restart
Redirecting to /bin/systemctl restart multipathd.service
5,[root@storageqe-06 ~]# multipathd show config | grep verbosity
verbosity 3
6,[root@storageqe-06 ~]# multipathd fail path /dev/sdm
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdh
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdr
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdc
ok
7,[root@storageqe-06 ~]# multipath -ll 360a98000324669436c2b45666c56786f
......
360a98000324669436c2b45666c56786f dm-3 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=enabled
| |- 1:0:0:1 sdm 8:192 failed ready running
| `- 4:0:1:1 sdh 8:112 failed ready running
`-+- policy='service-time 0' prio=10 status=active
|- 1:0:1:1 sdr 65:16 failed ready running
`- 4:0:0:1 sdc 8:32 failed ready running
.......
8,[root@storageqe-06 ~]# multipathd reload multipath 360a98000324669436c2b45666c56786f
ok
9,check log
Jan 21 22:07:29 storageqe-06 multipathd: 360a98000324669436c2b45666c56786f: set ACT_NOTHING (no usable path)
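The same manual sequence (fail every path, wait for the checker, reload, check the log) can be scripted for repeated runs. A rough sketch, assuming the map name and its path devices are passed as arguments and that verbosity 3 is already set as in step 3 above; the script name and wait time are illustrative:

```
#!/bin/bash
# Usage: ./reload_check.sh <map_name> <path_dev> [<path_dev> ...]
# e.g.:  ./reload_check.sh 360a98000324669436c2b45666c56786f sdm sdh sdr sdc
map="$1"; shift

for dev in "$@"; do
    multipathd fail path "/dev/$dev"        # fail every path of the map
done

sleep 15                                     # give the path checker a few polling intervals

multipathd reload multipath "$map"           # same reload as step 8

# Without the fix the daemon logs "set ACT_NOTHING (no usable path)";
# with the fix it logs "set ACT_RELOAD (forced by user)" and a "load table [...]" line.
journalctl -u multipathd --since "10 minutes ago" | grep -E 'ACT_NOTHING|ACT_RELOAD|load table'

for dev in "$@"; do
    multipathd reinstate path "/dev/$dev"    # restore the paths afterwards
done
```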
Verified on device-mapper-multipath-0.4.9-131.el7
1,[root@storageqe-06 ~]# rpm -qa | grep multipath
device-mapper-multipath-devel-0.4.9-131.el7.x86_64
device-mapper-multipath-0.4.9-131.el7.x86_64
device-mapper-multipath-libs-0.4.9-131.el7.x86_64
2,[root@storageqe-06 ~]# multipath -ll 360a98000324669436c2b45666c56786f
360a98000324669436c2b45666c56786f dm-0 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdc 8:32 active ready running
| `- 4:0:1:1 sdr 65:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 1:0:1:1 sdh 8:112 active ready running
`- 4:0:0:1 sdm 8:192 active ready running
3,set verbosity to 3 in /etc/multipath.conf
[root@storageqe-06 ~]# cat /etc/multipath.conf
defaults {
find_multipaths yes
user_friendly_names yes
verbosity 3
}
4,[root@storageqe-06 ~]# service multipathd restart
Redirecting to /bin/systemctl restart multipathd.service
5,[root@storageqe-06 ~]# multipathd show config | grep verbosity
verbosity 3
6,[root@storageqe-06 ~]# multipathd fail path /dev/sdm
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdh
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdr
ok
[root@storageqe-06 ~]# multipathd fail path /dev/sdc
ok
7,[root@storageqe-06 ~]# multipath -ll 360a98000324669436c2b45666c56786f
......
360a98000324669436c2b45666c56786f dm-0 NETAPP ,LUN
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdc 8:32 failed ready running
| `- 4:0:1:1 sdr 65:16 failed ready running
`-+- policy='service-time 0' prio=10 status=enabled
|- 1:0:1:1 sdh 8:112 failed ready running
`- 4:0:0:1 sdm 8:192 failed ready running
.......
8,[root@storageqe-06 ~]# multipathd reload multipath 360a98000324669436c2b45666c56786f
ok
9,check log
Jan 21 22:57:20 storageqe-06 multipathd: 360a98000324669436c2b45666c56786f: set ACT_RELOAD (forced by user)
Jan 21 22:57:20 storageqe-06 multipathd: 360a98000324669436c2b45666c56786f: load table [0 4194304 multipath 4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handler 1 alua 2 1 service-time 0 2 1 8:32 1 65:16 1 service-time 0 2 1 8:112 1 8:192 1]
Test result: multipathd will do reloads when all paths are down.
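As an optional cross-check (not part of the recorded QA run), the table that multipathd logged in step 9 can be compared against what the kernel actually has loaded:

```
# Dump the live device-mapper table for the map; it should essentially match
# the "load table [...]" line multipathd logged above.
dmsetup table 360a98000324669436c2b45666c56786f
```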
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1066 |
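On a RHEL 7 host whose configured repositories provide updateinfo metadata, the advisory referenced above can be inspected and applied roughly as follows (illustrative commands, not part of the bug record):

```
# Show the advisory details and pull in the fixed packages
yum updateinfo info RHBA-2020:1066
yum update device-mapper-multipath device-mapper-multipath-libs
```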
Description of problem:
After controller failover, multipath is not switching to the new active paths; instead, I/O is repeatedly issued on stand-by paths. Below is the map state after controller failover. The active paths (sdal, sdcz, sdd, sday) should be grouped with prio 50 after failover, but that path group also contains other stand-by paths.

mpathc (2711dbb0e839ad57a6c9ce9009a22694a) dm-5 Nimble ,Server
size=110G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=1 status=enabled
| |- 7:0:3:2 sdbb 67:80 active ghost running
| |- 7:0:5:2 sdci 69:96 active ghost running
| |- 8:0:4:2 sdbo 68:32 active ghost running
| `- 8:0:5:2 sdcd 69:16 active ghost running
`-+- policy='service-time 0' prio=20 status=active
|- 7:0:0:2 sde 8:64 failed ghost running
|- 7:0:1:2 sdv 65:80 failed ghost running
|- 7:0:2:2 sdal 66:80 active ready running
|- 7:0:4:2 sdbr 68:80 failed ghost running
|- 7:0:6:2 sdcz 70:112 active ready running
|- 8:0:0:2 sdd 8:48 active ready running
|- 8:0:1:2 sds 65:32 failed ghost running
|- 8:0:2:2 sdai 66:32 failed ghost running
|- 8:0:3:2 sday 67:32 active ready running
`- 8:0:6:2 sdct 70:16 failed ghost running

Version-Release number of selected component (if applicable):
# rpm -qa | grep device-mapper-multipath
device-mapper-multipath-libs-0.4.9-119.el7.x86_64
device-mapper-multipath-0.4.9-119.el7.x86_64
# uname -a
Linux hiqa-rhel2 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Consistently, on Oracle disks.

Steps to Reproduce:
1. Connect a few volumes to the host
2. Create an ASM diskgroup and bring up the Oracle DB
3. Fail over the controller

Actual results:
Paths are not grouped correctly by active and stand-by TPG state. Active and stand-by paths are part of the same path group, causing continuous failures due to 2/4/b check conditions and path reinstates.

Expected results:
Active and stand-by paths are placed in different path groups.

Additional info:
Initial state (we have 3 controllers in stand-by and one active, hence more stand-by paths):

mpathc (2711dbb0e839ad57a6c9ce9009a22694a) dm-5 Nimble ,Server
size=110G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 8:0:0:2 sdd 8:48 active ready running
| |- 7:0:2:2 sdal 66:80 active ready running
| |- 7:0:6:2 sdcz 70:112 active ready running
| `- 8:0:3:2 sday 67:32 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
|- 7:0:3:2 sdbb 67:80 active ghost running
|- 7:0:5:2 sdci 69:96 active ghost running
|- 8:0:4:2 sdbo 68:32 active ghost running
|- 8:0:5:2 sdcd 69:16 active ghost running
|- 7:0:0:2 sde 8:64 active ghost running
|- 7:0:1:2 sdv 65:80 active ghost running
|- 7:0:4:2 sdbr 68:80 active ghost running
|- 8:0:1:2 sds 65:32 active ghost running
|- 8:0:2:2 sdai 66:32 active ghost running
`- 8:0:6:2 sdct 70:16 active ghost running

One controller switches from active to stand-by and another controller switches the other way. After this, as shown below, the path states have changed from ghost to active/ready, but the path groups have not changed.
mpathc (2711dbb0e839ad57a6c9ce9009a22694a) dm-5 Nimble ,Server
size=110G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=1 status=enabled
| |- 7:0:2:2 sdal 66:80 active ghost running
| |- 7:0:6:2 sdcz 70:112 active ghost running
| |- 8:0:0:2 sdd 8:48 active ghost running
| `- 8:0:3:2 sday 67:32 active ghost running
`-+- policy='service-time 0' prio=20 status=active
|- 7:0:0:2 sde 8:64 active ghost running
|- 7:0:1:2 sdv 65:80 failed ghost running
|- 7:0:3:2 sdbb 67:80 active ready running
|- 7:0:4:2 sdbr 68:80 failed ghost running
|- 7:0:5:2 sdci 69:96 active ready running
|- 8:0:1:2 sds 65:32 active ghost running
|- 8:0:2:2 sdai 66:32 failed ghost running
|- 8:0:4:2 sdbo 68:32 active ready running
|- 8:0:5:2 sdcd 69:16 active ready running
`- 8:0:6:2 sdct 70:16 active ghost running

Multipath settings:

# cat /etc/multipath.conf
defaults {
    user_friendly_names yes
    find_multipaths no
    verbosity 3
}
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z]"
    device {
        vendor ".*"
        product ".*"
    }
}
blacklist_exceptions {
    device {
        vendor "Nimble"
        product "Server"
    }
}
devices {
    device {
        vendor "Nimble"
        product "Server"
        path_grouping_policy group_by_prio
        prio "alua"
        hardware_handler "1 alua"
        path_selector "service-time 0"
        path_checker tur
        no_path_retry 30
        failback immediate
        dev_loss_tmo infinity
        fast_io_fail_tmo 5
        rr_weight uniform
        rr_min_io_rq 1
    }
}

sosreport will be attached, time around Apr 12 11:38
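With group_by_prio and failback immediate configured as above, the regrouping during a controller failover can be watched live. An illustrative one-liner, assuming the standard multipathd path-format wildcards (%t = dm state, %T = checker state, %p = ALUA-derived priority):

```
# Refresh every 5 seconds; after a failover the prio=50 paths should move
# into (and stay in) the active path group.
watch -n 5 'multipathd show paths format "%i %d %t %T %p"'
```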