+++ This bug was initially created as a clone of Bug #155738 +++

Some scenarios might wish to minimize the switching of priority groups as far as possible and not have multipath-tools switch priority groups automatically; this behaviour must be configurable. Likewise, switching back to the preferred priority group when paths in it become available again should be configurable. In both cases, it may be desirable not only to be able to disable the switch entirely, but also to have a timer before actually switching, to allow the situation to stabilize first.

-- Additional comment from christophe.varoqui on 2005-04-22 18:45 EST --

What we have today as a switch-pg policy is what could be described as "opportunistic", i.e. if another path group is seen as better, switch to it even if the current one is in a working state. What you suggest adding would be a "last_resort" policy.

I would need to add two new struct multipath fields:

- int switchover_policy (opportunistic or last_resort)
- int switchback_policy (true or false)

These properties would be definable in the devices{} and multipaths{} config blocks. That is definitely doable. Would it fit your needs?

-- Additional comment from lmb on 2005-04-25 08:27 EST --

Yes. However, I'd propose the following scheme:

pg_select_policy (max_paths|highest_prio)

max_paths would select the PG with the most paths (as now); highest_prio would switch to the PG with the highest priority that still has available paths.

pg_switching_policy (<timer>|controlled)

<timer> would switch automatically when pg_select_policy suggests a new PG (in response to some event) after <timer> seconds; "0" would switch immediately. This is to give the system some time to stabilize - i.e., most of the time you would not want to switch _immediately_, but give it 2-3 path checker iterations to make sure the path is there to stay. "controlled" would only switch to the selected PG (or in fact to any other PG) when the admin tells us to (or an error forces us). (Which relates to bug #155546 ;-)

(Automated and controlled failback/switch-over are well-established terms in the HA clustering world for resource migration behaviour, so it makes sense to reuse the concepts here.)

Does that make sense?
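To make the proposal concrete, here is a purely hypothetical multipaths{} entry using the option names suggested above. Neither pg_select_policy nor pg_switching_policy exists in multipath-tools; the syntax is only a sketch of what this comment describes, and the wwid is a placeholder:

multipaths {
        multipath {
                wwid                    <wwid of the LU>
                # hypothetical: prefer the highest-priority PG that has usable paths
                pg_select_policy        highest_prio
                # hypothetical: wait 30 seconds (a few path checker iterations)
                # before switching; "controlled" instead of a number would mean
                # switch only on admin request or when an error forces it
                pg_switching_policy     30
        }
}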
That dm-multipath gratuitously switches paths is, in my opinion, a rather important bug.

I have what I imagine to be a fairly common setup with dual independent fabrics, looking like this:

HOST 1                 HOST 2
 |   |                  |   |
 |   |                  |   |
 |   +--------C---------+   |
 | A |                  | D |
 |   +--------B---------+   |
 |   |                  |   |
 |   |                  |   |
SWITCH 1              SWITCH 2
    |                      |
    +- SP A   (ARRAY)   SP B -+

Hosts are SunFire X4100s, running RHEL4 U3, with a dual-port QLogic HBA identifying itself as a QLA2422 (but knowing QLogic, that's probably a lie), switches are McData Sphereon 4x00s, and the array is an EMC² CX200, which is an active/passive unit - only one SP (controller) accepts I/O at one time.

Oh, and:

device-mapper-multipath-0.4.5-12.0.RHEL4
kernel-smp-2.6.9-34.EL

(Both the most recent at the time of writing.)

multipath.conf contains:

device {
        vendor                  "DGC "
        product                 "*"
        hardware_handler        "1 emc"
        features                "1 queue_if_no_path"
        path_checker            emc_clariion
        path_grouping_policy    failover
        failback                manual
}

I have been unable to make it NOT switch paths at startup. It insists on changing to SP A (possibly because that one has the lesser minor number? SP A is sdb/8:16 and SP B sdc/8:32). Even when SP A was already the active controller, it still logs that it sends the "trespass" command to move the LU (I assume it's to SP A, because the LU doesn't actually move anywhere).

Output from multipath -ll, in case it's interesting:

fjas (36006017cd00e0000414744ed98e1d911)
[size=1 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
 \_ 21:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 22:0:0:0 sdc 8:32 [active][ready]

Switching paths at startup isn't really problematic in a one-host scenario, and I could have used prio_callout "/sbin/mpath_prio_emc /dev/%n" to make it prefer the PG corresponding to the SP marked as "default" in the array (curiously enough, even when testing with this setup dm-multipath still ALWAYS moves the LU to SP A, just like without any prio_callout, but moves it back to SP B at the first I/O if that is the SP marked as default in the array configuration).

However, this behaviour is quite problematic for clusters with shared storage, like mine. Imagine that path C in my drawing fails. dm-multipath on host 2 will correctly fail the path and move the LU to SP B if necessary, and host 1 will follow. However, at this point I cannot reboot host 1 without adversely affecting host 2, because host 1 will move the LU to SP A at init - in effect yanking the only healthy path host 2 has left. Bad, bad, bad.

If you're lucky enough that queue_if_no_path is the desired behaviour, what you get is a few seconds (that feel like an eternity) where everything is blocked, before multipathd realizes that the path is alive after all and reinstates it. (I have not even dared to find out how Cluster Manager will handle this, if its shared state or tie breaker is affected.) If queue_if_no_path isn't for you, well, you're screwed. So much for fault tolerance.

Please, someone tell me that I'm wrong, that this is UBD - me being unable to configure it correctly. There must be a way to avoid this behaviour... right?

Tore Anderson
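As an aside, one way to see which path group the EMC priority callout would prefer is to run it by hand against each path device. This is only a sketch: it assumes /sbin/mpath_prio_emc is installed (as implied by the prio_callout line above) and that, like other multipath priority callouts, it prints a numeric priority on stdout:

# The path through the SP that currently owns the LU should report the higher priority.
/sbin/mpath_prio_emc /dev/sdb
/sbin/mpath_prio_emc /dev/sdc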
You should only use the group_by_prio path grouping policy with a CLARiiON; otherwise you run the risk of (1) incurring unnecessary path re-assignment operations and (2) missing out on potential path load sharing opportunities (if you had multiple active paths). This is your initial problem.

You are getting the behavior you are seeing with the failover pg policy because the path group selection code chooses the first path group it finds (which is also the first one created) amongst path groups which have the same priority. This path group may or may not be associated with the service processor currently owning the logical unit (I assume the current SP is already associated with the default owning SP). So when you actually try to do I/O, the odds are 50/50 that you will need to switch to the other path group. These are not good odds.

As far as the larger problem is concerned, no solution I've seen yet addresses the need to adapt the failback policy based on the current state. I see the merit in a new adaptive pg_failback behavior where an automated path group failback operation would only be performed by the multipath-tools software when the most recent path group failover (pg_init of a new path group) was also initiated by the multipath-tools software. Conversely, the automated path group failback would not be initiated by multipath-tools if the path group failover was initiated by another host, or by software on that host external to multipath-tools. None of the current path group failback policies, nor the ones described so far on this bugzilla, provide this adaptive behavior -- they are all static policies.

For active/passive storage arrays like the CLARiiON, there is real benefit (a static load balancing/sharing of active paths to logical units across the two service processors) in always following the opportunistic policy UNLESS the path group was changed manually or via another host in the first place. It is best if our code can detect and adapt to this situation, rather than requiring that the failback policy be changed beforehand in these situations.

I've been trying to develop a solution which will prevent unnecessary trespass operations in multi-host configurations. At least for a CLARiiON, these configurations are not restricted to clusters but also involve (1) remote replication use cases and (2) cases where a storage system's default owning service processor is changed by storage management software (e.g., EMC Navisphere or Control Center for the CLARiiON).

I've submitted code upstream to Christophe in the form of a new failback policy which attempted to approximate this behavior without requiring kernel changes. The new failback policy was similar to the current opportunistic one but would only consider initiating failback when either (1) the priority of a path changed or (2) a path group with all failed paths had a path transition from failed to active. This policy would only fail to handle the use case you have described if path B for host 1 also failed and was reactivated after the failure of path C on host 2. Since this use case involves a double failure, it may be reasonable to live with.

I think Christophe is rejecting this code submission because he thinks the problem is specific to the CLARiiON and should be solved in CLARiiON-specific code (e.g., a path priority handler). I do not agree.
Ed,

Thanks for your feedback. I tried:

device {
        vendor                  "DGC "
        product                 "*"
        hardware_handler        "1 emc"
        features                "1 queue_if_no_path"
        path_checker            emc_clariion
        path_grouping_policy    group_by_prio
        failback                manual
}

As expected, this resulted in one PG with both paths, with the path to the passive controller flapping (i.e. multipathd reinstates it because it's healthy, but I/O to it fails, so dm-multipath fails it. Repeat ad infinitum). So I added «prio_callout "/sbin/mpath_prio_emc /dev/%n"», which I expect needs to be there anyway.

multipath -ll from when SP A is marked as being the default:

fjas (36006017cd00e0000414744ed98e1d911)
[size=1 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][enabled]
 \_ 25:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [enabled]
 \_ 26:0:0:0 sdc 8:32 [active][ready]

I've tested various situations to see what happened:

Initial SP   Default SP   Trespassed at init?    Trespassed at first I/O?
A            A            Yes (no effect)        Yes (no effect)
B            A            Yes (moved to SP A)    Yes (no effect)
A            B            Yes (no effect)        Yes (moved to SP B)
B            B            Yes (moved to SP A)    Yes (moved to SP B)

So no matter what I do, the LU ends up getting moved to SP A, but moved back again to SP B if that's the default when I send some I/O there. Hardly desirable behaviour. Just for the heck of it, I tried filling out the table using the failover grouping policy instead, and the results are exactly the same (should I have expected any difference?).

I think the reason why it always winds up at SP A at first is how the RH scripts integrate with udev. In /etc/dev.d/block/multipath.dev I see:

if /sbin/lsmod | /bin/grep "^dm_multipath" > /dev/null 2>&1 && \
   [ ${mpath_ctl_fail} -eq 0 ] && \
   [ "${ACTION}" = "add" -a "${DEVPATH:7:3}" != "dm-" ] ; then
        /sbin/multipath -v0 ${DEVNAME} > /dev/null 2>&1
fi

I'm assuming that this /sbin/multipath invocation always happens for /dev/sdb before it does for /dev/sdc. That probably means that the path to SP A is the only path to the LU for a brief period, and that is perhaps enough to (bogusly) trigger a trespass - even though no I/O has been sent to the resulting DM device yet. Do you agree?

If so, I guess this needs to be handled either by making the multipath utility not cause any trespasses on its own (only a failing path should), or by RH reworking their scripts so that the multipath utility is not called until the HBA driver has finished registering all the block devices, making both paths available to dm-multipath simultaneously.

Tore Anderson
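For completeness, the device block under test once the prio_callout line was added would read roughly as follows (simply the block above plus that one line):

device {
        vendor                  "DGC "
        product                 "*"
        hardware_handler        "1 emc"
        prio_callout            "/sbin/mpath_prio_emc /dev/%n"
        features                "1 queue_if_no_path"
        path_checker            emc_clariion
        path_grouping_policy    group_by_prio
        failback                manual
}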
RHEL6 has a new failback option, "followover", that allows you to do this. There are no plans to backport this. If this is a problem, please reopen this bug with a comment.
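For anyone hitting this on RHEL6 or later, here is a minimal sketch of what using followover looks like in multipath.conf; the vendor/product strings are illustrative and the rest of the CLARiiON device settings are assumed to come from the built-in defaults:

devices {
        device {
                vendor          "DGC"
                product         ".*"
                # followover: only fail back automatically when a path in the
                # preferred path group comes back up, so this host does not
                # fail back merely because another host (or the array admin)
                # moved the LU to the other SP
                failback        followover
        }
}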