I guess the suggested approach should be fine.

It might of course be arguable whether we have to check that at all, as sbd (called as a client cmdline tool) returns success anyway as long as it was able to successfully write to a majority of the devices.

Anyway, it shouldn't hurt, and maybe there was a reason to check that I'm not aware of atm.
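For illustration only, a minimal Python sketch of what that means in practice: a caller can simply invoke the sbd client binary and rely on its exit code, since sbd itself reports success once it managed to write to a majority of the listed devices. The helper name and device paths are placeholders, not actual fence_sbd code.

import subprocess

def send_reset(devices, target_node):
    # Hypothetical helper, not fence_sbd code: build
    #   sbd -d <dev1> -d <dev2> ... message <node> reset
    # and trust sbd's own majority handling via its exit code.
    cmd = ["sbd"]
    for dev in devices:
        cmd += ["-d", dev]
    cmd += ["message", target_node, "reset"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # sbd returns 0 as long as it could write to a majority of the devices.
    return result.returncode == 0

# e.g. send_reset(["/dev/sdb", "/dev/sdc", "/dev/sdd"], "node2")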
(In reply to Klaus Wenninger from comment #1)
> I guess the suggested approach should be fine.
>
> It might of course be arguable whether we have to check that at all,
> as sbd (called as a client cmdline tool) returns success anyway as
> long as it was able to successfully write to a majority of the
> devices.
>
> Anyway, it shouldn't hurt, and maybe there was a reason to check that
> I'm not aware of atm.

As we're already doing the checks for all actions on the fencing device (incl. monitor), we might just keep it as it is, since we will get a heads-up in the logs telling us that there is something fishy with the devices prior to actual fencing. In case of actual fencing we should just proceed anyway.

Verification of msg-timeout (for consistency across all the devices and with power-timeout) should probably only be done if all devices are readable, though. And we probably shouldn't bail out at the first failing device, for the sake of more complete info. A sketch of such a check follows below.
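A sketch of the kind of check described above, assuming the "Timeout (msgwait)" line in the output of "sbd -d <dev> dump"; the helper name and the warning wording are illustrative, not the actual fence_sbd implementation. Every device is inspected, unreadable devices are only reported, and the msgwait consistency check is skipped unless all devices could be read.

import re
import subprocess

def check_devices(devices):
    msgwait = {}
    unreadable = []
    for dev in devices:
        res = subprocess.run(["sbd", "-d", dev, "dump"],
                             capture_output=True, text=True)
        if res.returncode != 0:
            # Don't bail out here; keep collecting info on the other devices.
            unreadable.append(dev)
            continue
        m = re.search(r"Timeout\s*\(msgwait\)\s*:\s*(\d+)", res.stdout)
        if m:
            msgwait[dev] = int(m.group(1))
    # Only verify msgwait consistency if every device could be read.
    if not unreadable and len(set(msgwait.values())) > 1:
        print("WARNING: msgwait differs across devices:", msgwait)
    for dev in unreadable:
        print("WARNING: sbd device not readable:", dev)
    # True if all devices were readable.
    return not unreadable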
(In reply to Klaus Wenninger from comment #1)
> I guess the suggested approach should be fine.
>
> It might of course be arguable whether we have to check that at all,
> as sbd (called as a client cmdline tool) returns success anyway as
> long as it was able to successfully write to a majority of the
> devices.
>
> Anyway, it shouldn't hurt, and maybe there was a reason to check that
> I'm not aware of atm.

It looks as if a fix is substantially more complicated than anticipated.

Not doing the device check for "reboot" & "off" is of course one step in the right direction, but we will still block on 3 things:

- list triggered by pacemaker (can be prevented by setting pcmk_host_list)
- list triggered by fence_sbd via get_power_status
- dump triggered by fence_sbd retrieving msg-wait-timeout

There is btw. a still-open PR (https://github.com/ClusterLabs/sbd/pull/119) for similar issues with the fencing script shipped with the sbd package. One quick way would be to go the same route in fence_sbd as suggested in that PR: trigger the actions separately for each device in parallel and continue once there is a result from one of the devices (maybe waiting for a 2nd makes sense ...) - see the sketch at the end of this comment. Of course this double implementation shows the ugliness of the approach compared to an implementation in the sbd binary itself, where a parallel approach is already in place for the actual slot messaging.

Alternatively we could memorize/update the node list & msg-wait-timeout in attributes on "status" and use them on "list", "reboot" & "off", with the only remaining sbd interaction for those being the actual messaging, which just blocks until an error occurs or a quorate number of messages has been handed over.
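A minimal sketch of that parallel per-device route, assuming a Python implementation similar in spirit to fence_sbd; the function name and the per-device timeout are illustrative. Each device is queried in its own worker and the first answer wins, so a single hung disk can no longer stall "list" or "dump" indefinitely.

import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_first_device(devices, sbd_args, per_device_timeout=30):
    # Run e.g. "sbd -d <dev> list" or "sbd -d <dev> dump" per device
    # and return the output of whichever device answers first.
    def run_one(dev):
        res = subprocess.run(["sbd", "-d", dev] + sbd_args,
                             capture_output=True, text=True,
                             timeout=per_device_timeout)
        if res.returncode != 0:
            raise RuntimeError("sbd failed on %s" % dev)
        return res.stdout

    pool = ThreadPoolExecutor(max_workers=len(devices))
    futures = [pool.submit(run_one, d) for d in devices]
    try:
        for fut in as_completed(futures):
            try:
                return fut.result()   # first device that answers wins
            except Exception:
                continue              # hung/broken devices are skipped
        raise RuntimeError("no sbd device answered")
    finally:
        # Don't wait for workers still stuck on a hung device; the
        # per-device timeout bounds how long they can linger.
        pool.shutdown(wait=False)

# e.g. query_first_device(devs, ["list"]) or query_first_device(devs, ["dump"])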
(In reply to Klaus Wenninger from comment #12)
> Alternatively we could memorize/update the node list & msg-wait-timeout
> in attributes on "status" and use them on "list", "reboot" & "off",
> with the only remaining sbd interaction for those being the actual
> messaging, which just blocks until an error occurs or a quorate number
> of messages has been handed over.

This approach would work if a device somehow becomes inaccessible over time, or only from one of the nodes: we would still have the data in the attributes. But if we also want to cover the situation where one of the disks isn't accessible at fencing time, creating excessive hang-time (2 min in my test case using "dmsetup suspend ..."), we'd still need some parallel execution - e.g. along the lines of the sketch below.
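As an illustration only, and again assuming a Python agent (the names and the 120 s per-device timeout are made up): the reset message is sent to each device in its own worker and the action is considered successful as soon as a quorate number of devices have confirmed, so a disk hung via "dmsetup suspend" cannot block the whole fencing action.

import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def message_with_quorum(devices, target_node, msg="reset",
                        per_device_timeout=120):
    needed = len(devices) // 2 + 1   # quorate number of devices

    def send(dev):
        # One "sbd -d <dev> message <node> reset" per worker.
        res = subprocess.run(["sbd", "-d", dev, "message", target_node, msg],
                             capture_output=True, text=True,
                             timeout=per_device_timeout)
        return res.returncode == 0

    pool = ThreadPoolExecutor(max_workers=len(devices))
    futures = [pool.submit(send, d) for d in devices]
    confirmed = 0
    try:
        for fut in as_completed(futures):
            try:
                if fut.result():
                    confirmed += 1
            except Exception:
                pass                 # hung or failing device
            if confirmed >= needed:
                return True          # majority reached, fencing succeeded
        return False
    finally:
        pool.shutdown(wait=False)    # don't block on devices still hanging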
There is an upstream PR available that should be able to cope with the current state of the sbd binary: https://github.com/ClusterLabs/fence-agents/pull/497
Unfortunately, due to resource constraints and reprioritizations, I don't have a good estimate of when this work will be completed.