Description of problem:

Configuring a fence_ipmilan based STONITH device with a power_timeout value that includes an "s":

power_timeout=60s

Will cause STONITH to fail:

pcs stonith fence node1
Error: unable to fence 'node1'
Command failed: No route to host

The following is present in the cluster DC logs:

Sep 20 18:04:58 node2 user.notice python:detected unhandled Python exception in '/usr/sbin/fence_ipmilan'
Sep 20 18:05:17 node2 user.notice python:detected unhandled Python exception in '/usr/sbin/fence_ipmilan'
Sep 20 18:05:35 node2 daemon.err stonith-ng[4525]: error: Operation 'reboot' [32131] (call 2 from stonith_admin.31703) for host 'node2' with device 'fence_node2_ipmi' returned: -201 (Generic Pacemaker error)
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: db5 error(11) from dbenv->open: Resource temporarily unavailable ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: cannot open Packages index using db5 - Resource temporarily unavailable (11) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: cannot open Packages database in /var/lib/rpm ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ Traceback (most recent call last): ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/sbin/fence_ipmilan", line 186, in <module> ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ main() ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/sbin/fence_ipmilan", line 182, in main ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ result = fence_action(None, options, set_power_status, get_power_status, None, reboot_cycle) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/share/fence/fencing.py", line 964, in fence_action ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ status = get_multi_power_fn(tn, options, get_power_fn) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/share/fence/fencing.py", line 871, in get_multi_power_fn ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ plug_status = get_power_fn(tn, options) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/sbin/fence_ipmilan", line 17, in get_power_status ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ output = run_command(options, create_command(options, "status")) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ File "/usr/share/fence/fencing.py", line 1183, in run_command ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ timeout = float(timeout) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ ValueError: invalid literal for float(): 60s ]

Changing the STONITH configuration to remove the "s":

[root@node2~]# pcs stonith update fence_node1_ipmi power_timeout=60

Will allow STONITH operations to complete:

[root@node2 ~]# pcs stonith fence node1
Node: node1 fenced

I would think that the inclusion of "s" to denote seconds should not cause STONITH to fail.

Version-Release number of selected component (if applicable):

[root@e7359svin1637 ~]# rpm -qa | grep fence-agents-ipmilan
fence-agents-ipmilan-4.0.11-27.el7.x86_64

How reproducible:

100%
A fence_ipmilan based STONITH device configured with a power_timeout attribute that includes an "s" to denote seconds will fail 100% of the time.

Steps to Reproduce:

1. Configure a fence_ipmilan based STONITH device including a power_timeout attribute like "power_timeout=60s"
2. Attempt to fence a cluster node using the STONITH device configured above:

# pcs stonith fence node1

Actual results:

Fencing fails with a "No route to host error".

# pcs stonith fence node1
Error: unable to fence 'node1'
Command failed: No route to host

Expected results:

The node should be fenced:

# pcs stonith fence node1
Node: node1 fenced

Additional info:

None
Hi, I agree that no Python exception should be visible to users, and we will fix that. But suffixes like '[smh]' are not used in the cluster suite, so we will not support them. An appropriate error message should be displayed instead.
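For illustration only, here is a minimal sketch of how a non-numeric timeout could be reported with a readable message instead of the raw traceback. The helper name parse_timeout_seconds and the message wording are hypothetical and are not taken from fencing.py; the only detail drawn from the report is that float() is applied to the configured value and raises ValueError for "60s".

```python
def parse_timeout_seconds(name, raw):
    """Hypothetical helper (not part of fencing.py).

    Convert a configured timeout to float seconds, or raise ValueError
    with a message that names the offending option.
    """
    try:
        # float("60s") raises ValueError, which is the failure seen in the report.
        return float(raw)
    except (TypeError, ValueError):
        raise ValueError(
            "Invalid value %r for option '%s': expected a number of seconds, "
            "for example 60" % (raw, name)
        )


if __name__ == "__main__":
    try:
        parse_timeout_seconds("power_timeout", "60s")
    except ValueError as err:
        # A caller could log this and exit non-zero rather than crash.
        print("error: %s" % err)
```

A caller would catch the ValueError once at start-up, so the invalid value is rejected before any fencing command is run.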
Hi Marek,

An informative error message or just accepting and discarding suffixes like '[smh]' if they are configured would be useful. Thanks.

There seems to be some inconsistency around the requirement/acceptance of these suffixes for time based cluster suite parameters. In this case using a suffix causes a failure. In others we are directed to use a suffix in Red Hat High Availability Documentation.

e.g. From https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-fencedevicecreate-HAAR.html

"The following command creates a stonith device.
# pcs stonith create MyStonith fence_virt pcmk_host_list=f1 op monitor interval=30s"

e.g. From https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-resourceopts-HAAR.html

"In the following example, there is an existing resource named dummy_resource. This command sets the failure-timeout meta option to 20 seconds, so that the resource can attempt to restart on the same node in 20 seconds.
# pcs resource meta dummy_resource failure-timeout=20s"

If all time based cluster suite parameters can only be configured in seconds then a universal approach to accepting/rejecting/discarding suffixes like '[smh]' would make sense.
I understand your concerns and they make sense, even more so given the quotations from the documentation. Options such as 'interval=30s' belong to pacemaker/pcs and are not used by the fence agent at all. So the transformation from [smh] to seconds should be done in pcs. I will add a type 'seconds' to the fence agent options so that pcs knows which options should be translated. Afterwards, [smh] should work as expected.
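As a rough sketch of the kind of translation described here, the snippet below converts a value with an optional s, m, or h suffix into whole seconds. It is an assumption-laden illustration, not the pcs implementation: the function name to_seconds, the regular expression, and the choice to accept only integers with those three suffixes are all hypothetical.

```python
import re

# Hypothetical mapping covering only the suffixes discussed in this report.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"^\s*(\d+)\s*([smh]?)\s*$")


def to_seconds(raw):
    """Convert '60', '60s', '2m' or '1h' to an integer number of seconds.

    Raises ValueError for anything else, so invalid input is rejected
    before it reaches an option that expects plain seconds.
    """
    match = _DURATION_RE.match(str(raw))
    if match is None:
        raise ValueError("Unsupported duration: %r" % (raw,))
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


if __name__ == "__main__":
    for value in ("60", "60s", "2m", "1h"):
        print(value, "->", to_seconds(value))
```

With a conversion like this applied only to options typed as seconds, a setting such as power_timeout=60s would reach the agent as 60.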
Types (second/integer) were added upstream: https://github.com/ClusterLabs/fence-agents/commit/e0fa4827b2ec931a182a3781cc2223c79cba2563
Looks good, thanks Marek.
See also http://oss.clusterlabs.org/pipermail/users/2017-July/006055.html
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1874