Bug 1377928

Summary: Inclusion of "s" to denote seconds in power_timeout attribute causes fence_ipmilan STONITH devices to fail.
Product: Red Hat Enterprise Linux 7
Reporter: Simon Thomson <simmotommo>
Component: fence-agents
Assignee: Marek Grac <mgrac>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: unspecified
Version: 7.2
CC: cluster-maint, jpokorny, mjuricek, oalbrigt
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: fence-agents-4.0.11-52.el7
Last Closed: 2017-08-01 16:10:32 UTC
Type: Bug
Bug Blocks: 1377970

Description Simon Thomson 2016-09-21 04:02:45 UTC
Description of problem:

Configuring a fence_ipmilan based STONITH device with a power_timeout value that includes an "s": 

power_timeout=60s

Will cause STONITH to fail:

pcs stonith fence node1
Error: unable to fence 'node1'
Command failed: No route to host

The following is present in the cluster DC logs:

Sep 20 18:04:58 node2 user.notice python:detected unhandled Python exception in '/usr/sbin/fence_ipmilan'
Sep 20 18:05:17 node2 user.notice python:detected unhandled Python exception in '/usr/sbin/fence_ipmilan'
Sep 20 18:05:35 node2 daemon.err stonith-ng[4525]:   error: Operation 'reboot' [32131] (call 2 from stonith_admin.31703) for host 'node2' with device 'fence_node2_ipmi' returned: -201 (Generic Pacemaker error)
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: db5 error(11) from dbenv->open: Resource temporarily unavailable ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: cannot open Packages index using db5 - Resource temporarily unavailable (11) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ error: cannot open Packages database in /var/lib/rpm ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ Traceback (most recent call last): ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/sbin/fence_ipmilan", line 186, in <module> ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     main() ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/sbin/fence_ipmilan", line 182, in main ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     result = fence_action(None, options, set_power_status, get_power_status, None, reboot_cycle) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/share/fence/fencing.py", line 964, in fence_action ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     status = get_multi_power_fn(tn, options, get_power_fn) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/share/fence/fencing.py", line 871, in get_multi_power_fn ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     plug_status = get_power_fn(tn, options) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/sbin/fence_ipmilan", line 17, in get_power_status ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     output = run_command(options, create_command(options, "status")) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [   File "/usr/share/fence/fencing.py", line 1183, in run_command ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [     timeout = float(timeout) ]
Sep 20 18:05:35 node2 daemon.warning stonith-ng[4525]: warning: fence_node1_ipmi:32131 [ ValueError: invalid literal for float(): 60s ]

Changing the STONITH configuration to remove the "s":

[root@node2 ~]# pcs stonith update fence_node1_ipmi power_timeout=60

Will allow STONITH operations to complete:

[root@node2 ~]# pcs stonith fence node1
Node: node1 fenced

I would think that the inclusion of "s" to denote seconds should not cause STONITH to fail.
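
The traceback above ends in a float() conversion of the raw option string. The following is a minimal sketch of that failure mode, assuming nothing beyond Python's built-in float(). The function names are hypothetical and are not part of fencing.py.

```python
def to_seconds_naive(value):
    # Mirrors the failing step in the traceback: a bare float() call
    # on the option string exactly as it was configured.
    return float(value)


def try_convert(value):
    # Hypothetical helper: report the outcome instead of letting the
    # exception escape as an unhandled traceback.
    try:
        return to_seconds_naive(value)
    except ValueError as exc:
        return "ValueError: %s" % exc


if __name__ == "__main__":
    print(try_convert("60"))   # plain number of seconds converts cleanly
    print(try_convert("60s"))  # trailing unit suffix is rejected by float()
```

This is consistent with the workaround: once the "s" is removed, the value converts and fencing proceeds.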

Version-Release number of selected component (if applicable):

[root@e7359svin1637 ~]# rpm -qa | grep fence-agents-ipmilan
fence-agents-ipmilan-4.0.11-27.el7.x86_64

How reproducible:

100%

A fence_ipmilan based STONITH device configured with a power_timeout attribute that includes an "s" to denote seconds will fail 100% of the time.

Steps to Reproduce:
1. Configure a fence_ipmilan based STONITH device including a power_timeout attribute like "power_timeout=60s"

2. Attempt to fence a cluster node using the STONITH device configured above:
# pcs stonith fence node1


Actual results:

Fencing fails with a "No route to host" error:
# pcs stonith fence node1
Error: unable to fence 'node1'
Command failed: No route to host

Expected results:

The node should be fenced:
# pcs stonith fence node1
Node: node1 fenced

Additional info:

None

Comment 2 Marek Grac 2016-09-21 07:09:00 UTC
Hi,

I agree that no Python exception should be visible to users, and we will fix that.

But suffixes like '[smh]' are not used in the cluster suite, so we will not support them. An appropriate error message should be displayed instead.
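
One way to read the "appropriate error message" approach is to validate the value up front and fail with a readable message rather than a traceback. The sketch below illustrates that idea only. It is not the change that shipped; parse_timeout_strict is a hypothetical name, and the code assumes Python 3 for math.isfinite.

```python
import math


def parse_timeout_strict(name, value):
    """Return value as a float number of seconds, or raise ValueError
    with a message that names the offending option."""
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        raise ValueError(
            "%s must be a plain number of seconds (for example 60), got %r"
            % (name, value)
        )
    # float() also accepts "nan" and "inf", which make no sense as timeouts.
    if not math.isfinite(seconds) or seconds < 0:
        raise ValueError(
            "%s must be a non-negative, finite number of seconds, got %r"
            % (name, value)
        )
    return seconds


if __name__ == "__main__":
    try:
        parse_timeout_strict("power_timeout", "60s")
    except ValueError as exc:
        # A single readable line instead of an unhandled exception.
        print("error: %s" % exc)
```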

Comment 3 Simon Thomson 2016-09-21 07:40:34 UTC
Hi Marek,

An informative error message or just accepting and discarding suffixes like '[smh]' if they are configured would be useful. Thanks.

There seems to be some inconsistency around the requirement/acceptance of these suffixes for time-based cluster suite parameters. In this case, using a suffix causes a failure. In others, the Red Hat High Availability documentation directs us to use a suffix.

e.g. From https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-fencedevicecreate-HAAR.html

"The following command creates a stonith device.

# pcs stonith create MyStonith fence_virt pcmk_host_list=f1 op monitor interval=30s"

e.g. From https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-resourceopts-HAAR.html

"In the following example, there is an existing resource named dummy_resource. This command sets the failure-timeout meta option to 20 seconds, so that the resource can attempt to restart on the same node in 20 seconds.
# pcs resource meta dummy_resource failure-timeout=20s"

If all time-based cluster suite parameters can only be configured in seconds, then a universal approach to accepting/rejecting/discarding suffixes like '[smh]' would make sense.

Comment 4 Marek Grac 2016-09-21 07:50:18 UTC
I understand your concerns and they make sense, even more so given the quotations from the documentation. Options like 'interval=30s' belong to pacemaker/pcs and are not used by the fence agent at all, so the transformation from [smh] to seconds should be done in pcs. I will add a 'seconds' type to the agent metadata so that pcs knows which options should be translated. Afterwards, [smh] should work as expected.
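
The division of labour described here, where the agent declares which options are durations and a front end converts suffixed values before passing them on, could be sketched as follows. Everything in the sketch is an assumption made for illustration: the OPTION_TYPES mapping, the function names, and the limited suffix set (only s, m and h, as mentioned in this report) do not reflect the actual metadata format or pcs code.

```python
import re

# Seconds per unit; an empty suffix means the value is already in seconds.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"^\s*(\d+)\s*([smh]?)\s*$")

# Hypothetical stand-in for per-option type information from agent metadata.
OPTION_TYPES = {"power_timeout": "second", "ipaddr": "string"}


def duration_to_seconds(text):
    """Convert '60', '60s', '2m' or '1h' to an integer number of seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("unrecognised duration: %r" % text)
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit]


def normalise_options(options, option_types=OPTION_TYPES):
    """Rewrite options typed 'second' as plain seconds; leave others alone."""
    result = {}
    for name, value in options.items():
        if option_types.get(name) == "second":
            result[name] = str(duration_to_seconds(value))
        else:
            result[name] = value
    return result


if __name__ == "__main__":
    print(normalise_options({"power_timeout": "60s", "ipaddr": "10.0.0.1"}))
```

With a conversion of this kind in the front end, the agent itself would only ever see a plain number of seconds.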

Comment 5 Marek Grac 2016-09-21 08:07:52 UTC
Types (second/integer) were added upstream.

https://github.com/ClusterLabs/fence-agents/commit/e0fa4827b2ec931a182a3781cc2223c79cba2563

Comment 6 Simon Thomson 2016-09-22 01:05:25 UTC
Looks good, thanks Marek.

Comment 10 Jan Pokorný [poki] 2017-07-14 15:03:47 UTC
See also http://oss.clusterlabs.org/pipermail/users/2017-July/006055.html

Comment 11 errata-xmlrpc 2017-08-01 16:10:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1874