Red Hat Bugzilla – Attachment 1451421 Details for
Bug 1427273 – Support planned changes in pacemaker failure handling
Attachment 1451421 [patch]: proposed fix + test
pcs-resource-failcount-overhaul.patch (text/plain), 54.10 KB, created by Tomas Jelinek on 2018-06-14 14:17:35 UTC

Description: proposed fix + test
Filename:    pcs-resource-failcount-overhaul.patch
MIME Type:   text/plain
Creator:     Tomas Jelinek
Created:     2018-06-14 14:17:35 UTC
Size:        54.10 KB
Flags:       patch, obsolete
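Background for the patch below: pacemaker 1.1.18 and newer tracks failures per operation, storing transient node attributes named `fail-count-<resource>[:<clone_id>]#<operation>_<interval>` (and matching `last-failure-...` entries). The sketch below is a standalone, simplified re-implementation of the patch's `_parse_failure_name` helper, written here only to illustrate the naming scheme; the authoritative code is in `pcs/lib/cib/status.py` in the patch.

```python
def parse_failure_name(name):
    """Split '<resource>[:<clone_id>][#<operation>_<interval>]' into parts.

    Resource ids cannot contain '#' or ':', so the first '#' separates the
    resource part from the operation part, and the first ':' separates the
    resource id from the clone instance id.
    """
    if "#" in name:
        resource_clone, operation_interval = name.split("#", 1)
    else:
        resource_clone, operation_interval = name, None
    if ":" in resource_clone:
        resource, clone_id = resource_clone.split(":", 1)
    else:
        resource, clone_id = resource_clone, None
    if operation_interval:
        # the interval is whatever follows the last underscore
        operation, interval = operation_interval.rsplit("_", 1)
    else:
        operation, interval = None, None
    return resource, clone_id, operation, interval
```

For example, the attribute name `fail-count-clone:0#start_0` yields resource `clone`, clone id `0`, operation `start`, interval `0`, which is exactly the breakdown the new `pcs resource failcount` filters operate on.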
From b5b1038b8d3d365e0e5139fd6ca79b1a9545f2a6 Mon Sep 17 00:00:00 2001
From: Tomas Jelinek <tojeline@redhat.com>
Date: Thu, 14 Jun 2018 15:37:49 +0200
Subject: [PATCH] 'pcs resource failcount' overhaul

* move it to the new architecture
* fix reading failcounts to support their new format which tracks
  failures per operation
* add options to filter failcounts by operation and its interval
---
 CHANGELOG.md                                      |   7 +
 pcs/cli/common/lib_wrapper.py                     |   1 +
 pcs/lib/cib/status.py                             | 105 ++++++
 pcs/lib/cib/test/test_status.py                   | 368 +++++++++++++++++++++
 pcs/lib/cib/tools.py                              |   9 +-
 pcs/lib/commands/resource.py                      |  52 ++-
 pcs/lib/commands/test/resource/test_failcounts.py | 164 ++++++++++
 pcs/lib/pacemaker/live.py                         |   8 +-
 pcs/pcs.8                                         |   8 +-
 pcs/resource.py                                   | 178 ++++++++---
 pcs/test/test_resource.py                         | 372 +++++++++++++++++++++-
 pcs/usage.py                                      |  21 +-
 12 files changed, 1223 insertions(+), 70 deletions(-)
 create mode 100644 pcs/lib/cib/status.py
 create mode 100644 pcs/lib/cib/test/test_status.py
 create mode 100644 pcs/lib/commands/test/resource/test_failcounts.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7d82b2a7..92824fbe 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,9 @@
 - Commands for listing and testing watchdog devices ([rhbz#1475318]).
 - Option for setting netmtu in `pcs cluster setup` command ([rhbz#1535967])
 - Validation for an unaccessible resource inside a bundle ([rhbz#1462248])
+- Options to display and filter failures by an operation and its interval in
+  `pcs resource failcount reset` and `pcs resource failcount show` commands
+  ([rhbz#1427273])
 
 ### Fixed
 - `pcs cib-push diff-against=` does not consider an empty diff as an error
@@ -20,12 +23,15 @@
 - Removing resources using web UI when the operation takes longer than expected
   ([rhbz#1579911])
 - Improve 'pcs quorum device add' usage and man page ([rhbz#1476862])
+- `pcs resource failcount show` works correctly with pacemaker-1.1.18 and newer
+  ([rhbz#1588667])
 
 ### Changed
 - Watchdog devices are validated against a list provided by sbd
   ([rhbz#1475318]).
 
 [ghpull#166]: https://github.com/ClusterLabs/pcs/pull/166
+[rhbz#1427273]: https://bugzilla.redhat.com/show_bug.cgi?id=1427273
 [rhbz#1462248]: https://bugzilla.redhat.com/show_bug.cgi?id=1462248
 [rhbz#1475318]: https://bugzilla.redhat.com/show_bug.cgi?id=1475318
 [rhbz#1476862]: https://bugzilla.redhat.com/show_bug.cgi?id=1476862
@@ -35,6 +41,7 @@
 [rhbz#1574898]: https://bugzilla.redhat.com/show_bug.cgi?id=1574898
 [rhbz#1579911]: https://bugzilla.redhat.com/show_bug.cgi?id=1579911
 [rhbz#1581150]: https://bugzilla.redhat.com/show_bug.cgi?id=1581150
+[rhbz#1588667]: https://bugzilla.redhat.com/show_bug.cgi?id=1588667
 
 
 ## [0.9.164] - 2018-04-09
diff --git a/pcs/cli/common/lib_wrapper.py b/pcs/cli/common/lib_wrapper.py
index 2e5997f6..fc43cb4a 100644
--- a/pcs/cli/common/lib_wrapper.py
+++ b/pcs/cli/common/lib_wrapper.py
@@ -344,6 +344,7 @@ def load_module(env, middleware_factory, name):
             "create_into_bundle": resource.create_into_bundle,
             "disable": resource.disable,
             "enable": resource.enable,
+            "get_failcounts": resource.get_failcounts,
             "manage": resource.manage,
             "unmanage": resource.unmanage,
         }
diff --git a/pcs/lib/cib/status.py b/pcs/lib/cib/status.py
new file mode 100644
index 00000000..11b83c1c
--- /dev/null
+++ b/pcs/lib/cib/status.py
@@ -0,0 +1,105 @@
+from __future__ import (
+    absolute_import,
+    division,
+    print_function,
+)
+
+def get_resources_failcounts(cib_status):
+    """
+    List all resources failcounts
+    Return a dict {
+        "node": string -- node name,
+        "resource": string -- resource id,
+        "clone_id": string -- resource clone id or None,
+        "operation": string -- operation name,
+        "interval": string -- operation interval,
+        "fail_count": "INFINITY" or int -- fail count,
+        "last_failure": int -- last failure timestamp,
+    }
+
+    etree cib_status -- status element of the CIB
+    """
+    failcounts = []
+    for node_state in cib_status.findall("node_state"):
+        node_name = node_state.get("uname")
+
+        # Pair fail-counts with last-failures.
+        # failures_info = {
+        #     failure_name: {"fail_count": count, "last-failure": timestamp}
+        # }
+        failures_info = {}
+        for nvpair in node_state.findall(
+            "transient_attributes/instance_attributes/nvpair"
+        ):
+            name = nvpair.get("name")
+            for part in ("fail-count-", "last-failure-"):
+                if name.startswith(part):
+                    failure_name = name[len(part):]
+                    if failure_name not in failures_info:
+                        failures_info[failure_name] = {}
+                    failures_info[failure_name][part[:-1]] = nvpair.get("value")
+                    break
+
+        if not failures_info:
+            continue
+        for failure_name, failure_data in failures_info.items():
+            resource, clone_id, operation, interval = _parse_failure_name(
+                failure_name
+            )
+            fail_count = failure_data.get("fail-count", "0").upper()
+            if fail_count != "INFINITY":
+                try:
+                    fail_count = int(fail_count)
+                except ValueError:
+                    # There are failures we just do not know how many. If we set
+                    # fail_count = 0, no failures would be recorded.
+                    fail_count = 1
+            try:
+                last_failure = int(failure_data.get("last-failure", "0"))
+            except ValueError:
+                last_failure = 0
+            failcounts.append({
+                "node": node_name,
+                "resource": resource,
+                "clone_id": clone_id,
+                "operation": operation,
+                "interval": interval,
+                "fail_count": fail_count,
+                "last_failure": last_failure,
+            })
+    return failcounts
+
+def _parse_failure_name(name):
+    # failure_name looks like this:
+    # <resource_name>[:<clone_id>][#<operation>_<interval>]
+    # resource name is an id so it cannot contain # nor :
+    if "#" in name:
+        resource_clone, operation_interval = name.split("#", 1)
+    else:
+        resource_clone, operation_interval = name, None
+    if ":" in resource_clone:
+        resource, clone = resource_clone.split(":", 1)
+    else:
+        resource, clone = resource_clone, None
+    if operation_interval:
+        operation, interval = operation_interval.rsplit("_", 1)
+    else:
+        operation, interval = None, None
+    return resource, clone, operation, interval
+
+def filter_resources_failcounts(
+    failcounts, resource=None, node=None, operation=None, interval=None
+):
+    return [
+        failure for failure in failcounts
+        if (
+            (node is None or failure["node"] == node)
+            and
+            (resource is None or failure["resource"] == resource)
+            and
+            (operation is None or failure["operation"] == operation)
+            and
+            # 5 != "5", failure["interval"] is a string already
+            (interval is None or failure["interval"] == str(interval))
+        )
+    ]
diff --git a/pcs/lib/cib/test/test_status.py b/pcs/lib/cib/test/test_status.py
new file mode 100644
index 00000000..66f44c3f
--- /dev/null
+++ b/pcs/lib/cib/test/test_status.py
@@ -0,0 +1,368 @@
+from __future__ import (
+    absolute_import,
+    division,
+    print_function,
+)
+
+from lxml import etree
+from unittest import TestCase
+
+from pcs.lib.cib import status
+
+
+class GetResourcesFailcounts(TestCase):
+    def test_no_failures(self):
+        status_xml = etree.fromstring("""
+            <status>
+                <node_state uname="node1">
+                    <transient_attributes>
+                        <instance_attributes>
+                        </instance_attributes>
+                    </transient_attributes>
+                </node_state>
+                <node_state uname="node2">
+                    <transient_attributes>
+                    </transient_attributes>
+                </node_state>
+                <node_state uname="node3">
+                </node_state>
+            </status>
+        """)
+        self.assertEqual(
+            status.get_resources_failcounts(status_xml),
+            []
+        )
+
+    def test_failures(self):
+        status_xml = etree.fromstring("""
+            <status>
+                <node_state uname="node1">
+                    <transient_attributes>
+                        <instance_attributes>
+                            <nvpair name="fail-count-clone:0#start_0"
+                                value="INFINITY"/>
+                            <nvpair name="last-failure-clone:0#start_0"
+                                value="1528871936"/>
+                            <nvpair name="fail-count-clone:1#start_0"
+                                value="999"/>
+                            <nvpair name="last-failure-clone:1#start_0"
+                                value="1528871937"/>
+                            <nvpair name="fail-count-clone:2"
+                                value="888"/>
+                            <nvpair name="last-failure-clone:2"
+                                value="1528871937"/>
+                        </instance_attributes>
+                    </transient_attributes>
+                </node_state>
+                <node_state uname="node2">
+                    <transient_attributes>
+                        <instance_attributes>
+                            <nvpair name="fail-count-resource#monitor_500"
+                                value="10"/>
+                            <nvpair name="last-failure-resource#monitor_500"
+                                value="1528871946"/>
+                            <nvpair name="fail-count-no-last#stop_0"
+                                value="3"/>
+                            <nvpair name="last-failure-no-count#monitor_1000"
+                                value="1528871956"/>
+                            <nvpair name="ignored-resource#monitor_1000"
+                                value="ignored"/>
+                            <nvpair name="fail-count-no-int#start_0"
+                                value="a few"/>
+                            <nvpair name="last-failure-no-int#start_0"
+                                value="an hour ago"/>
+                            <nvpair name="fail-count-no-op"
+                                value="42"/>
+                            <nvpair name="last-failure-no-op"
+                                value="1528871942"/>
+                        </instance_attributes>
+                    </transient_attributes>
+                </node_state>
+            </status>
+        """)
+        self.assertEqual(
+            sorted(
+                status.get_resources_failcounts(status_xml),
+                key=lambda x: [str(x[key]) for key in sorted(x.keys())]
+            ),
+            sorted([
+                {
+                    "node": "node1",
+                    "resource": "clone",
+                    "clone_id": "0",
+                    "operation": "start",
+                    "interval": "0",
+                    "fail_count": "INFINITY",
+                    "last_failure": 1528871936,
+                },
+                {
+                    "node": "node1",
+                    "resource": "clone",
+                    "clone_id": "1",
+                    "operation": "start",
+                    "interval": "0",
+                    "fail_count": 999,
+                    "last_failure": 1528871937,
+                },
+                {
+                    "node": "node1",
+                    "resource": "clone",
+                    "clone_id": "2",
+                    "operation": None,
+                    "interval": None,
+                    "fail_count": 888,
+                    "last_failure": 1528871937,
+                },
+                {
+                    "node": "node2",
+                    "resource": "resource",
+                    "clone_id": None,
+                    "operation": "monitor",
+                    "interval": "500",
+                    "fail_count": 10,
+                    "last_failure": 1528871946,
+                },
+                {
+                    "node": "node2",
+                    "resource": "no-last",
+                    "clone_id": None,
+                    "operation": "stop",
+                    "interval": "0",
+                    "fail_count": 3,
+                    "last_failure": 0,
+                },
+                {
+                    "node": "node2",
+                    "resource": "no-count",
+                    "clone_id": None,
+                    "operation": "monitor",
+                    "interval": "1000",
+                    "fail_count": 0,
+                    "last_failure": 1528871956,
+                },
+                {
+                    "node": "node2",
+                    "resource": "no-int",
+                    "clone_id": None,
+                    "operation": "start",
+                    "interval": "0",
+                    "fail_count": 1,
+                    "last_failure": 0,
+                },
+                {
+                    "node": "node2",
+                    "resource": "no-op",
+                    "clone_id": None,
+                    "operation": None,
+                    "interval": None,
+                    "fail_count": 42,
+                    "last_failure": 1528871942,
+                },
+                ],
+                key=lambda x: [str(x[key]) for key in sorted(x.keys())]
+            )
+        )
+
+class ParseFailureName(TestCase):
+    def test_resource(self):
+        self.assertEqual(
+            status._parse_failure_name("resource"),
+            ("resource", None, None, None)
+        )
+
+    def test_resource_clone_id(self):
+        self.assertEqual(
+            status._parse_failure_name("resource:1"),
+            ("resource", "1", None, None)
+        )
+
+    def test_resource_operation(self):
+        self.assertEqual(
+            status._parse_failure_name("resource#monitor_1000"),
+            ("resource", None, "monitor", "1000")
+        )
+
+    def test_resource_clone_id_operation(self):
+        self.assertEqual(
+            status._parse_failure_name("resource:2#monitor_1000"),
+            ("resource", "2", "monitor", "1000")
+        )
+
+class FilterResourceFailcounts(TestCase):
+    # pylint: disable=too-many-instance-attributes
+    def setUp(self):
+        self.fail_01 = {
+            "node": "nodeA",
+            "resource": "resourceA",
+            "clone_id": None,
+            "operation": "start",
+            "interval": "0",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_02 = {
+            "node": "nodeA",
+            "resource": "resourceB",
+            "clone_id": None,
+            "operation": "monitor",
+            "interval": "1000",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_03 = {
+            "node": "nodeB",
+            "resource": "resourceA",
+            "clone_id": None,
+            "operation": "monitor",
+            "interval": "1000",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_04 = {
+            "node": "nodeB",
+            "resource": "resourceB",
+            "clone_id": None,
+            "operation": "start",
+            "interval": "0",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_05 = {
+            "node": "nodeB",
+            "resource": "resourceA",
+            "clone_id": None,
+            "operation": "monitor",
+            "interval": "500",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_06 = {
+            "node": "nodeB",
+            "resource": "resourceA",
+            "clone_id": None,
+            "operation": "start",
+            "interval": "0",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_07 = {
+            "node": "nodeB",
+            "resource": "resourceB",
+            "clone_id": None,
+            "operation": "monitor",
+            "interval": "1000",
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.fail_08 = {
+            "node": "nodeA",
+            "resource": "resourceA",
+            "clone_id": None,
+            "operation": None,
+            "interval": None,
+            "fail_count": "INFINITY",
+            "last_failure": "100",
+        }
+        self.failures = [
+            self.fail_01, self.fail_02, self.fail_03, self.fail_04,
+            self.fail_05, self.fail_06, self.fail_07, self.fail_08,
+        ]
+
+    def test_no_filter(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(self.failures),
+            self.failures
+        )
+
+    def test_no_match(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, resource="resourceX"
+            ),
+            []
+        )
+
+    def test_filter_by_resource(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, resource="resourceA"
+            ),
+            [
+                self.fail_01, self.fail_03, self.fail_05, self.fail_06,
+                self.fail_08
+            ]
+        )
+
+    def test_filter_by_node(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, node="nodeA"
+            ),
+            [self.fail_01, self.fail_02, self.fail_08]
+        )
+
+    def test_filter_by_operation(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, operation="monitor"
+            ),
+            [self.fail_02, self.fail_03, self.fail_05, self.fail_07]
+        )
+
+    def test_filter_by_operation_and_interval(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, operation="monitor", interval="500"
+            ),
+            [self.fail_05]
+        )
+
+    def test_filter_by_operation_and_interval_int(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, operation="monitor", interval=500
+            ),
+            [self.fail_05]
+        )
+
+    def test_filter_by_resource_and_node(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, resource="resourceA", node="nodeB"
+            ),
+            [self.fail_03, self.fail_05, self.fail_06]
+        )
+
+    def test_filter_by_resource_and_node_and_operation(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, resource="resourceA", node="nodeB",
+                operation="monitor"
+            ),
+            [self.fail_03, self.fail_05]
+        )
+
+    def test_filter_by_resource_and_node_and_operation_and_interval(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, resource="resourceA", node="nodeB",
+                operation="monitor", interval="1000"
+            ),
+            [self.fail_03]
+        )
+
+    def test_filter_by_node_and_operation(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, node="nodeB", operation="monitor"
+            ),
+            [self.fail_03, self.fail_05, self.fail_07]
+        )
+
+    def test_filter_by_node_and_operation_and_interval(self):
+        self.assertEqual(
+            status.filter_resources_failcounts(
+                self.failures, node="nodeB", operation="monitor", interval="1000"
+            ),
+            [self.fail_03, self.fail_07]
+        )
diff --git a/pcs/lib/cib/tools.py b/pcs/lib/cib/tools.py
index 59724d43..2cff96f3 100644
--- a/pcs/lib/cib/tools.py
+++ b/pcs/lib/cib/tools.py
@@ -14,7 +14,7 @@ from pcs.lib.pacemaker.values import (
     sanitize_id,
     validate_id,
 )
-from pcs.lib.xml_tools import get_root
+from pcs.lib.xml_tools import get_root, get_sub_element
 
 _VERSION_FORMAT = r"(?P<major>\d+)\.(?P<minor>\d+)(\.(?P<rev>\d+))?"
 
@@ -247,6 +247,13 @@ def get_resources(tree):
     """
     return sections.get(tree, sections.RESOURCES)
 
+def get_status(tree):
+    """
+    Return the 'status' element from the tree
+    tree -- cib etree node
+    """
+    return get_sub_element(tree, "status")
+
 def _get_cib_version(cib, attribute, regexp, none_if_missing=False):
     version = cib.get(attribute)
     if version is None:
diff --git a/pcs/lib/commands/resource.py b/pcs/lib/commands/resource.py
index 93e541c0..5637113c 100644
--- a/pcs/lib/commands/resource.py
+++ b/pcs/lib/commands/resource.py
@@ -10,16 +10,23 @@ from functools import partial
 from pcs.common import report_codes
 from pcs.common.tools import Version
 from pcs.lib import reports
-from pcs.lib.cib import resource
+from pcs.lib.cib import (
+    resource,
+    status as cib_status,
+)
 from pcs.lib.cib.resource import operations, remote_node, guest_node
 from pcs.lib.cib.tools import (
     find_element_by_tag_and_id,
     get_resources,
+    get_status,
     IdProvider,
 )
 from pcs.lib.env_tools import get_nodes
 from pcs.lib.errors import LibraryError
-from pcs.lib.pacemaker.values import validate_id
+from pcs.lib.pacemaker.values import (
+    timeout_to_seconds,
+    validate_id,
+)
 from pcs.lib.pacemaker.state import (
     ensure_resource_state,
     info_resource_state,
@@ -29,6 +36,7 @@ from pcs.lib.pacemaker.state import (
 from pcs.lib.resource_agent import(
     find_valid_resource_agent_by_name as get_agent
 )
+from pcs.lib.validate import value_time_interval
 
 @contextmanager
 def resource_environment(
@@ -781,6 +789,46 @@ def manage(env, resource_ids, with_monitor=False):
 
     env.report_processor.process_list(report_list)
 
+def get_failcounts(
+    env, resource=None, node=None, operation=None, interval=None
+):
+    """
+    List resources failcounts, optionally filtered by a resource, node or op
+
+    LibraryEnvironment env
+    string resource -- show failcounts for the specified resource only
+    string node -- show failcounts for the specified node only
+    string operation -- show failcounts for the specified operation only
+    string interval -- show failcounts for the specified operation interval only
+    """
+    report_items = []
+    if interval is not None and operation is None:
+        report_items.append(
+            reports.prerequisite_option_is_missing("interval", "operation")
+        )
+    if interval is not None:
+        report_items.extend(
+            value_time_interval("interval")({"interval": interval})
+        )
+    if report_items:
+        raise LibraryError(*report_items)
+
+    interval_ms = (
+        None if interval is None
+        else timeout_to_seconds(interval) * 1000
+    )
+
+    all_failcounts = cib_status.get_resources_failcounts(
+        get_status(env.get_cib())
+    )
+    return cib_status.filter_resources_failcounts(
+        all_failcounts,
+        resource=resource,
+        node=node,
+        operation=operation,
+        interval=interval_ms
+    )
+
 def _find_resources_or_raise(
     resources_section, resource_ids, additional_search=None
 ):
diff --git a/pcs/lib/commands/test/resource/test_failcounts.py b/pcs/lib/commands/test/resource/test_failcounts.py
new file mode 100644
index 00000000..0a3437ba
--- /dev/null
+++ b/pcs/lib/commands/test/resource/test_failcounts.py
@@ -0,0 +1,164 @@
+from __future__ import (
+    absolute_import,
+    division,
+    print_function,
+)
+
+from unittest import TestCase
+
+from pcs.common import report_codes
+from pcs.lib.commands import resource
+from pcs.test.tools import fixture
+from pcs.test.tools.command_env import get_env_tools
+
+class GetFailcounts(TestCase):
+    def setUp(self):
+        self.env_assist, self.config = get_env_tools(test_case=self)
+
+    def fixture_cib(self):
+        return """
+            <cib>
+                <status>
+                    <node_state uname="node1">
+                        <transient_attributes>
+                            <instance_attributes>
+                                <nvpair name="fail-count-resource"
+                                    value="INFINITY"/>
+                                <nvpair name="last-failure-resource"
+                                    value="1528871936"/>
+                            </instance_attributes>
+                        </transient_attributes>
+                    </node_state>
+                    <node_state uname="node2">
+                        <transient_attributes>
+                            <instance_attributes>
+                                <nvpair name="fail-count-resource#monitor_5000"
+                                    value="10"/>
+                                <nvpair name="last-failure-resource#monitor_5000"
+                                    value="1528871946"/>
+                            </instance_attributes>
+                        </transient_attributes>
+                    </node_state>
+                </status>
+            </cib>
+        """
+
+    def test_operation_requires_interval(self):
+        self.env_assist.assert_raise_library_error(
+            lambda: resource.get_failcounts(
+                self.env_assist.get_env(), interval="1000"
+            ),
+            [
+                fixture.error(
+                    report_codes.PREREQUISITE_OPTION_IS_MISSING,
+                    option_name="interval",
+                    option_type="",
+                    prerequisite_name="operation",
+                    prerequisite_type=""
+                ),
+            ],
+            expected_in_processor=False
+        )
+
+    def test_bad_interval(self):
+        self.env_assist.assert_raise_library_error(
+            lambda: resource.get_failcounts(
+                self.env_assist.get_env(), operation="start", interval="often"
+            ),
+            [
+                fixture.error(
+                    report_codes.INVALID_OPTION_VALUE,
+                    option_name="interval",
+                    option_value="often",
+                    allowed_values="time interval (e.g. 1, 2s, 3m, 4h, ...)"
+                ),
+            ],
+            expected_in_processor=False
+        )
+
+    def test_all_validation_errors(self):
+        self.env_assist.assert_raise_library_error(
+            lambda: resource.get_failcounts(
+                self.env_assist.get_env(), interval="often"
+            ),
+            [
+                fixture.error(
+                    report_codes.PREREQUISITE_OPTION_IS_MISSING,
+                    option_name="interval",
+                    option_type="",
+                    prerequisite_name="operation",
+                    prerequisite_type=""
+                ),
+                fixture.error(
+                    report_codes.INVALID_OPTION_VALUE,
+                    option_name="interval",
+                    option_value="often",
+                    allowed_values="time interval (e.g. 1, 2s, 3m, 4h, ...)"
+                ),
+            ],
+            expected_in_processor=False
+        )
+
+    def test_get_all(self):
+        self.config.runner.cib.load_content(self.fixture_cib())
+        self.assertEqual(
+            resource.get_failcounts(self.env_assist.get_env()),
+            [
+                {
+                    "node": "node1",
+                    "resource": "resource",
+                    "clone_id": None,
+                    "operation": None,
+                    "interval": None,
+                    "fail_count": "INFINITY",
+                    "last_failure": 1528871936,
+                },
+                {
+                    "node": "node2",
+                    "resource": "resource",
+                    "clone_id": None,
+                    "operation": "monitor",
+                    "interval": "5000",
+                    "fail_count": 10,
+                    "last_failure": 1528871946,
+                },
+            ]
+        )
+
+    def test_filter_node(self):
+        self.config.runner.cib.load_content(self.fixture_cib())
+        self.assertEqual(
+            resource.get_failcounts(
+                self.env_assist.get_env(), node="node2"
+            ),
+            [
+                {
+                    "node": "node2",
+                    "resource": "resource",
+                    "clone_id": None,
+                    "operation": "monitor",
+                    "interval": "5000",
+                    "fail_count": 10,
+                    "last_failure": 1528871946,
+                },
+            ]
+        )
+
+    def test_filter_interval(self):
+        self.config.runner.cib.load_content(self.fixture_cib())
+        self.assertEqual(
+            resource.get_failcounts(
+                self.env_assist.get_env(), operation="monitor", interval="5"
+            ),
+            [
+                {
+                    "node": "node2",
+                    "resource": "resource",
+                    "clone_id": None,
+                    "operation": "monitor",
+                    "interval": "5000",
+                    "fail_count": 10,
+                    "last_failure": 1528871946,
+                },
+            ]
+        )
diff --git a/pcs/lib/pacemaker/live.py b/pcs/lib/pacemaker/live.py
index 2afc9d06..a2bc2e30 100644
--- a/pcs/lib/pacemaker/live.py
+++ b/pcs/lib/pacemaker/live.py
@@ -313,12 +313,18 @@ def remove_node(runner, node_name):
 
 ### resources
 
-def resource_cleanup(runner, resource=None, node=None):
+def resource_cleanup(
+    runner, resource=None, node=None, operation=None, interval=None
+):
     cmd = [__exec("crm_resource"), "--cleanup"]
     if resource:
         cmd.extend(["--resource", resource])
     if node:
         cmd.extend(["--node", node])
+    if operation:
+        cmd.extend(["--operation", operation])
+    if interval:
+        cmd.extend(["--interval", interval])
 
     stdout, stderr, retval = runner.run(cmd)
 
diff --git a/pcs/pcs.8 b/pcs/pcs.8
index 496a19b3..e6ad974b 100644
--- a/pcs/pcs.8
+++ b/pcs/pcs.8
@@ -183,11 +183,11 @@ Make the cluster forget failed operations from history of the resource and re\-d
 refresh [<resource id>] [\fB\-\-node\fR <node>] [\fB\-\-full\fR]
 Make the cluster forget the complete operation history (including failures) of the resource and re\-detect its current state. If you are interested in forgetting failed operations only, use the 'pcs resource cleanup' command. If a resource id is not specified then all resources / stonith devices will be refreshed. If a node is not specified then resources / stonith devices on all nodes will be refreshed. Use \fB\-\-full\fR to refresh a resource on all nodes, otherwise only nodes where the resource's state is known will be considered.
 .TP
-failcount show <resource id> [node]
-Show current failcount for specified resource from all nodes or only on specified node.
+failcount show [<resource id> [<node> [<operation> [<interval>]]]] [\fB\-\-full\fR]
+Show current failcount for resources, optionally filtered by a resource, node, operation and its interval. If \fB\-\-full\fR is specified do not sum failcounts per resource and node. Operation, interval and \fB\-\-full\fR options require pacemaker\-1.1.18 or newer.
 .TP
-failcount reset <resource id> [node]
-Reset failcount for specified resource on all nodes or only on specified node. This tells the cluster to forget how many times a resource has failed in the past. This may allow the resource to be started or moved to a more preferred location.
+failcount reset [<resource id> [<node> [<operation> [<interval>]]]]
+Reset failcount for specified resource on all nodes or only on specified node. This tells the cluster to forget how many times a resource has failed in the past. This may allow the resource to be started or moved to a more preferred location. Operation and interval options require pacemaker-1.1.18 or newer.
 .TP
 relocate dry\-run [resource1] [resource2] ...
 The same as 'relocate run' but has no effect on the cluster.
diff --git a/pcs/resource.py b/pcs/resource.py
index 502cd7ce..001bad50 100644
--- a/pcs/resource.py
+++ b/pcs/resource.py
@@ -148,7 +148,7 @@ def resource_cmd(argv):
     elif sub_cmd == "unmanage":
         resource_unmanage_cmd(lib, argv_next, modifiers)
     elif sub_cmd == "failcount":
-        resource_failcount(argv_next)
+        resource_failcount(lib, argv_next, modifiers)
     elif sub_cmd == "op":
         if len(argv_next) < 1:
             usage.resource(["op"])
@@ -2223,65 +2223,139 @@ def is_managed(resource_id):
                 return True
     utils.err("unable to find a resource/clone/master/group: %s" % resource_id)
 
-def resource_failcount(argv):
-    if len(argv) < 2:
-        usage.resource()
-        sys.exit(1)
+def resource_failcount(lib, argv, modifiers):
+    if len(argv) < 1:
+        raise CmdLineInputError()
 
-    resource_command = argv.pop(0)
-    resource = argv.pop(0)
-    if resource_command != "show" and resource_command != "reset":
-        usage.resource()
-        sys.exit(1)
+    command = argv.pop(0)
 
-    if len(argv) > 0:
-        node = argv.pop(0)
-        all_nodes = False
-    else:
-        all_nodes = True
+    resource = argv.pop(0) if argv else None
+    node = argv.pop(0) if argv else None
+    operation = argv.pop(0) if argv else None
+    interval = argv.pop(0) if argv else None
 
-    dom = utils.get_cib_dom()
-    output_dict = {}
-    trans_attrs = dom.getElementsByTagName("transient_attributes")
-    fail_counts_removed = 0
-    for ta in trans_attrs:
-        ta_node = ta.parentNode.getAttribute("uname")
-        if not all_nodes and ta_node != node:
-            continue
-        for nvp in ta.getElementsByTagName("nvpair"):
-            if nvp.getAttribute("name") == ("fail-count-" + resource):
-                if resource_command == "reset":
-                    (output, retval) = utils.run(["crm_attribute", "-N",
-                        ta_node, "-n", nvp.getAttribute("name"), "-t",
-                        "status", "-D"])
-                    if retval != 0:
-                        utils.err("Unable to remove failcounts from %s on %s\n" % (resource,ta_node) + output)
-                    fail_counts_removed = fail_counts_removed + 1
-                else:
-                    output_dict[ta_node] = " " + ta_node + ": " + nvp.getAttribute("value")
-                break
+    if command == "show":
+        print(resource_failcount_show(
+            lib, resource, node, operation, interval, modifiers["full"]
+        ))
+        return
 
-    if resource_command == "reset":
-        if fail_counts_removed == 0:
-            print("No failcounts needed resetting")
-    if resource_command == "show":
-        output = []
-        for key in sorted(output_dict.keys()):
-            output.append(output_dict[key])
+    if command == "reset":
+        print(lib_pacemaker.resource_cleanup(
+            utils.cmd_runner(),
+            resource=resource,
+            node=node,
+            operation=operation,
+            interval=interval
+        ))
+        return
 
+    raise CmdLineInputError()
 
-        if not output:
-            if all_nodes:
-                print("No failcounts for %s" % resource)
+def __agregate_failures(failure_list):
+    last_failure = 0
+    fail_count = 0
+    for failure in failure_list:
+        # infinity is a maximal value and cannot be increased
+        if fail_count != "INFINITY":
+            if failure["fail_count"] == "INFINITY":
+                fail_count = failure["fail_count"]
             else:
-                print("No failcounts for %s on %s" % (resource,node))
-        else:
-            if all_nodes:
-                print("Failcounts for %s" % resource)
-            else:
-                print("Failcounts for %s on %s" % (resource,node))
-            print("\n".join(output))
+                fail_count += failure["fail_count"]
+        last_failure = max(last_failure, failure["last_failure"])
+    return fail_count, last_failure
+
+def __headline_resource_failures(empty, resource, node, operation, interval):
+    headline_parts = []
+    if empty:
+        headline_parts.append("No failcounts")
+    else:
+        headline_parts.append("Failcounts")
+    if operation:
+        headline_parts.append("for operation '{operation}'")
+        if interval:
+            headline_parts.append("with interval '{interval}'")
+    if resource:
+        headline_parts.append("of" if operation else "for")
+        headline_parts.append("resource '{resource}'")
+    if node:
+        headline_parts.append("on node '{node}'")
+    return " ".join(headline_parts).format(
+        node=node, resource=resource, operation=operation,
+        interval=interval
+    )
+
+def resource_failcount_show(lib, resource, node, operation, interval, full):
+    result_lines = []
+    failures_data = lib.resource.get_failcounts(
+        resource=resource,
+        node=node,
+        operation=operation,
+        interval=interval
+    )
+
+    if not failures_data:
+        result_lines.append(__headline_resource_failures(
+            True, resource, node, operation, interval
+        ))
+        return "\n".join(result_lines)
 
+    resource_list = sorted(set([fail["resource"] for fail in failures_data]))
+    for current_resource in resource_list:
+        result_lines.append(__headline_resource_failures(
+            False, current_resource, node, operation, interval
+        ))
+        resource_failures = [
+            fail for fail in failures_data
+            if fail["resource"] == current_resource
+        ]
+        node_list = sorted(set([fail["node"] for fail in resource_failures]))
+        for current_node in node_list:
+            node_failures = [
+                fail for fail in resource_failures
+                if fail["node"] == current_node
+            ]
+            has_operation = True
+            for fail in node_failures:
+                if fail["operation"] is None or fail["interval"] is None:
+                    has_operation = False
+                    break
+            if full and has_operation:
+                result_lines.append("  {}:".format(current_node))
+                operation_list = sorted(set(
+                    [fail["operation"] for fail in node_failures]
+                ))
+                for current_operation in operation_list:
+                    operation_failures = [
+                        fail for fail in node_failures
+                        if fail["operation"] == current_operation
+                    ]
+                    interval_list = sorted(
+                        set([fail["interval"] for fail in operation_failures]),
+                        # pacemaker's definition of infinity
+                        key=lambda x: 1000000 if x == "INFINITY" else x
+                    )
+                    for current_interval in interval_list:
+                        interval_failures = [
+                            fail for fail in operation_failures
+                            if fail["interval"] == current_interval
+                        ]
+                        failcount, dummy_last_failure = __agregate_failures(
+                            interval_failures
+                        )
+                        result_lines.append(
+                            "    {0} {1}ms: {2}".format(
+                                current_operation, current_interval, failcount
+                            )
+                        )
+            else:
+                failcount, dummy_last_failure = __agregate_failures(
+                    node_failures
+                )
+                result_lines.append(
+                    "  {0}: {1}".format(current_node, failcount)
+                )
+    return "\n".join(result_lines)
 
 def show_defaults(def_type, indent=""):
     dom = utils.get_cib_dom()
diff --git a/pcs/test/test_resource.py b/pcs/test/test_resource.py
index 1b6013ca..59432999 100644
--- a/pcs/test/test_resource.py
+++ b/pcs/test/test_resource.py
@@ -6,6 +6,7 @@ from __future__ import (
 
 from lxml import etree
 import re
+from random import shuffle
 import shutil
 from textwrap import dedent
 
@@ -15,7 +16,7 @@ from pcs.test.tools.assertions import (
     AssertPcsMixin,
 )
 from pcs.test.tools.cib import get_assert_pcs_effect_mixin
-from pcs.test.tools.pcs_unittest import TestCase
+from pcs.test.tools.pcs_unittest import mock, TestCase
 from pcs.test.tools.misc import (
     get_test_resource as rc,
     outdent,
@@ -5214,3 +5215,372 @@ class ResourceUpdateSpcialChecks(unittest.TestCase, AssertPcsMixin):
             "Warning: this command is not sufficient for removing a guest node,"
             " use 'pcs cluster node remove-guest'\n"
         )
+
+class FailcountShow(TestCase):
+    def setUp(self):
+        self.lib = mock.Mock(spec_set=["resource"])
+        self.resource = mock.Mock(spec_set=["get_failcounts"])
+        self.get_failcounts = mock.Mock()
+        self.lib.resource = self.resource
+        self.lib.resource.get_failcounts = self.get_failcounts
+
+    def assert_failcount_output(
+        self, lib_failures, expected_output, resource_id=None, node=None,
+        operation=None, interval=None, full=False
+    ):
+        self.get_failcounts.return_value = lib_failures
+        ac(
+            resource.resource_failcount_show(
+                self.lib, resource_id, node, operation, interval, full
+            ),
+            expected_output
+        )
+
+    def fixture_failures_no_op(self):
+        failures = [
+            {
+                "node": "node1",
+                "resource": "clone",
+                "clone_id": "0",
+                "operation": None,
+                "interval": None,
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node1",
+                "resource": "clone",
+                "clone_id": "1",
+                "operation": None,
+                "interval": None,
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node2",
+                "resource": "clone",
+                "clone_id": "0",
+                "operation": None,
+                "interval": None,
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node2",
+                "resource": "clone",
+                "clone_id": "1",
+                "operation": None,
+                "interval": None,
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node1",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": None,
+                "interval": None,
+                "fail_count": 100,
+                "last_failure": 1528871966,
+            },
+            {
+                "node": "node1",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": None,
+                "interval": None,
+                "fail_count": "INFINITY",
+                "last_failure": 1528871966,
+            },
+            {
+                "node": "node2",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": None,
+                "interval": None,
+                "fail_count": 10,
+                "last_failure": 1528871946,
+            },
+            {
+                "node": "node2",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": None,
+                "interval": None,
+                "fail_count": 150,
+                "last_failure": 1528871956,
+            },
+        ]
+        shuffle(failures)
+        return failures
+
+    def fixture_failures_monitor(self):
+        failures = [
+            {
+                "node": "node2",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": "monitor",
+                "interval": "500",
+                "fail_count": 10,
+                "last_failure": 1528871946,
+            },
+            {
+                "node": "node2",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": "monitor",
+                "interval": "1500",
+                "fail_count": 150,
+                "last_failure": 1528871956,
+            },
+            {
+                "node": "node1",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": "monitor",
+                "interval": "1500",
+                "fail_count": 25,
+                "last_failure": 1528871966,
+            },
+        ]
+        shuffle(failures)
+        return failures
+
+    def fixture_failures(self):
+        failures = self.fixture_failures_monitor() + [
+            {
+                "node": "node1",
+                "resource": "clone",
+                "clone_id": "0",
+                "operation": "start",
+                "interval": "0",
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node1",
+                "resource": "clone",
+                "clone_id": "1",
+                "operation": "start",
+                "interval": "0",
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node2",
+                "resource": "clone",
+                "clone_id": "0",
+                "operation": "start",
+                "interval": "0",
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node2",
+                "resource": "clone",
+                "clone_id": "1",
+                "operation": "start",
+                "interval": "0",
+                "fail_count": "INFINITY",
+                "last_failure": 1528871936,
+            },
+            {
+                "node": "node1",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": "start",
+                "interval": "0",
+                "fail_count": 100,
+                "last_failure": 1528871966,
+            },
+            {
+                "node": "node1",
+                "resource": "resource",
+                "clone_id": None,
+                "operation": "start",
+                "interval": "0",
+                "fail_count": "INFINITY",
+                "last_failure": 1528871966,
+            },
+        ]
+        shuffle(failures)
+        return failures
+
+    def test_no_failcounts(self):
self.assert_failcount_output( >+ [], >+ "No failcounts" >+ ) >+ >+ def test_no_failcounts_resource(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for resource 'res'", >+ resource_id="res" >+ ) >+ >+ def test_no_failcounts_node(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts on node 'nod'", >+ node="nod" >+ ) >+ >+ def test_no_failcounts_operation(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope'", >+ operation="ope" >+ ) >+ >+ def test_no_failcounts_operation_interval(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' with interval '10'", >+ operation="ope", >+ interval="10" >+ ) >+ >+ def test_no_failcounts_resource_node(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for resource 'res' on node 'nod'", >+ resource_id="res", >+ node="nod" >+ ) >+ >+ def test_no_failcounts_resource_operation(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' of resource 'res'", >+ resource_id="res", >+ operation="ope", >+ ) >+ >+ def test_no_failcounts_resource_operation_interval(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' with interval '10' of resource " >+ "'res'", >+ resource_id="res", >+ operation="ope", >+ interval="10" >+ ) >+ >+ def test_no_failcounts_resource_node_operation_interval(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' with interval '10' of resource " >+ "'res' on node 'nod'", >+ resource_id="res", >+ node="nod", >+ operation="ope", >+ interval="10" >+ ) >+ >+ def test_no_failcounts_node_operation(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' on node 'nod'", >+ node="nod", >+ operation="ope", >+ ) >+ >+ def test_no_failcounts_node_operation_interval(self): >+ self.assert_failcount_output( >+ [], >+ "No failcounts for operation 'ope' with interval '10' on node 'nod'", >+ node="nod", >+ 
operation="ope", >+ interval="10" >+ ) >+ >+ def test_failcounts_short(self): >+ self.assert_failcount_output( >+ self.fixture_failures(), >+ dedent("""\ >+ Failcounts for resource 'clone' >+ node1: INFINITY >+ node2: INFINITY >+ Failcounts for resource 'resource' >+ node1: INFINITY >+ node2: 160""" >+ ), >+ full=False >+ ) >+ >+ def test_failcounts_full(self): >+ self.assert_failcount_output( >+ self.fixture_failures(), >+ dedent("""\ >+ Failcounts for resource 'clone' >+ node1: >+ start 0ms: INFINITY >+ node2: >+ start 0ms: INFINITY >+ Failcounts for resource 'resource' >+ node1: >+ monitor 1500ms: 25 >+ start 0ms: INFINITY >+ node2: >+ monitor 1500ms: 150 >+ monitor 500ms: 10""" >+ ), >+ full=True >+ ) >+ >+ def test_failcounts_short_filter(self): >+ self.assert_failcount_output( >+ self.fixture_failures_monitor(), >+ dedent("""\ >+ Failcounts for operation 'monitor' of resource 'resource' >+ node1: 25 >+ node2: 160""" >+ ), >+ operation="monitor", >+ full=False >+ ) >+ >+ def test_failcounts_full_filter(self): >+ self.assert_failcount_output( >+ self.fixture_failures_monitor(), >+ dedent("""\ >+ Failcounts for operation 'monitor' of resource 'resource' >+ node1: >+ monitor 1500ms: 25 >+ node2: >+ monitor 1500ms: 150 >+ monitor 500ms: 10""" >+ ), >+ operation="monitor", >+ full=True >+ ) >+ >+ def test_failcounts_no_op_short(self): >+ self.assert_failcount_output( >+ self.fixture_failures_no_op(), >+ dedent("""\ >+ Failcounts for resource 'clone' >+ node1: INFINITY >+ node2: INFINITY >+ Failcounts for resource 'resource' >+ node1: INFINITY >+ node2: 160""" >+ ), >+ full=False >+ ) >+ >+ def test_failcounts_no_op_full(self): >+ self.assert_failcount_output( >+ self.fixture_failures_no_op(), >+ dedent("""\ >+ Failcounts for resource 'clone' >+ node1: INFINITY >+ node2: INFINITY >+ Failcounts for resource 'resource' >+ node1: INFINITY >+ node2: 160""" >+ ), >+ full=True >+ ) >diff --git a/pcs/usage.py b/pcs/usage.py >index 5e0ac976..1f93213d 100644 >--- 
a/pcs/usage.py >+++ b/pcs/usage.py >@@ -491,15 +491,18 @@ Commands: > to refresh a resource on all nodes, otherwise only nodes where the > resource's state is known will be considered. > >- failcount show <resource id> [node] >- Show current failcount for specified resource from all nodes or >- only on specified node. >- >- failcount reset <resource id> [node] >- Reset failcount for specified resource on all nodes or only on >- specified node. This tells the cluster to forget how many times >- a resource has failed in the past. This may allow the resource to >- be started or moved to a more preferred location. >+ failcount show [<resource id> [<node> [<operation> [<interval>]]]] [--full] >+ Show current failcount for resources, optionally filtered by a resource, >+ node, operation and its interval. If --full is specified do not sum >+ failcounts per resource and node. Operation, interval and --full >+ options require pacemaker-1.1.18 or newer. >+ >+ failcount reset [<resource id> [<node> [<operation> [interval=<interval>]]]] >+ Reset failcount for specified resource on all nodes or only on specified >+ node. This tells the cluster to forget how many times a resource has >+ failed in the past. This may allow the resource to be started or moved >+ to a more preferred location. Operation and interval options require >+ pacemaker-1.1.18 or newer. > > relocate dry-run [resource1] [resource2] ... > The same as 'relocate run' but has no effect on the cluster. >-- >2.11.0 >
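For readers skimming the diff, the core aggregation rule behind `resource_failcount_show` (sum the per-operation fail counts, with pacemaker's `INFINITY` saturating the total) can be sketched in isolation. This is a minimal illustration with a hypothetical helper name, not the patch's actual `__agregate_failures` implementation; the record shape follows the test fixtures in the patch, and the INFINITY-saturation rule is an assumption inferred from the expected test output (e.g. `node1: INFINITY`).

```python
def aggregate_failures(failures):
    """Sum fail counts across failure records; "INFINITY" saturates the sum.

    Each record mirrors the fixtures in the patch's tests:
    {"fail_count": <int or "INFINITY">, "last_failure": <unix timestamp>}
    """
    fail_count = 0
    last_failure = 0
    for failure in failures:
        # Assumption: any INFINITY fail count makes the aggregate INFINITY,
        # matching the short-format test expectations in the patch.
        if failure["fail_count"] == "INFINITY" or fail_count == "INFINITY":
            fail_count = "INFINITY"
        else:
            fail_count += failure["fail_count"]
        last_failure = max(last_failure, failure["last_failure"])
    return fail_count, last_failure


print(aggregate_failures([
    {"fail_count": 10, "last_failure": 1528871946},
    {"fail_count": 150, "last_failure": 1528871956},
]))  # -> (160, 1528871956)
```

This reproduces the `node2: 160` totals in the short-format tests above (10 + 150 across the two monitor intervals).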