Bug 1609453 - Failed monitor with on-fail=block outside of resource group causes restart of resources inside of resource group
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-28 01:56 UTC by Ondrej Faměra
Modified: 2022-03-13 15:18 UTC
CC List: 5 users

Fixed In Version: pacemaker-1.1.20-3.el7
Doc Type: Bug Fix
Doc Text:
This is not worth a release note, but for reference:
Cause: Pacemaker applied an ordering intended for restarts equally to group member stops.
Consequence: When a resource has a constraint dependency on a member of a group, and an action with on-fail=block, later members of that group could be unnecessarily restarted when the action fails.
Fix: The restart-specific handling is now applied only to restarts.
Result: Resources are not unnecessarily stopped.
Clone Of:
Environment:
Last Closed: 2019-08-06 12:53:44 UTC
Target Upstream Version:
Embargoed:


Attachments
data from reproducer (322.23 KB, application/x-gzip)
2018-07-28 01:56 UTC, Ondrej Faměra


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3545981 0 None None None 2018-08-06 17:32:08 UTC
Red Hat Product Errata RHBA-2019:2129 0 None None None 2019-08-06 12:54:11 UTC

Description Ondrej Faměra 2018-07-28 01:56:17 UTC
Created attachment 1471183 [details]
data from reproducer

== Description of problem:
We have a resource outside of a resource group (outside_resource) whose monitor operation uses "on-fail=block", and an ordering constraint under which this outside_resource depends on a non-last resource in the resource group (inside_resource_2). When the monitor operation on outside_resource fails, it gets blocked as expected, but inside_resource_3 also gets restarted.

 outside_resource       (ocf::pacemaker:Dummy): FAILED fastvm-rhel-7-5-73 (blocked)
 Resource Group: grp
     inside_resource_2  (ocf::pacemaker:Dummy): Started fastvm-rhel-7-5-74
     inside_resource_3  (ocf::pacemaker:Dummy): Started fastvm-rhel-7-5-74 <---- was restarted when outside_resource 'monitor' failed

 Ordering Constraints:
   start inside_resource_2 then start outside_resource (kind:Mandatory)

== Version-Release number of selected component (if applicable):
Reproduced in latest RHEL 7.5:
  pacemaker-cli-1.1.18-11.el7_5.3.x86_64
  pacemaker-1.1.18-11.el7_5.3.x86_64
  pacemaker-libs-1.1.18-11.el7_5.3.x86_64
  pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64

Observed in support case with following versions:
  pacemaker-1.1.13-10.el7.x86_64
  pacemaker-cli-1.1.13-10.el7.x86_64
  pacemaker-cluster-libs-1.1.13-10.el7.x86_64
  pacemaker-libs-1.1.13-10.el7.x86_64

== How reproducible:
Always

== Steps to Reproduce:
1. Create resource outside of resource group with "op monitor on-fail=block"
  # pcs resource create outside_resource ocf:pacemaker:Dummy op monitor interval=10 on-fail=block
  Create at least 2 resources in a resource group
  # pcs resource create inside_resource_2 ocf:pacemaker:Dummy --group grp
  # pcs resource create inside_resource_3 ocf:pacemaker:Dummy --group grp
  And add a dependency of outside_resource on a non-last resource in the resource group
  # pcs constraint order inside_resource_2 then outside_resource
2. Let everything start up and then fail the 'monitor' operation on outside_resource
  (on node where outside_resource is running)
  # rm /run/Dummy-outside_resource.state

== Actual results:
Cluster detected failed 'monitor' operation on outside_resource and:
- blocked the outside_resource
- restarted resource inside_resource_3

== Expected results:
Cluster detects failed 'monitor' operation on outside_resource and:
- blocks the outside_resource

== Additional info:
The currently identified workaround is to place the order dependency not on a resource inside the resource group, but on the resource group itself.
For the cluster shown here, that means the following constraint.

  # pcs constraint order grp then outside_resource

 Ordering Constraints:
   start grp then start outside_resource (kind:Mandatory)

In that case inside_resource_3 is NOT restarted when the outside_resource 'monitor' fails.

Attached to this BZ is the crm_report from the 2-node test cluster where this was reproduced - check procedure.txt for the command outputs and the times when the issue occurred.

The issue also doesn't appear when "on-fail=block" is not used. Another group depending on the 'grp' group is not affected - its resources don't restart.

Comment 2 Ken Gaillot 2018-08-14 17:37:15 UTC
I have confirmed this is a bug in pacemaker's policy engine, triggered when a resource with on-fail=block is ordered after a group member. Further investigation will be needed to come up with a fix.

Comment 3 Patrik Hagara 2018-12-06 12:51:31 UTC
qa_ack+

When a resource group consisting of at least two resources and an outside, non-grouped resource with the "on-fail=block" attribute are configured with an ordering constraint that starts the outside resource after the first grouped resource, and the monitor operation for the outside resource fails, then the failed resource MUST be blocked and the cluster MUST NOT restart any of the grouped resources.

Comment 4 Ken Gaillot 2019-04-01 16:17:21 UTC
Fixed upstream in the 2.0 branch (for RHEL 8) by commit a6af4a0, and in the 1.1 branch (for RHEL 7) by commit 8cfe743.

This turned out to be extremely tricky to diagnose and extremely simple to fix.

The issue had to do with pacemaker's scheduler using the same code in the handling of two implicit orderings: "stop a resource before starting it" (i.e. restarts) and "stop a later group member before stopping an earlier group member" (i.e. group stop ordering). There was one condition in that handling that should apply only to the restart situation, so the fix was to apply that condition only when the "then" action is a start.
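
For illustration only, here is a minimal, self-contained C sketch of the decision described above. All names (model_action_t, restart_condition_applies) are made up for this model; they are not the actual pacemaker scheduler symbols and this is not the real patch. It only models "apply the restart-specific condition when the 'then' action is a start":

/* Minimal, self-contained model of the fix (illustrative names only; these
 * are NOT the actual pacemaker scheduler symbols or the real patch). */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *rsc;   /* resource the action belongs to */
    const char *task;  /* "start" or "stop" */
} model_action_t;

/* The scheduler routed two implicit orderings through the same handling:
 *   1. restarts:            stop R             then  start R
 *   2. group stop ordering: stop later member  then  stop earlier member
 * One condition in that handling is restart-specific.  Before the fix it was
 * evaluated for both orderings, which in the reported scenario led to
 * inside_resource_3 being restarted when outside_resource's monitor failed
 * with on-fail=block.  The fix gates the condition on the "then" action
 * being a start, so group stop ordering is no longer affected. */
static bool
restart_condition_applies(const model_action_t *then, bool fixed)
{
    if (fixed) {
        return strcmp(then->task, "start") == 0;
    }
    return true;  /* pre-fix behaviour: applied to every ordering */
}

int main(void)
{
    /* restart ordering: the "then" action is a start */
    model_action_t restart_then = { "outside_resource", "start" };
    /* group stop ordering: the "then" action is a stop of the earlier member */
    model_action_t group_then = { "inside_resource_2", "stop" };

    printf("restart ordering:    before fix %d, after fix %d\n",
           restart_condition_applies(&restart_then, false),
           restart_condition_applies(&restart_then, true));  /* 1, 1 */
    printf("group stop ordering: before fix %d, after fix %d\n",
           restart_condition_applies(&group_then, false),
           restart_condition_applies(&group_then, true));    /* 1, 0 */
    return 0;
}

Compiled and run, the model shows the condition still applying to restarts but no longer to the group stop ordering, which matches the verified behaviour in comment 6.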

Comment 6 Michal Mazourek 2019-06-14 15:25:06 UTC
BEFORE (pacemaker-1.1.18-11.el7_5.3)
======

## Create resource outside of resource group with "op monitor on-fail=block"
# pcs resource create outside_resource ocf:pacemaker:Dummy op monitor interval=10 on-fail=block

## Create 2 resources in resource group
# pcs resource create inside_resource_2 ocf:pacemaker:Dummy --group grp
# pcs resource create inside_resource_3 ocf:pacemaker:Dummy --group grp

## Add dependency of outside_resource on non-last resource in the resource group
# pcs constraint order inside_resource_2 then outside_resource

## Snippet from cib to check resource key
# pcs cluster cib | grep inside_resource_3
...
<lrm_rsc_op id="inside_resource_3_last_0" operation_key="inside_resource_3_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="46:7:0:5b36875d-28f7-41de-8cb8-c8d08b70a7af" transition-magic="0:0;46:7:0:5b36875d-28f7-41de-8cb8-c8d08b70a7af"
...

## Fail the monitor operation on outside_resource
# rm /run/Dummy-outside_resource.state

## outside_resource is blocked
# pcs status | grep outside_resource
  outside_resource	(ocf::pacemaker:Dummy):	FAILED virt-041 (blocked)
  * outside_resource_monitor_10000 on virt-041 'not running' (7): call=42, status=complete, exitreason='',

## inside_resource_3 was restarted, resource keys are different
# pcs cluster cib | grep inside_resource_3
...
<lrm_rsc_op id="inside_resource_3_last_0" operation_key="inside_resource_3_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="44:10:0:5b36875d-28f7-41de-8cb8-c8d08b70a7af" transition-magic="0:0;44:10:0:5b36875d-28f7-41de-8cb8-c8d08b70a7af"
...

AFTER (pacemaker-1.1.20-5.el7) 
=====

## Create resource outside of resource group with "op monitor on-fail=block"
# pcs resource create outside_resource ocf:pacemaker:Dummy op monitor interval=10 on-fail=block

## Create 2 resources in resource group
# pcs resource create inside_resource_2 ocf:pacemaker:Dummy --group grp
# pcs resource create inside_resource_3 ocf:pacemaker:Dummy --group grp

## Add dependency of outside_resource on non-last resource in the resource group
# pcs constraint order inside_resource_2 then outside_resource

## Snippet from cib to check resource key
# pcs cluster cib | grep inside_resource_3
...
<lrm_rsc_op id="inside_resource_3_last_0" operation_key="inside_resource_3_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="45:6:0:82b64d0b-c15d-476b-a084-019f06c2ec21" transition-magic="0:0;45:6:0:82b64d0b-c15d-476b-a084-019f06c2ec21"
...

## Fail the monitor operation on outside_resource
# rm /run/Dummy-outside_resource.state

## outside_resource is blocked
# pcs status | grep outside_resource
 outside_resource	(ocf::pacemaker:Dummy):	FAILED virt-012 (blocked)
* outside_resource_monitor_10000 on virt-012 'not running' (7): call=42, status=complete, exitreason='',

## inside_resource_3 wasn't restarted, resource keys are the same
# pcs cluster cib | grep inside_resource_3
...
<lrm_rsc_op id="inside_resource_3_last_0" operation_key="inside_resource_3_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="45:6:0:82b64d0b-c15d-476b-a084-019f06c2ec21" transition-magic="0:0;45:6:0:82b64d0b-c15d-476b-a084-019f06c2ec21"
...

RESULT
======
Before the fix, the cluster detected the failed monitor operation on outside_resource and, in addition to blocking outside_resource, also restarted inside_resource_3. Now the failed resource is blocked and the cluster doesn't restart any of the grouped resources.

Verified for pacemaker-1.1.20-5.el7

Comment 8 errata-xmlrpc 2019-08-06 12:53:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129

