1570130 – Pending notify actions are recorded as complete

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1570130 - Pending notify actions are recorded as complete

Summary: Pending notify actions are recorded as complete

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.5
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	7.6
Assignee:	Ken Gaillot
QA Contact:	Patrik Hagara
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1570618
TreeView+	depends on / blocked

Reported:	2018-04-20 16:50 UTC by Ken Gaillot
Modified:	2018-10-30 07:59 UTC (History)
CC List:	8 users (show)
Fixed In Version:	pacemaker-1.1.19-2.el7
Doc Type:	Bug Fix
Doc Text:	Cause: In order to ignore the result of clone notify actions, Pacemaker always recorded them as successful. Consequence: When record-pending is true (which is the default as of 7.5), pending clone notifications would appear to be successfully completed, and dependent actions could proceed before the notify action had completed. Fix: Pacemaker only records completed, not pending, clone notify actions as successful. Result: The correct ordering of actions is preserved even if record-pending is true.
Clone Of:
Clones:	1570618 (view as bug list)
Environment:
Last Closed:	2018-10-30 07:57:56 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Newer version of ocf:pacemaker:Stateful that supports a notify delay (5.37 KB, application/x-shellscript) 2018-04-20 16:54 UTC, Ken Gaillot	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:3055	0	None	None	None	2018-10-30 07:59:04 UTC

Description Ken Gaillot 2018-04-20 16:50:43 UTC

Description of problem: Pacemaker ignores the result of clone notification actions. However it does this by recording every notify action as successfully completed. When record-pending is true (which is the default as of 7.5), this means the notification is treated as complete as soon as it is initiated, allowing dependent actions to proceed before the notification has actually completed, and causing the actual notification result to restart the transition as an "old event".


Version-Release number of selected component (if applicable): 7.5


How reproducible: Reliably


Steps to Reproduce:
1. Configure a cluster of at least two nodes.
2. Install the attached "notify-test" resource agent in /usr/lib/ocf/resource.d/pacemaker/ on all nodes. (This is a variant of ocf:pacemaker:Stateful that can make the notify action take longer.)
3. Configure a resource:
pcs resource create test ocf:pacemaker:notify-test notify_delay=4 clone notify=true
4. Start the cluster and wait for everything to start.
5. Check the system log for "Old event" messages from the crmd since the cluster start.

Actual results: Before the fix, there will be "Old event" messages.


Expected results: After the fix, there will be no "Old event" messages.

Comment 2 Ken Gaillot 2018-04-20 16:54:01 UTC

Created attachment 1424619 [details]
Newer version of ocf:pacemaker:Stateful that supports a notify delay

Comment 3 Ken Gaillot 2018-04-21 03:47:35 UTC

fixed upstream by commit 8e9a9d1a

Comment 6 Ken Gaillot 2018-07-06 02:22:37 UTC

The original fix for this turned out to be incomplete, so the latest build incorporates a complete fix.

The original fix, recording pending clone notifications as pending in the CIB, solved the immediate issue, but left the pending entries semi-permanently in the CIB, because the cluster does not record clone notification completions in the CIB.

The latest build avoids recording pending clone notifications in the CIB at all.

QA: The reproducer does not change. If desired, you can check that the live CIB has no lrm_rsc_op entries with operation="notify", neither while the notification is in progress nor when it has completed.

Comment 7 Patrik Hagara 2018-08-01 09:35:09 UTC

before:
=======

> [root@virt-264 ~]# rpm -q pacemaker
> pacemaker-1.1.18-11.el7.x86_64
> [root@virt-264 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-264 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test 
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-265:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-264 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-264 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-265.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 31 14:58:04 2018
> Last change: Tue Jul 31 14:58:00 2018 by root via cibadmin on virt-264.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-264 ~]# grep "Old event" /var/log/cluster/corosync.log               
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# ssh virt-265 'grep "Old event" /var/log/cluster/corosync.log'
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_pre_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_pre_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:08 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> [root@virt-264 ~]# echo $?
> 0
> [root@virt-264 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 4


Before the fix, the post-start notify transition is aborted on the DC with an "Old event" message in its logs. Also, live CIB contains notify operations for the resource.


after:
======

> [root@virt-265 ~]# rpm -q pacemaker
> pacemaker-1.1.19-4.el7.x86_64
> [root@virt-265 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-265 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-267:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-265 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-265 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-266.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-4.el7-c3c624ea3d) - partition with quorum
> Last updated: Wed Aug  1 11:30:23 2018
> Last change: Wed Aug  1 11:30:17 2018 by root via cibadmin on virt-265.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-265 ~]# grep "Old event" /var/log/cluster/corosync.log
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log' 
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-267 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-265 ~]# echo $?                                                      
> 1
> [root@virt-265 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 0

After the fix, there is no "Old event" log message on any of the cluster nodes and no notify operation in the CIB. Marking verified.

Comment 9 errata-xmlrpc 2018-10-30 07:57:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055

Note You need to log in before you can comment on or make changes to this bug.