Bug 1570130 - Pending notify actions are recorded as complete
Summary: Pending notify actions are recorded as complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.5
Hardware: All
OS: All
urgent
urgent
Target Milestone: rc
: 7.6
Assignee: Ken Gaillot
QA Contact: Patrik Hagara
URL:
Whiteboard:
Depends On:
Blocks: 1570618
TreeView+ depends on / blocked
 
Reported: 2018-04-20 16:50 UTC by Ken Gaillot
Modified: 2018-10-30 07:59 UTC (History)
8 users (show)

Fixed In Version: pacemaker-1.1.19-2.el7
Doc Type: Bug Fix
Doc Text:
Cause: In order to ignore the result of clone notify actions, Pacemaker always recorded them as successful. Consequence: When record-pending is true (which is the default as of 7.5), pending clone notifications would appear to be successfully completed, and dependent actions could proceed before the notify action had completed. Fix: Pacemaker only records completed, not pending, clone notify actions as successful. Result: The correct ordering of actions is preserved even if record-pending is true.
Clone Of:
: 1570618 (view as bug list)
Environment:
Last Closed: 2018-10-30 07:57:56 UTC
Target Upstream Version:


Attachments (Terms of Use)
Newer version of ocf:pacemaker:Stateful that supports a notify delay (5.37 KB, application/x-shellscript)
2018-04-20 16:54 UTC, Ken Gaillot
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3055 0 None None None 2018-10-30 07:59:04 UTC

Description Ken Gaillot 2018-04-20 16:50:43 UTC
Description of problem: Pacemaker ignores the result of clone notification actions. However it does this by recording every notify action as successfully completed. When record-pending is true (which is the default as of 7.5), this means the notification is treated as complete as soon as it is initiated, allowing dependent actions to proceed before the notification has actually completed, and causing the actual notification result to restart the transition as an "old event".


Version-Release number of selected component (if applicable): 7.5


How reproducible: Reliably


Steps to Reproduce:
1. Configure a cluster of at least two nodes.
2. Install the attached "notify-test" resource agent in /usr/lib/ocf/resource.d/pacemaker/ on all nodes. (This is a variant of ocf:pacemaker:Stateful that can make the notify action take longer.)
3. Configure a resource:
pcs resource create test ocf:pacemaker:notify-test notify_delay=4 clone notify=true
4. Start the cluster and wait for everything to start.
5. Check the system log for "Old event" messages from the crmd since the cluster start.

Actual results: Before the fix, there will be "Old event" messages.


Expected results: After the fix, there will be no "Old event" messages.

Comment 2 Ken Gaillot 2018-04-20 16:54:01 UTC
Created attachment 1424619 [details]
Newer version of ocf:pacemaker:Stateful that supports a notify delay

Comment 3 Ken Gaillot 2018-04-21 03:47:35 UTC
fixed upstream by commit 8e9a9d1a

Comment 6 Ken Gaillot 2018-07-06 02:22:37 UTC
The original fix for this turned out to be incomplete, so the latest build incorporates a complete fix.

The original fix, recording pending clone notifications as pending in the CIB, solved the immediate issue, but left the pending entries semi-permanently in the CIB, because the cluster does not record clone notification completions in the CIB.

The latest build avoids recording pending clone notifications in the CIB at all.

QA: The reproducer does not change. If desired, you can check that the live CIB has no lrm_rsc_op entries with operation="notify", neither while the notification is in progress nor when it has completed.

Comment 7 Patrik Hagara 2018-08-01 09:35:09 UTC
before:
=======

> [root@virt-264 ~]# rpm -q pacemaker
> pacemaker-1.1.18-11.el7.x86_64
> [root@virt-264 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-264 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test 
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-265:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-264 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-264 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-265.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 31 14:58:04 2018
> Last change: Tue Jul 31 14:58:00 2018 by root via cibadmin on virt-264.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-264 ~]# grep "Old event" /var/log/cluster/corosync.log               
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# ssh virt-265 'grep "Old event" /var/log/cluster/corosync.log'
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_pre_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_pre_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:08 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> [root@virt-264 ~]# echo $?
> 0
> [root@virt-264 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 4


Before the fix, the post-start notify transition is aborted on the DC with an "Old event" message in its logs. Also, live CIB contains notify operations for the resource.


after:
======

> [root@virt-265 ~]# rpm -q pacemaker
> pacemaker-1.1.19-4.el7.x86_64
> [root@virt-265 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-265 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-267:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-265 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-265 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-266.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-4.el7-c3c624ea3d) - partition with quorum
> Last updated: Wed Aug  1 11:30:23 2018
> Last change: Wed Aug  1 11:30:17 2018 by root via cibadmin on virt-265.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-265 ~]# grep "Old event" /var/log/cluster/corosync.log
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log' 
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-267 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-265 ~]# echo $?                                                      
> 1
> [root@virt-265 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 0

After the fix, there is no "Old event" log message on any of the cluster nodes and no notify operation in the CIB. Marking verified.

Comment 9 errata-xmlrpc 2018-10-30 07:57:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055


Note You need to log in before you can comment on or make changes to this bug.