Bug 1570130

Summary: Pending notify actions are recorded as complete
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.5
Hardware: All
OS: All
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: Patrik Hagara <phagara>
CC: abeekhof, aherr, amit.banerjee, chjones, cluster-maint, dkinkead, mnovacek, phagara
Keywords: ZStream
Target Milestone: rc
Target Release: 7.6
Fixed In Version: pacemaker-1.1.19-2.el7
Doc Type: Bug Fix
Doc Text:
Cause: In order to ignore the result of clone notify actions, Pacemaker always recorded them as successful.
Consequence: When record-pending is true (which is the default as of 7.5), pending clone notifications would appear to be successfully completed, and dependent actions could proceed before the notify action had completed.
Fix: Pacemaker only records completed, not pending, clone notify actions as successful.
Result: The correct ordering of actions is preserved even if record-pending is true.
Clones: 1570618 (view as bug list)
Last Closed: 2018-10-30 07:57:56 UTC
Type: Bug
Bug Blocks: 1570618    
Attachments:
  Newer version of ocf:pacemaker:Stateful that supports a notify delay (Flags: none)

Description Ken Gaillot 2018-04-20 16:50:43 UTC
Description of problem: Pacemaker ignores the result of clone notification actions. However, it does this by recording every notify action as successfully completed. When record-pending is true (which is the default as of 7.5), the notification is treated as complete as soon as it is initiated, allowing dependent actions to proceed before the notification has actually completed, and causing the actual notification result to restart the transition as an "old event".
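
For reference, record-pending is an operation option that can be inspected or overridden cluster-wide through operation defaults. The commands below are only an illustration of the option involved, not part of the reproducer or the fix; syntax is per the RHEL 7 pcs tooling and should be verified against your installed version. Since the description says the problem only manifests when record-pending is true, disabling it would at most mask the symptom:

    # Show the current operation defaults (record-pending defaults to true as of 7.5)
    pcs resource op defaults
    # Temporary mitigation only, not the fix: stop recording pending operations
    pcs resource op defaults record-pending=false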


Version-Release number of selected component (if applicable): 7.5


How reproducible: Reliably


Steps to Reproduce:
1. Configure a cluster of at least two nodes.
2. Install the attached "notify-test" resource agent in /usr/lib/ocf/resource.d/pacemaker/ on all nodes. (This is a variant of ocf:pacemaker:Stateful that can make the notify action take longer; a rough sketch of the delay mechanism follows these steps.)
3. Configure a resource:
pcs resource create test ocf:pacemaker:notify-test notify_delay=4 clone notify=true
4. Start the cluster and wait for everything to start.
5. Check the system log for "Old event" messages from the crmd since the cluster start.
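
For orientation only, the following sketch shows how such a delay can be implemented in an agent's notify action. The attached notify-test agent is the authoritative version; the function name here is illustrative:

    # Illustrative sketch only -- use the attached notify-test agent for testing.
    # OCF resource parameters arrive as OCF_RESKEY_<name> environment variables.
    notify_test_notify() {
        local delay="${OCF_RESKEY_notify_delay:-0}"
        if [ "$delay" -gt 0 ]; then
            sleep "$delay"    # make the notify action take longer, per notify_delay
        fi
        return $OCF_SUCCESS
    }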

Actual results: Before the fix, there will be "Old event" messages.


Expected results: After the fix, there will be no "Old event" messages.

Comment 2 Ken Gaillot 2018-04-20 16:54:01 UTC
Created attachment 1424619 [details]
Newer version of ocf:pacemaker:Stateful that supports a notify delay

Comment 3 Ken Gaillot 2018-04-21 03:47:35 UTC
Fixed upstream by commit 8e9a9d1a
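
To review the change locally (assuming a clone of the upstream ClusterLabs pacemaker repository):

    git clone https://github.com/ClusterLabs/pacemaker.git
    cd pacemaker
    git show 8e9a9d1a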

Comment 6 Ken Gaillot 2018-07-06 02:22:37 UTC
The original fix for this turned out to be incomplete, so the latest build incorporates a complete fix.

The original fix, recording pending clone notifications as pending in the CIB, solved the immediate issue, but left the pending entries semi-permanently in the CIB, because the cluster does not record clone notification completions in the CIB.

The latest build avoids recording pending clone notifications in the CIB at all.

QA: The reproducer does not change. If desired, you can check that the live CIB contains no lrm_rsc_op entries with operation="notify", either while the notification is in progress or after it has completed.
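
One way to perform that check, assuming xmllint from libxml2 is available and the resource is named test as in the reproducer above (adjust the id to match your resource name; expected output is 0 after the fix):

    pcs cluster cib | xmllint --xpath 'count(//lrm_resource[@id="test"]/lrm_rsc_op[@operation="notify"])' -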

Comment 7 Patrik Hagara 2018-08-01 09:35:09 UTC
before:
=======

> [root@virt-264 ~]# rpm -q pacemaker
> pacemaker-1.1.18-11.el7.x86_64
> [root@virt-264 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-264 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test 
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-265:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-264 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-264 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-264 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-265.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 31 14:58:04 2018
> Last change: Tue Jul 31 14:58:00 2018 by root via cibadmin on virt-264.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-264.cluster-qe.lab.eng.brq.redhat.com virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-264 ~]# grep "Old event" /var/log/cluster/corosync.log               
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# ssh virt-265 'grep "Old event" /var/log/cluster/corosync.log'
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_pre_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_pre_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:04 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> Jul 31 14:58:08 [15956] virt-265.cluster-qe.lab.eng.brq.redhat.com       crmd:     info: abort_transition_graph:	Transition aborted by lrm_rsc_op.bz1570130_post_notify_start_0: Old event | cib=0.0.0 source=process_graph_event:498 path=/create_request_adv/crm_xml/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='bz1570130']/lrm_rsc_op[@id='bz1570130_post_notify_start_0'] complete=true
> [root@virt-264 ~]# echo $?
> 0
> [root@virt-264 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-264 ~]# echo $?
> 1
> [root@virt-264 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 4


Before the fix, the post-start notify transition is aborted on the DC with an "Old event" message in its logs. Also, the live CIB contains notify operations for the resource.


after:
======

> [root@virt-265 ~]# rpm -q pacemaker
> pacemaker-1.1.19-4.el7.x86_64
> [root@virt-265 ~]# wget -qO /usr/lib/ocf/resource.d/pacemaker/notify-test https://bugzilla.redhat.com/attachment.cgi?id=1424619
> [root@virt-265 ~]# chmod 755 /usr/lib/ocf/resource.d/pacemaker/notify-test
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-266:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# scp /usr/lib/ocf/resource.d/pacemaker/notify-test virt-267:/usr/lib/ocf/resource.d/pacemaker/
> [root@virt-265 ~]# pcs resource create bz1570130 ocf:pacemaker:notify-test notify_delay=4
> Warning: changing a monitor operation interval from 10 to 11 to make the operation unique
> [root@virt-265 ~]# pcs resource clone bz1570130 notify=true
> [root@virt-265 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-266.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-4.el7-c3c624ea3d) - partition with quorum
> Last updated: Wed Aug  1 11:30:23 2018
> Last change: Wed Aug  1 11:30:17 2018 by root via cibadmin on virt-265.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 3 resources configured
> 
> Online: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  Clone Set: bz1570130-clone [bz1570130]
>      Started: [ virt-265.cluster-qe.lab.eng.brq.redhat.com virt-266.cluster-qe.lab.eng.brq.redhat.com virt-267.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-265 ~]# grep "Old event" /var/log/cluster/corosync.log
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-266 'grep "Old event" /var/log/cluster/corosync.log' 
> [root@virt-265 ~]# echo $?
> 1
> [root@virt-265 ~]# ssh virt-267 'grep "Old event" /var/log/cluster/corosync.log'
> [root@virt-265 ~]# echo $?                                                      
> 1
> [root@virt-265 ~]# pcs cluster cib | gxpp -x '/cib/status/node_state/lrm/lrm_resources/lrm_resource[@id="bz1570130"]/lrm_rsc_op[@operation="notify"]' | wc -l
> 0

After the fix, there is no "Old event" log message on any of the cluster nodes and no notify operation in the CIB. Marking verified.

Comment 9 errata-xmlrpc 2018-10-30 07:57:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055