Bug 1704870

| Summary: | A pacemaker_remoted node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Shane Bradley <sbradley> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.6 | CC: | abeekhof, aherr, cfeist, cluster-maint, ctowsley, jruemker, kgaillot, lmiccini, phagara, pkomarov, rcadova, sbradley, tkajinam, ykulkarn |
| Target Milestone: | pre-dev-freeze | Keywords: | ZStream |
| Target Release: | 7.8 | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.21-1.el7 | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | : | 1721198 1732335 (view as bug list) |
| Environment: | | Last Closed: | 2020-03-31 19:41:51 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Bug Depends On: | | Bug Blocks: | 1703946, 1721198, 1729984, 1732335 |

Doc Text:

Cause: Pacemaker implicitly ordered all stops needed on a Pacemaker Remote node before the stop of the node's Pacemaker Remote connection, including stops that were implied by fencing of the node. Also, Pacemaker scheduled actions on Pacemaker Remote nodes with a failed connection so that the actions could be performed once the connection was recovered, even if the connection was not being recovered (for example, if the node was shutting down when the failure occurred).

Consequence: If a Pacemaker Remote node needed to be fenced while it was in the process of shutting down, Pacemaker scheduled probes on the node once the fencing completed. The probes failed because the connection was not actually active. Due to the failed probe, a stop was scheduled, which also failed, leading to fencing of the node again, and the situation repeated indefinitely.

Fix: Pacemaker Remote connection stops are no longer ordered after implied stops, and actions are not scheduled on Pacemaker Remote nodes when the connection has failed and is not being started again.

Result: A Pacemaker Remote node that needs to be fenced while it is in the process of shutting down is fenced only once, without the fencing repeating indefinitely.
Comment 8
Ken Gaillot
2019-06-03 22:30:25 UTC
This has been fixed in the upstream 1.1 branch by these commits:

c20f8920 - don't order implied stops relative to a remote connection
938e99f2 - remote state is failed if node is shutting down with connection failure

QA: Reproducer is:

1. Configure a cluster with a bare metal Pacemaker Remote node whose connection resource has a reconnect_interval set (the OpenStack setup does this for compute nodes), and a resource that will always run on the remote node (OpenStack has the compute-unfence-trigger clone).

2. pacemaker-remoted's shutdown must take long enough to be interrupted by the power cut. The OpenStack setup seems to do this reliably, but one way to ensure it is to create a dummy resource with a 5-second delay constrained to run on the remote node (pacemaker-remoted will wait for the dummy resource to stop before completing shutdown).

3. Start a graceful shutdown of the Pacemaker Remote daemon, and before it can complete, hard power down the node. The easiest way to do this is to leave acpid enabled in the remote node OS and configure IPMI to send the power-cut signal to the OS and then cut power a few seconds later; pressing the power button (or cutting power from the IPMI interface, if it is configured to behave similarly) will then cause acpid to notify systemd, which will tell pacemaker-remoted to stop.

Before: problems such as fencing loops of the remote node, and failures with return code 189 reported on the remote node.

After: the cluster recovers from the power cut.

QA: Updating comment 13, here is the reproducer with more detail:

1. Configure a cluster with a bare metal Pacemaker Remote node whose connection resource has a reconnect_interval set, and a resource that will always run on the remote node. The cluster-recheck-interval should be the same as or lower than the reconnect_interval. (The OpenStack setup does this for compute nodes, with reconnect_interval and cluster-recheck-interval of 60s, and the compute-unfence-trigger clone running on all compute nodes.)

2. pacemaker-remoted's shutdown must take long enough to be interrupted by the power cut. The OpenStack setup seems to do this reliably, but one way to ensure it is to create an ocf:pacemaker:Dummy resource with an op_sleep delay (long enough to manually see the resource is stopping and then kill the node), constrained to run on the remote node (pacemaker-remoted will wait for the dummy resource to stop before completing shutdown).

3. Start a graceful shutdown of the Pacemaker Remote daemon, and before it can complete, hard power down the node. For a physical node, you can leave acpid enabled in the remote node OS, so that pressing and holding the power button makes acpid notify systemd, which will tell pacemaker-remoted to stop, and then the power will drop. For a virtual node, you can run "systemctl stop pacemaker_remote" on the virtual node, then virsh destroy it from the host. In either case, you can confirm in pcs status that the delayed resource is "Stopping" when the node is lost.

4. After some time, the remote connection will be detected as lost. Before the fix, there will be recovery problems such as fencing loops of the remote node, and the logs will show actions scheduled for the remote node repeatedly failing with return code 189 (these failures do not show up in pcs status). After the fix, the remote node will be fenced once and the cluster will recover properly. There may be rc=189 entries associated with the initial connection loss, but none afterward.
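As an illustration of step 1 only (the resource name and address below are placeholders, not taken from this bug, and the exact pcs syntax may vary by version), the remote connection's reconnect_interval and a matching cluster-recheck-interval could be set along these lines:

#sketch: create a remote connection resource with reconnect_interval, and match cluster-recheck-interval
[root@controller-0 ~]# pcs resource create <remote-node> ocf:pacemaker:remote server=<remote-node-address> reconnect_interval=60s
[root@controller-0 ~]# pcs property set cluster-recheck-interval=60s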
Verified:

#check hotfix is in place:
[stack@undercloud-0 ~]$ ansible overcloud -mshell -b -a'rpm -qa|grep pacemaker'
[WARNING]: Found both group and host with same name: undercloud
[WARNING]: Consider using the yum, dnf or zypper module rather than running rpm. If you need to use command because yum, dnf or zypper is insufficient you
can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
compute-0 | SUCCESS | rc=0 >>
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-libs-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.5.0.1.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.5.0.1.x86_64
compute-1 | SUCCESS | rc=0 >>
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-libs-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.5.0.1.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.5.0.1.x86_64
controller-0 | SUCCESS | rc=0 >>
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-libs-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.5.0.1.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.5.0.1.x86_64
controller-2 | SUCCESS | rc=0 >>
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-libs-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.5.0.1.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.5.0.1.x86_64
controller-1 | SUCCESS | rc=0 >>
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-libs-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-remote-1.1.19-8.el7_6.5.0.1.x86_64
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch
pacemaker-cli-1.1.19-8.el7_6.5.0.1.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.5.0.1.x86_64
#create a resource with an operation delay:
[root@compute-1 ~]# pcs config |grep -A 1 test_resource
Resource: test_resource (class=ocf provider=pacemaker type=Dummy)
Attributes: op_sleep=10
Operations: migrate_from interval=0s timeout=20 (test_resource-migrate_from-interval-0s)
migrate_to interval=0s timeout=20 (test_resource-migrate_to-interval-0s)
monitor interval=10 timeout=20 (test_resource-monitor-interval-10)
reload interval=0s timeout=20 (test_resource-reload-interval-0s)
start interval=0s timeout=20 (test_resource-start-interval-0s)
stop interval=0s timeout=20 (test_resource-stop-interval-0s)
--
Resource: test_resource
Enabled on: compute-1 (score:INFINITY) (id:location-test_resource-compute-1-INFINITY)
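For reference, a resource and constraint like the ones shown above could be created with pcs roughly as follows (a sketch only; pcs syntax may differ slightly between versions):

[root@controller-0 ~]# pcs resource create test_resource ocf:pacemaker:Dummy op_sleep=10
[root@controller-0 ~]# pcs constraint location test_resource prefers compute-1=INFINITY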
#test_resource is started on compute-1:
pcs status:
[...]
test_resource (ocf::pacemaker:Dummy): Started compute-1
#verification steps:
#current date:
Thu Jul 18 11:05:15 UTC 2019
#on the remote node we run:
[root@compute-1 ~]# date && systemctl stop pacemaker_remote
#then, before the cluster can respond, we run:
[root@hac-01 ~]# virsh destroy compute-1
Domain compute-1 destroyed
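Between the two commands above, one way to confirm that the delayed resource is still mid-stop when the node is killed would be the following (a sketch, run from any cluster node; test_resource is the dummy resource configured earlier):

[root@controller-0 ~]# pcs status resources | grep test_resource
#expected to show test_resource as Stopping just before the virsh destroy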
#now in the logs we see the cluster responding with rc=189, but only initially...
Jul 18 11:10:26 [904700] controller-2 cib: info: cib_perform_op: + /cib: @num_updates=9
Jul 18 11:10:26 [904700] controller-2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='compute-1']: <transient_attributes id="compute-1"/>
Jul 18 11:10:26 [904700] controller-2 cib: info: cib_perform_op: ++ <instance_attributes id="status-compute-1">
Jul 18 11:10:26 [904700] controller-2 cib: info: cib_perform_op: ++ <nvpair id="status-compute-1-shutdown" name="shutdown" value="1563448226"/>
Jul 18 11:11:18 [904700] controller-2 cib: info: cib_perform_op: + /cib/status/node_state[@id='compute-1']/lrm[@id='compute-1']/lrm_resources/lrm_resource[@id='test_resource']/lrm_rsc_op[@id='test_resource_last_0']: @transition-magic=4:189;197:38:0:5d309b52-17a3-44eb-b037-124924b092ac, @call-id=208, @rc-code=189, @op-status=4
Jul 18 11:11:18 [904703] controller-2 attrd: info: write_attribute: Sent update 24 with 1 changes for fail-count-compute-1#monitor_60000, id=<n/a>, set=(null)
Jul 18 11:11:18 [904700] controller-2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='compute-1']/lrm[@id='compute-1']/lrm_resources/lrm_resource[@id='test_resource']: <lrm_rsc_op id="test_resource_last_failure_0" operation_key="test_resource_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="197:38:0:5d309b52-17a3-44eb-b037-124924b092ac" transition-magic="4:189;197:38:0:5d309b52-17a3-44eb-b037-124924b092ac" exit-reason="" on_node="
Jul 18 11:11:18 [904703] controller-2 attrd: info: attrd_peer_update: Setting last-failure-compute-1#monitor_60000[controller-0]: (null) -> 1563448278 from controller-2
Jul 18 11:11:18 [904703] controller-2 attrd: info: write_attribute: Sent update 25 with 1 changes for last-failure-compute-1#monitor_60000, id=<n/a>, set=(null)
Jul 18 11:11:18 [904700] controller-2 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=controller-0/crmd/231, version=0.98.12)
Jul 18 11:11:18 [904700] controller-2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/attrd/24)
Jul 18 11:11:18 [904700] controller-2 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/attrd/25)
Jul 18 11:11:18 [904705] controller-2 crmd: warning: status_from_rc: Action 197 (test_resource_stop_0) on compute-1 failed (target: 0 vs. rc: 189): Error
Jul 18 11:11:18 [904705] controller-2 crmd: info: abort_transition_graph: Transition aborted by operation test_resource_stop_0 'modify' on controller-0: Event failed | magic=4:189;197:38:0:5d309b52-17a3-44eb-b037-124924b092ac cib=0.98.12 source=match_graph_event:308 complete=false
Jul 18 11:11:18 [904705] controller-2 crmd: info: match_graph_event: Action test_resource_stop_0 (197) confirmed on compute-1 (rc=189)
Jul 18 11:11:18 [904705] controller-2 crmd: info: update_failcount: Updating failcount for test_resource on compute-1 after failed stop: rc=189 (update=INFINITY, time=1563448278)
Jul 18 11:11:18 [904705] controller-2 crmd: info: process_graph_event: Detected action (38.197) test_resource_stop_0.208=unknown: failed
#a couple of minutes later there are no new rc=189 entries in the log (only the initial ones from 11:11:18):
[root@controller-2 cluster]# date
Thu Jul 18 11:16:12 UTC 2019
[root@controller-2 cluster]# grep 'rc=189' corosync.log
Jul 18 11:11:18 [904705] controller-2 crmd: info: match_graph_event: Action test_resource_stop_0 (197) confirmed on compute-1 (rc=189)
Jul 18 11:11:18 [904705] controller-2 crmd: info: update_failcount: Updating failcount for test_resource on compute-1 after failed stop: rc=189 (update=INFINITY, time=1563448278)
Jul 18 11:11:18 [904704] controller-2 pengine: warning: unpack_rsc_op_failure: Processing failed stop of test_resource on compute-1: unknown | rc=189
Jul 18 11:11:18 [904704] controller-2 pengine: warning: unpack_rsc_op_failure: Processing failed stop of test_resource on compute-1: unknown | rc=189
Jul 18 11:11:18 [904704] controller-2 pengine: warning: unpack_rsc_op_failure: Processing failed stop of test_resource on compute-1: unknown | rc=189
Jul 18 11:11:18 [904704] controller-2 pengine: warning: unpack_rsc_op_failure: Processing failed stop of test_resource on compute-1: unknown | rc=189
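A simple way to confirm that no new rc=189 failures accumulate after the initial connection loss is to repeat the same grep later and check that the match count does not grow (a sketch):

[root@controller-2 cluster]# grep -c 'rc=189' corosync.log
#the count should stay the same on subsequent runs; any growth would indicate the pre-fix probe/fencing loop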
Verified, details in https://bugzilla.redhat.com/show_bug.cgi?id=1704870#c27

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1032