Bug 1734066 - A pacemaker_remoted node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189" [rhel-8.0.0.z]
Summary: A pacemaker_remoted node fails monitor (probe) and stop/start operations on ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Ken Gaillot
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-29 14:43 UTC by Oneata Mircea Teodor
Modified: 2022-09-13 15:41 UTC
CC List: 4 users

Fixed In Version: pacemaker-2.0.1-4.el8_0.4
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker implicitly ordered all stops needed on a Pacemaker Remote node before the stop of that node's Pacemaker Remote connection, including stops implied by fencing of the node. Pacemaker also scheduled actions on Pacemaker Remote nodes with a failed connection so that the actions could be performed once the connection was recovered, even if the connection was not being recovered (for example, if the node was shutting down when the failure occurred).
Consequence: If a Pacemaker Remote node needed to be fenced while it was in the process of shutting down, Pacemaker scheduled probes on the node once the fencing completed. The probes failed because the connection was not actually active; each failed probe led to a scheduled stop, which also failed, causing the node to be fenced again, and the cycle repeated indefinitely.
Fix: Pacemaker Remote connection stops are no longer ordered after implied stops, and actions are not scheduled on Pacemaker Remote nodes when the connection has failed and is not being started again.
Result: A Pacemaker Remote node that needs to be fenced while it is shutting down is fenced once, without the fencing repeating indefinitely.
Clone Of:
Environment:
Last Closed: 2019-09-11 09:34:17 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4056221 0 None None None 2022-09-13 15:41:45 UTC
Red Hat Product Errata RHBA-2019:2719 0 None None None 2019-09-11 09:34:18 UTC

Description Oneata Mircea Teodor 2019-07-29 14:43:21 UTC
This bug has been copied from bug #1721198 and has been proposed for backporting to the 8.0.0 z-stream.
Devel and QA acks are needed for full approval of the z-stream clone.

Comment 1 Ken Gaillot 2019-08-05 22:43:48 UTC
QA: Test procedure from original Bug 1704870:

1. Configure a cluster with a bare-metal Pacemaker Remote node whose connection resource has reconnect_interval set, and a resource that will always run on the remote node. The cluster-recheck-interval should be the same as or lower than the reconnect_interval. (The OpenStack setup does this for compute nodes, with reconnect_interval and cluster-recheck-interval of 60s, and the compute-unfence-trigger clone running on all compute nodes.) Example commands for steps 1-3 are sketched below the numbered steps.

2. pacemaker-remoted's shutdown must take long enough to be interrupted by the power cut. The OpenStack setup seems to do this reliably, but one way to ensure it is to create an ocf:pacemaker:Dummy resource with op_sleep set (long enough that you can see the resource stopping and then kill the node), constrained to run on the remote node (pacemaker-remoted will wait for the Dummy resource to stop before completing its shutdown).

3. Start a graceful shutdown of the Pacemaker Remote daemon and, before it can complete, hard power down the node. For a physical node, you can leave acpid enabled in the remote node's OS, so that pressing and holding the power button makes acpid notify systemd, which tells pacemaker-remoted to stop before the power drops. For a virtual node, you can run "systemctl stop pacemaker_remote" on the virtual node, then virsh destroy it from the host. In either case, confirm in pcs status that the delayed resource is "Stopping" when the node is lost.

4. After some time, the remote connection will be detected as lost.
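
As a rough sketch of steps 1-3 (the node name "remote-1", its address, and the resource names below are placeholders, not taken from this report):

# step 1: remote connection with reconnect_interval, a matching cluster-recheck-interval,
# and a resource pinned to the remote node
pcs resource create remote-1 ocf:pacemaker:remote server=remote-1.example.com reconnect_interval=60s
pcs property set cluster-recheck-interval=60s
pcs resource create always-on ocf:pacemaker:Dummy
pcs constraint location always-on prefers remote-1=INFINITY

# step 2: a Dummy resource whose stop is slow enough to be interrupted
pcs resource create slow-stop ocf:pacemaker:Dummy op_sleep=10
pcs constraint location slow-stop prefers remote-1=INFINITY

# step 3 (virtual node): start a graceful stop on the node, then cut power from the host
# before the stop completes
systemctl stop pacemaker_remote      # on remote-1
virsh destroy remote-1               # on the hypervisor host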

Before the fix, there will be recovery problems such as fencing loops of the remote node. The logs will show actions scheduled for the remote node repeatedly failing with return code 189 (these failures don't show up in pcs status).

After the fix, the remote node will be fenced once, and the cluster will recover properly. There may be rc=189 entries associated with the initial connection loss, but none afterward.
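
To spot those return codes, the same check used in the verification below can be run on a cluster node (pacemaker's default log location is assumed):

grep 'rc=189' /var/log/pacemaker/pacemaker.log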

Comment 3 pkomarov 2019-08-25 12:49:55 UTC
Verified, 

[stack@undercloud-0 ~]$ ansible overcloud_nodes -b -mshell -a'rpm -qa|grep pacemaker'

controller-0 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-cli-2.0.1-4.el8_0.3.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.3.noarch
pacemaker-2.0.1-4.el8_0.3.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-cluster-libs-2.0.1-4.el8_0.3.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-remote-2.0.1-4.el8_0.3.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.3.x86_64

controller-1 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-cli-2.0.1-4.el8_0.3.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.3.noarch
pacemaker-2.0.1-4.el8_0.3.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-cluster-libs-2.0.1-4.el8_0.3.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-remote-2.0.1-4.el8_0.3.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.3.x86_64

messaging-0 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-libs-2.0.1-4.el8_0.4.x86_64

compute-0 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-libs-2.0.1-4.el8_0.4.x86_64

controller-2 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-libs-2.0.1-4.el8_0.4.x86_64

database-0 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-cli-2.0.1-4.el8_0.3.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.3.noarch
pacemaker-2.0.1-4.el8_0.3.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-cluster-libs-2.0.1-4.el8_0.3.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-remote-2.0.1-4.el8_0.3.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.3.x86_64

database-1 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-cli-2.0.1-4.el8_0.3.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.3.noarch
pacemaker-2.0.1-4.el8_0.3.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-cluster-libs-2.0.1-4.el8_0.3.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-remote-2.0.1-4.el8_0.3.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.3.x86_64

messaging-1 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-libs-2.0.1-4.el8_0.4.x86_64

messaging-2 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-remote-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-libs-2.0.1-4.el8_0.4.x86_64

database-2 | CHANGED | rc=0 >>
pacemaker-cli-2.0.1-4.el8_0.4.x86_64
pacemaker-cli-2.0.1-4.el8_0.3.x86_64
pacemaker-schemas-2.0.1-4.el8_0.4.noarch
pacemaker-2.0.1-4.el8_0.4.x86_64
pacemaker-schemas-2.0.1-4.el8_0.3.noarch
pacemaker-2.0.1-4.el8_0.3.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-cluster-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-cluster-libs-2.0.1-4.el8_0.3.x86_64
puppet-pacemaker-0.7.3-0.20190719130411.4c06196.el8ost.noarch
pacemaker-remote-2.0.1-4.el8_0.3.x86_64
pacemaker-libs-2.0.1-4.el8_0.4.x86_64
pacemaker-libs-2.0.1-4.el8_0.3.x86_64


[root@overcloud-database-1 ~]# systemctl status pacemaker_remote
● pacemaker_remote.service - Pacemaker Remote executor daemon
   Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2019-08-25 12:28:59 UTC; 1min 11s ago
     Docs: man:pacemaker-remoted
           https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Remote/index.html
 Main PID: 391990 (pacemaker-remot)
    Tasks: 1
   Memory: 3.9M
   CGroup: /system.slice/pacemaker_remote.service
           └─391990 /usr/sbin/pacemaker-remoted

Aug 25 12:28:59 overcloud-database-1 systemd[1]: Started Pacemaker Remote executor daemon.
Aug 25 12:28:59 overcloud-database-1 pacemaker-remoted[391990]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
Aug 25 12:28:59 overcloud-database-1 pacemaker-remoted[391990]: notice: Starting TLS listener on port 3121
Aug 25 12:29:19 overcloud-database-1 pacemaker-remoted[391990]: notice: Remote client connection accepted
Aug 25 12:29:21 overcloud-database-1 podman(galera-bundle-podman-0)[392646]: INFO: running container galera-bundle-podman-0 for the first time



# Dummy resource on overcloud-database-1:
pcs resource create test_resource ocf:pacemaker:Dummy op_sleep=10
pcs constraint location test_resource prefers overcloud-database-1=INFINITY

[root@overcloud-controller-2 ~]# pcs config |grep -A 1 test_resource
 Resource: test_resource (class=ocf provider=pacemaker type=Dummy)
  Attributes: op_sleep=10
  Operations: migrate_from interval=0s timeout=20s (test_resource-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20s (test_resource-migrate_to-interval-0s)
              monitor interval=10s timeout=20s (test_resource-monitor-interval-10s)
              reload interval=0s timeout=20s (test_resource-reload-interval-0s)
              start interval=0s timeout=20s (test_resource-start-interval-0s)
              stop interval=0s timeout=20s (test_resource-stop-interval-0s)

--
  Resource: test_resource
    Enabled on: overcloud-database-1 (score:INFINITY) (id:location-test_resource-overcloud-database-1-INFINITY)
Ordering Constraints:


[root@overcloud-database-1 ~]# date && systemctl stop pacemaker_remote;echo b >/proc/sysrq-trigger 
Sun Aug 25 12:37:58 UTC 2019
packet_write_wait: Connection to 192.168.24.14 port 22: Broken pipe

#check initial rc=189 logs but then they disappear:
[root@overcloud-controller-2 cluster]# grep 'rc=189' /var/log/pacemaker/pacemaker.log
Aug 25 12:36:41 overcloud-controller-2 pacemaker-controld  [916685] (update_failcount) 	info: Updating failcount for galera-bundle-podman-0 on overcloud-database-1 after failed monitor: rc=189 (update=value++, time=1566736601)
Aug 25 12:36:41 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:36:41 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:37:54 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:39:06 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:40:18 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:41:30 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:42:42 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:43:55 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:45:07 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:46:19 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
Aug 25 12:47:31 overcloud-controller-2 pacemaker-schedulerd[916684] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-podman-0 on overcloud-database-1: unknown | rc=189
[root@overcloud-controller-2 cluster]# 
[root@overcloud-controller-2 cluster]# 
[root@overcloud-controller-2 cluster]# date
Sun Aug 25 12:48:29 UTC 2019

Comment 5 errata-xmlrpc 2019-09-11 09:34:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2719

