+++ This bug was initially created as a clone of Bug #1704870 +++

Description of problem:

A pacemaker_remote node is fenced over and over. We thought the issue might have been caused by bz1701256, but the fencing continued even after the cib.xml was fixed.

The remote node:

$ grep Dummy blvsbgdellpod-overcloud-controller-1/var/lib/pacemaker/cib/cib.xml -A 12
      <primitive class="ocf" id="compute-unfence-trigger" provider="pacemaker" type="Dummy">
        <operations>
          <op id="compute-unfence-trigger-migrate_from-interval-0s" interval="0s" name="migrate_from" timeout="20"/>
          <op id="compute-unfence-trigger-migrate_to-interval-0s" interval="0s" name="migrate_to" timeout="20"/>
          <op id="compute-unfence-trigger-monitor-interval-10" interval="10" name="monitor" timeout="20"/>
          <op id="compute-unfence-trigger-reload-interval-0s" interval="0s" name="reload" timeout="20"/>
          <op id="compute-unfence-trigger-start-interval-0s" interval="0s" name="start"/>
          <op id="compute-unfence-trigger-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="20"/>
        </operations>
        <meta_attributes id="compute-unfence-trigger-meta_attributes">
          <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
        </meta_attributes>
      </primitive>

The pacemaker remote node started:

Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting TLS listener on port 3121
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
Apr 29 15:31:31 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: info: main: Starting
Apr 29 15:31:52 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0
Apr 29 15:31:52 [38552] blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted: error: handle_new_connection: Error in connection setup (38552-45729-15): Remote I/O error (121)

$ grep "* pacemaker_remote.service" sos_commands/systemd/systemctl_status_--all -A 20 | prespace
* pacemaker_remote.service - Pacemaker Remote Service
   Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-29 15:31:31 UTC; 12min ago
     Docs: man:pacemaker-remoted
           https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/index.html
 Main PID: 38552 (pacemaker_remot)
    Tasks: 1
   Memory: 1.9M
   CGroup: /system.slice/pacemaker_remote.service
           `-38552 /usr/sbin/pacemaker_remoted

Apr 29 15:43:19 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: No ipc providers available for uid 0 gid 0
Apr 29 15:43:19 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: Error in connection setup (38552-56493-15): Remote I/O error (121)
Apr 29 15:43:19 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: No ipc providers available for uid 0 gid 0
Apr 29 15:43:19 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: Error in connection setup (38552-56498-15): Remote I/O error (121)
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: No ipc providers available for uid 0 gid 0
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: Error in connection setup (38552-57542-15): Remote I/O error (121)
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: No ipc providers available for uid 0 gid 0
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: Error in connection setup (38552-57544-15): Remote I/O error (121)
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: No ipc providers available for uid 0 gid 0
Apr 29 15:43:27 blvsbgdellpod-overcloud-computeovsdpdk-1 pacemaker_remoted[38552]: error: Error in connection setup (38552-57600-15): Remote I/O error (121)

-----

The controller node that is the DC:

The crm_mon output shows an error, and that the remote node is online and running:

RemoteOnline: [ blvsbgdellpod-overcloud-computeovsdpdk-0 blvsbgdellpod-overcloud-computeovsdpdk-1 ....}

 compute-unfence-trigger        (ocf::pacemaker:Dummy): FAILED blvsbgdellpod-overcloud-computeovsdpdk-1 (blocked)

* compute-unfence-trigger_stop_0 on blvsbgdellpod-overcloud-computeovsdpdk-1 'unknown' (189): call=6767, status=Error, exitreason='', last-rc-change='Mon Apr 29 15:38:38 2019', queued=0ms, exec=0ms

Apr 29 15:37:59 [124187] blvsbgdellpod-overcloud-controller-1 pengine: info: common_print: compute-unfence-trigger (ocf::pacemaker:Dummy): FAILED blvsbgdellpod-overcloud-computeovsdpdk-1
Apr 29 15:37:59 [124187] blvsbgdellpod-overcloud-controller-1 pengine: info: RecurringOp: Start recurring monitor (10s) for compute-unfence-trigger:1 on blvsbgdellpod-overcloud-computeovsdpdk-1
Apr 29 15:37:59 [124187] blvsbgdellpod-overcloud-controller-1 pengine: notice: LogAction: * Recover compute-unfence-trigger:1 ( blvsbgdellpod-overcloud-computeovsdpdk-1 )
Apr 29 15:37:59 [124188] blvsbgdellpod-overcloud-controller-1 crmd: notice: te_rsc_command: Initiating stop operation compute-unfence-trigger_stop_0 on blvsbgdellpod-overcloud-computeovsdpdk-1 | action 74
[...]
Apr 29 15:37:59 [124188] blvsbgdellpod-overcloud-controller-1 crmd: warning: status_from_rc: Action 74 (compute-unfence-trigger_stop_0) on blvsbgdellpod-overcloud-computeovsdpdk-1 failed (target: 0 vs. rc: 189): Error
Apr 29 15:37:59 [124188] blvsbgdellpod-overcloud-controller-1 crmd: notice: abort_transition_graph: Transition aborted by operation compute-unfence-trigger_stop_0 'modify' on blvsbgdellpod-overcloud-controller-2: Event failed | magic=4:189;74:18136:0:626811f2-c61d-4673-a380-2c5a7c96b969 cib=0.379.127 source=match_graph_event:308 complete=false
Apr 29 15:37:59 [124188] blvsbgdellpod-overcloud-controller-1 crmd: info: match_graph_event: Action compute-unfence-trigger_stop_0 (74) confirmed on blvsbgdellpod-overcloud-computeovsdpdk-1 (rc=189)
Apr 29 15:37:59 [124188] blvsbgdellpod-overcloud-controller-1 crmd: info: update_failcount: Updating failcount for compute-unfence-trigger on blvsbgdellpod-overcloud-computeovsdpdk-1 after failed stop: rc=189 (update=INFINITY, time=1556552279)

-----

* In the latest sosreports, RH support tried to debug the issue and at least get the cluster into a state where the remote node would stop being fenced, by putting the remote node in standby, unmanaging resources, etc. Eventually they were able to get the remote node to rejoin and start resources without being fenced.
Here is a pastebin description of what they did after making the cib.xml changes:
- https://privatebin-it-iso.int.open.paas.redhat.com/?7967e5e013994445#YC2qFkCLq3UPEzX90aALjip0JNK91uoTOMqYPfszYRk=

Version-Release number of selected component (if applicable):

Cluster nodes:

$ grep -h -ie pacemaker */installed-rpms | awk '{print $1}' | sort | uniq -c
      3 ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
      3 pacemaker-1.1.19-8.el7_6.4.x86_64
      3 pacemaker-cli-1.1.19-8.el7_6.4.x86_64
      3 pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
      3 pacemaker-libs-1.1.19-8.el7_6.4.x86_64
      3 pacemaker-remote-1.1.19-8.el7_6.4.x86_64
      3 puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch

Remote node:

$ grep -h -ie pacemaker blvsbgdellpod-overcloud-computeovsdpdk-9/installed-rpms | awk '{print $1}' | sort | uniq -c
      1 ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
      1 pacemaker-1.1.19-8.el7_6.4.x86_64
      1 pacemaker-cli-1.1.19-8.el7_6.4.x86_64
      1 pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
      1 pacemaker-libs-1.1.19-8.el7_6.4.x86_64
      1 pacemaker-remote-1.1.19-8.el7_6.4.x86_64
      1 puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch

How reproducible:

Every time.

Steps to Reproduce:
1. systemctl enable pacemaker_remoted

Actual results:

The remote node fails to communicate with the cluster, and the remote node is then fenced because a resource fails to stop.

Expected results:

The remote node should be able to participate in the cluster and start its resources.

Additional info:

We thought the issue might be caused by having "requires=unfencing" in the wrong place. See the bugzilla that was opened previously and is being worked on by the OpenStack team to fix the installer issue:
- Bug 1701256 – [osp13] A pacemaker_remoted node fails monitor (probe) and stop operations on a resource because it returns "rc=189"
  https://bugzilla.redhat.com/show_bug.cgi?id=1701256

That does not appear to be the issue here: the customer applied the requested changes to cib.xml and the problem did not resolve.
# pcs resource meta compute-unfence-trigger requires=unfencing
# pcs resource update compute-unfence-trigger op start interval=0

They made the changes to the cib.xml at around the following time:

Apr 29 15:03:45 [124183] blvsbgdellpod-overcloud-controller-1 cib: info: cib_perform_op: ++ <nvpair id="compute-unfence-trigger-meta_attributes-requires" name="requires" value="unfencing"/>
Apr 29 15:03:45 [124184] blvsbgdellpod-overcloud-controller-1 stonith-ng: info: build_device_from_xml: The fencing device 'stonith-fence_compute-fence-nova' requires unfencing
Apr 29 15:04:02 [124183] blvsbgdellpod-overcloud-controller-1 cib: info: cib_perform_op: + /cib/configuration/resources/clone[@id='compute-unfence-trigger-clone']/primitive[@id='compute-unfence-trigger']/operations/op[@id='compute-unfence-trigger-start-interval-0s']: @requires=
Apr 29 15:04:33 [124183] blvsbgdellpod-overcloud-controller-1 cib: info: cib_perform_op: + /cib/configuration/resources/clone[@id='compute-unfence-trigger-clone']/primitive[@id='compute-unfence-trigger']/operations/op[@id='compute-unfence-trigger-start-interval-0s']: @requires=

-----

The sosreports are located here:
- http://file.rdu.redhat.com/~bfrank/sosreport-blvsbgdellpod-overcloud-computeovsdpdk-1-2019-04-29-zkipctp.tar.xz
- http://file.rdu.redhat.com/~bfrank/sosreport-blvsbgdellpod-overcloud-controller-1-2019-04-29-tcmxtaf.tar.xz

<snip>

--- Additional comment from Ken Gaillot on 2019-06-03 22:30:25 UTC ---

From the pattern of events, I'm guessing the test consists of a power cut, where the cut is configured as a graceful power-down followed by a hard power-down a short time later. (This is often configurable in IPMI as the behavior of the power button or of an IPMI command.) The test thus causes pacemaker_remoted to initiate a graceful shutdown, which is then interrupted by a hard power cut. That exposed a bug in pacemaker's scheduling when a remote node is lost in the middle of shutting down.
When the remote node is lost, pacemaker schedules fencing of the remote node, implied stops of any resources running on it, and a stop of the remote connection, in that order. The connection stop is ordered last because, in the general case, connections aren't stopped until all actions for the remote node have completed. Here those actions are the implied stops, so the ordering is not actually needed: the connection is not required to execute the stops.

Because of this unnecessary ordering, once the remote node is successfully fenced, the scheduler runs again and finds that the remote node is cleanly down (because of the fencing), yet the remote connection is in a started-and-failed state (because the connection stop has been waiting on the fencing and hasn't happened yet). Since the node appears clean and the connection appears started, a resource probe is scheduled on the node. The probe fails because the connection is not actually active.

The fixes will be to remove the unnecessary ordering, and to refrain from scheduling actions (such as the probe) on a remote node when the connection is failed and not being started again.

--- Additional comment from Ken Gaillot on 2019-06-14 19:44:03 UTC ---

This has been fixed in the upstream 1.1 branch by these commits:

c20f8920 - don't order implied stops relative to a remote connection
938e99f2 - remote state is failed if node is shutting down with connection failure

<snip>

--- Additional comment from Ken Gaillot on 2019-06-17 14:42:37 UTC ---

QA: The reproducer is:

1. Configure a cluster with a bare-metal Pacemaker Remote node whose connection resource has a reconnect_interval set (the OpenStack setup does this for compute nodes), and a resource that will always run on the remote node (OpenStack has the compute-unfence-trigger clone).

2. pacemaker-remoted's shutdown must take long enough to be interrupted by the power cut.
   The OpenStack setup seems to do this reliably, but one way to ensure it would be to create a dummy resource with a 5-second delay, constrained to run on the remote node (pacemaker-remoted will wait for the dummy resource to stop before completing its shutdown).

3. Start a graceful shutdown of the Pacemaker Remote daemon, and hard power down the node before it can complete. The easiest way to do this is to leave acpid enabled in the remote node's OS, and leave IPMI configured to send the power-cut signal to the OS and then cut power a few seconds later; pressing the power button (or cutting power from the IPMI interface, if it's configured to behave similarly) will then cause acpid to notify systemd, which will tell pacemaker-remoted to stop.

Before: problems such as fencing loops on the remote node, and failures with return code 189 reported on the remote node.

After: the cluster recovers from the power cut.
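Steps 2 and 3 of the reproducer above could be scripted roughly as follows. This is a hedged sketch, not a tested procedure: the resource and node names and the BMC address are placeholders, and it assumes the ocf:pacemaker:Dummy agent's op_sleep parameter (which delays every operation, including stop) is available in the installed agent version:

```shell
# Step 2: a Dummy resource whose operations each sleep 5 seconds, pinned to
# the remote node, so pacemaker-remoted's graceful shutdown takes long
# enough to be interrupted by the hard power cut.
pcs resource create slow-dummy ocf:pacemaker:Dummy op_sleep=5 \
    op stop timeout=30s
pcs constraint location slow-dummy prefers <remote-node-name>

# Step 3: on the remote node, begin a graceful daemon shutdown...
systemctl stop pacemaker_remote &

# ...then cut power hard before the stop completes, e.g. via IPMI:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power off
```

In the described test setup the power button itself triggers both halves of step 3 (ACPI graceful shutdown followed by a hard cut), so the two commands above stand in for that single button press.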
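The two scheduler passes described in comment 1 can be sketched as a toy model. This is illustrative Python, not Pacemaker source; the action names and the two boolean state flags are invented for clarity:

```python
# Toy model of the buggy scheduling sequence: a remote node lost mid-shutdown
# is fenced, but the connection stop is queued behind the fencing, so a
# second scheduler pass sees a "clean node + started connection" and probes.

def schedule(node_fenced, connection_stopped):
    """Return the actions a simplified scheduler would plan next."""
    if not node_fenced:
        # First pass: fence the lost node, then the implied stops, then the
        # connection stop -- ordered last even though the connection is not
        # needed to carry out the implied stops (the node is being fenced).
        return ["fence-remote", "implied-stop-resources", "stop-connection"]
    if not connection_stopped:
        # Second pass (the bug): the node is cleanly down after fencing, but
        # the connection record still looks "started" because its stop was
        # waiting on the fencing. The node therefore looks probe-able, and
        # the probe fails since the connection is not actually active.
        return ["probe-resources"]
    return []  # once the connection stop completes, nothing is left to do

print(schedule(node_fenced=True, connection_stopped=False))
# -> ['probe-resources'], the erroneous probe that fails with rc=189
```

The upstream fix corresponds to removing the extra ordering edge and treating a failed, not-restarting connection as stopped, so the second pass plans nothing instead of a probe.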
The fix is already in the 8.1 build via the rebase in Bug 1695737, but I made this a separate BZ so we could ask for an 8.0.0 z-stream.
This bug has been copied as 8.0.0 z-stream Bug #1734066 and must now be resolved in the current update release; blocker flag set.
Verified; verification steps at: https://bugzilla.redhat.com/show_bug.cgi?id=1734066#c3
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3385