Bug 1898457
Summary: | systemctl stop pacemaker_remote hangs forever | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> |
Component: | pacemaker | Assignee: | pkomarov |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 8.3 | CC: | cluster-maint, dabarzil, kgaillot, lmiccini, msmazova, nwahl, pkomarov |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 8.4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | pacemaker-2.0.5-6.el8 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Pacemaker marked a remote node's connection as needing a stop when unpacking the remote node's history.
Consequence: If a cluster node's history occurred later in the CIB status section than the remote node's, and contained resource history for the remote connection, the indication of needing a stop would be lost. If this happened at cluster shutdown, the cluster would hang until the shutdown timed out.
Fix: The connection is marked as needing a stop after all node history has been processed.
Result: The stop is correctly scheduled, and cluster shutdown proceeds as expected.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-05-18 15:26:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Comment 4
Ken Gaillot
2020-11-20 17:38:03 UTC
The issue occurs when there are non-remote nodes in <status> after the remote node that's being stopped. [root@fastvm-rhel-8-0-23 pacemaker]# grep '<node_state' /tmp/sos_cib_good.xml <node_state id="7" uname="messaging-0" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_state_transition"> ...Omitting some non-remote nodes for space... <node_state id="2" uname="controller-1" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state remote_node="true" id="rabbitmq-bundle-1" uname="rabbitmq-bundle-1" in_ccm="true" crm-debug-origin="do_state_transition"> <node_state remote_node="true" id="rabbitmq-bundle-2" uname="rabbitmq-bundle-2" in_ccm="true" crm-debug-origin="do_state_transition"> <node_state remote_node="true" id="compute-0" uname="compute-0" in_ccm="true" crm-debug-origin="do_state_transition"> ...The rest are all remote nodes... [root@fastvm-rhel-8-0-23 pacemaker]# grep '<node_state' /tmp/sos_cib_bad.xml <node_state id="4" uname="database-0" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> ...Omitting some non-remote nodes for space... <node_state id="8" uname="messaging-1" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state id="compute-0" remote_node="true" uname="compute-0" in_ccm="true" crm-debug-origin="do_state_transition"> <node_state id="compute-1" remote_node="true" uname="compute-1" in_ccm="false" crm-debug-origin="do_state_transition"> <node_state id="7" uname="messaging-0" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state id="rabbitmq-bundle-1" remote_node="true" uname="rabbitmq-bundle-1" in_ccm="true" crm-debug-origin="do_update_resource" node_fenced="0"> ...Omitting some remote nodes for space... <node_state id="9" uname="messaging-2" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state remote_node="true" id="rabbitmq-bundle-2" uname="rabbitmq-bundle-2" in_ccm="true" crm-debug-origin="do_update_resource" node_fenced="0"> In the /tmp/sos_cib_bad.xml, if I move the messaging-0 and messaging-2 elements to a position **above** compute-0, then compute-0 gets stopped as it should. [root@fastvm-rhel-8-0-23 pacemaker]# cp /tmp/sos_cib_bad.xml /tmp/sos_cib_test.xml [root@fastvm-rhel-8-0-23 pacemaker]# grep '<node_state' /tmp/sos_cib_test.xml <node_state id="4" uname="database-0" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> ...Omitting some non-remote nodes for space... <node_state id="7" uname="messaging-0" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state id="9" uname="messaging-2" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource"> <node_state id="compute-0" remote_node="true" uname="compute-0" in_ccm="true" crm-debug-origin="do_state_transition"> ...The rest are all remote nodes... [root@fastvm-rhel-8-0-23 pacemaker]# crm_simulate -S --xml-file /tmp/sos_cib_good.xml 2>&1 | awk '$2 == "Stop" && $3 == "compute-0"' * Stop compute-0 ( controller-0 ) due to node availability [root@fastvm-rhel-8-0-23 pacemaker]# crm_simulate -S --xml-file /tmp/sos_cib_bad.xml 2>&1 | awk '$2 == "Stop" && $3 == "compute-0"' [root@fastvm-rhel-8-0-23 pacemaker]# crm_simulate -S --xml-file /tmp/sos_cib_test.xml 2>&1 | awk '$2 == "Stop" && $3 == "compute-0"' * Stop compute-0 ( controller-0 ) due to node availability ----- Here's what seems to be happening. unpack_node_loop() iterates over all the nodes in <status> in the order in which they appear. When unpack_node_loop() reaches compute-0, it sets rsc->next_role to RSC_ROLE_STOPPED. However, if the loop later reaches a non-remote node whose history contains a monitor operation with rc=7 for the compute-0 remote resource, it sets rsc->next_role to RSC_ROLE_UNKNOWN for compute-0 based on the op result. determine_op_status()[1] and/or update_resource_state()[2] are setting rsc->next_role back to RSC_ROLE_UNKNOWN. [1] https://github.com/ClusterLabs/pacemaker/blob/master/lib/pengine/unpack.c#L3115 [2] https://github.com/ClusterLabs/pacemaker/blob/master/lib/pengine/unpack.c#L3510 If next_role == RSC_ROLE_STOPPED at the end of the loop, the compute-0 resource gets stopped correctly. If next_role == RSC_ROLE_UNKNOWN, the compute-0 resource continues running. Two possible approaches come to mind. - Process all the non-remote nodes' state elements first, and then process the remote and maybe guest nodes' states. - Ensure that the remote resource's next_role doesn't get set to RSC_ROLE_UNKNOWN later in the loop if it's already been set to RSC_ROLE_STOPPED due to a shutdown. I'm not sure that a simple check (e.g., `if (rsc->next_role != RSC_ROLE_STOPPED)`) would be safe for all situations. I'll wait for Ken's thoughts when he has time. I've included some gdb output below. -------------------- GOOD: Breakpoint 5, determine_op_status (data_set=0x634ad0, on_fail=0x7fffffffe094, xml_op=0x1205f20, node=0x12744a0, target_rc=7, rc=<optimized out>, rsc=0x127a100) at unpack.c:3132 3132 *on_fail = action_fail_ignore; rsc->id: compute-0 (gdb) p node->details->uname $8 = 0xe4cf10 "controller-1" Breakpoint 6, update_resource_state (rsc=rsc@entry=0x127a100, node=node@entry=0x12744a0, xml_op=xml_op@entry=0x1205f20, task=task@entry=0x12063f0 "monitor", rc=<optimized out>, last_failure=<optimized out>, on_fail=0x7fffffffe094, data_set=0x634ad0) at unpack.c:3538 3538 *on_fail = action_fail_ignore; rsc->id: compute-0 ... Breakpoint 1, unpack_handle_remote_attrs (data_set=0x634ad0, state=0x12227a0, this_node=0x1278b10) at unpack.c:1000 1000 rsc->next_role = RSC_ROLE_STOPPED; rsc->id: compute-0 (gdb) p this_node->details->uname $9 = 0xe55eb0 "compute-0" [Loop finishes up without changing rsc->next_role for compute-0 again.] Bad: Breakpoint 5, determine_op_status (data_set=0x634ad0, on_fail=0x7fffffffe094, xml_op=0x1168360, node=0x12a3c90, target_rc=7, rc=<optimized out>, rsc=0x12a6d40) at unpack.c:3132 3132 *on_fail = action_fail_ignore; rsc->id: compute-0 (gdb) p node->details->uname $12 = 0xe680e0 "messaging-1" Breakpoint 6, update_resource_state (rsc=rsc@entry=0x12a6d40, node=node@entry=0x12a3c90, xml_op=xml_op@entry=0x1168360, task=task@entry=0x1168830 "monitor", rc=<optimized out>, last_failure=<optimized out>, on_fail=0x7fffffffe094, data_set=0x634ad0) at unpack.c:3538 3538 *on_fail = action_fail_ignore; rsc->id: compute-0 ... Breakpoint 1, unpack_handle_remote_attrs (data_set=0x634ad0, state=0x11a8040, this_node=0x12a5750) at unpack.c:1000 1000 rsc->next_role = RSC_ROLE_STOPPED; rsc->id: compute-0 ... Breakpoint 5, determine_op_status (data_set=0x634ad0, on_fail=0x7fffffffe094, xml_op=0x11c76a0, node=0x12a3af0, target_rc=7, rc=<optimized out>, rsc=0x12a6d40) at unpack.c:3132 3132 *on_fail = action_fail_ignore; rsc->id: compute-0 (gdb) p node->details->uname $13 = 0xe67150 "messaging-0" Breakpoint 6, update_resource_state (rsc=rsc@entry=0x12a6d40, node=node@entry=0x12a3af0, xml_op=xml_op@entry=0x11c76a0, task=task@entry=0x11c7b70 "monitor", rc=<optimized out>, last_failure=<optimized out>, on_fail=0x7fffffffe094, data_set=0x634ad0) at unpack.c:3538 3538 *on_fail = action_fail_ignore; rsc->id: compute-0 ... Breakpoint 5, determine_op_status (data_set=0x634ad0, on_fail=0x7fffffffe094, xml_op=0x1259c40, node=0x12a4720, target_rc=7, rc=<optimized out>, rsc=0x12a6d40) at unpack.c:3132 3132 *on_fail = action_fail_ignore; rsc->id: compute-0 (gdb) p node->details->uname $15 = 0xe69070 "messaging-2" (gdb) fn Breakpoint 6, update_resource_state (rsc=rsc@entry=0x12a6d40, node=node@entry=0x12a4720, xml_op=xml_op@entry=0x1259c40, task=task@entry=0x125a110 "monitor", rc=<optimized out>, last_failure=<optimized out>, on_fail=0x7fffffffe094, data_set=0x634ad0) at unpack.c:3538 3538 *on_fail = action_fail_ignore; rsc->id: compute-0 [Loop finishes up rsc->next_role still set to RSC_ROLE_UNKNOWN for compute-0.] Ah, great analysis. I'll tentatively push it up to 8.4 and hope we can find time for it. Michele, can you see whether OpenStack QA can ack it? ITM 15 is Feb. 15 so we'd need dev+QA to be done by then. (Any later than that, we'd need exception/blocker, which might be possible but is quite difficult these days.) I'm not sure about the exact fix though. I'm not sure why rc=7 sets the resource's next role to unknown, so I need to look at why that's happening. Fix merged upstream as of commit 588a7c6 One of the CIBs that triggered the issue in this BZ was added as an upstream regression test. Verified, remote pacemaker's pacemaker_remote systemd unit stop operation succeeded: [stack@undercloud-0 ~]$ ansible controller -b -mshell -a 'rpm -q pacemaker' controller-1 | CHANGED | rc=0 >> pacemaker-2.0.5-6.el8.x86_64 controller-0 | CHANGED | rc=0 >> pacemaker-2.0.5-6.el8.x86_64 controller-2 | CHANGED | rc=0 >> pacemaker-2.0.5-6.el8.x86_64 [root@overcloud-database-1 ~]# pcs status|head -n 30 Cluster name: tripleo_cluster Cluster Summary: Full List of Resources: [..] * overcloud-database-0 (ocf::pacemaker:remote): Started overcloud-controller-0 * overcloud-messaging-1 (ocf::pacemaker:remote): Started overcloud-controller-1 * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]: * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-database-1 * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-messaging-0 [..] [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'systemctl --type=service|grep pacemaker' [WARNING]: Found both group and host with same name: undercloud compute-0 | FAILED | rc=1 >> non-zero return code controller-2 | CHANGED | rc=0 >> pacemaker.service loaded active running Pacemaker High Availability Cluster Manager database-0 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon controller-0 | CHANGED | rc=0 >> pacemaker.service loaded active running Pacemaker High Availability Cluster Manager controller-1 | CHANGED | rc=0 >> pacemaker.service loaded active running Pacemaker High Availability Cluster Manager database-1 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon messaging-0 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon database-2 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon messaging-2 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon messaging-1 | CHANGED | rc=0 >> pacemaker_remote.service loaded active running Pacemaker Remote executor daemon networker-0 | FAILED | rc=1 >> non-zero return code networker-2 | FAILED | rc=1 >> non-zero return code networker-1 | FAILED | rc=1 >> non-zero return code [root@overcloud-database-1 ~]# systemctl status pacemaker_remote.service ● pacemaker_remote.service - Pacemaker Remote executor daemon Loaded: loaded (/usr/lib/systemd/system/pacemaker_remote.service; enabled; vendor preset: disabled) Active: active (running) since Sun 2021-02-14 15:56:04 UTC; 13min ago Docs: man:pacemaker-remoted https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Remote/index.html Main PID: 554340 (pacemaker-remot) Tasks: 1 Memory: 8.5M CGroup: /system.slice/pacemaker_remote.service └─554340 /usr/sbin/pacemaker-remoted Feb 14 16:00:13 overcloud-database-1 podman[561219]: 2021-02-14 16:00:13.499189146 +0000 UTC m=+0.346036491 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:01:13 overcloud-database-1 podman[562617]: 2021-02-14 16:01:13.937406591 +0000 UTC m=+0.351301140 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:02:14 overcloud-database-1 podman[564119]: 2021-02-14 16:02:14.307998002 +0000 UTC m=+0.297056790 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:03:14 overcloud-database-1 podman[565415]: 2021-02-14 16:03:14.733103478 +0000 UTC m=+0.344107949 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:04:15 overcloud-database-1 podman[566799]: 2021-02-14 16:04:15.177858359 +0000 UTC m=+0.355126402 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:05:15 overcloud-database-1 podman[568300]: 2021-02-14 16:05:15.569375282 +0000 UTC m=+0.312613801 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:06:16 overcloud-database-1 podman[569740]: 2021-02-14 16:06:16.009317706 +0000 UTC m=+0.340700502 container exec a3e7482a98954f3d197e4902e2744e2c3b> Feb 14 16:07:16 overcloud-database-1 podman[571243]: 2021-02-14 16:07:16.404761022 +0000 UTC m=+0.319185999 container exec a3e7482a98954f3d197e4902e2744e2c3 [root@overcloud-database-1 ~]# systemctl stop pacemaker_remote.service #from the remote node Feb 14 16:10:06 overcloud-database-1 systemd[1]: Stopping Pacemaker Remote executor daemon... Feb 14 16:10:06 overcloud-database-1 pacemaker-remoted[554340]: notice: Caught 'Terminated' signal Feb 14 16:10:06 overcloud-database-1 pacemaker-remoted[554340]: notice: TLS server session ended Feb 14 16:10:06 overcloud-database-1 systemd[1]: sysstat-collect.service: Succeeded. Feb 14 16:10:06 overcloud-database-1 systemd[1]: Started system activity accounting tool. Feb 14 16:10:06 overcloud-database-1 podman[575052]: 2021-02-14 16:10:06.863900025 +0000 UTC m=+0.351231662 container exec f6aaf7f91ddf29df98b72a1d56c931a86e762b03a23fe68c7f968870c66c7328 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=clustercheck) Feb 14 16:10:09 overcloud-database-1 healthcheck_clustercheck[575052]: curl: (22) The requested URL returned error: 503 Service Unavailable Feb 14 16:10:09 overcloud-database-1 healthcheck_clustercheck[575052]: 503 172.17.1.90:9200 2.203950 seconds Feb 14 16:10:09 overcloud-database-1 healthcheck_clustercheck[575052]: Error: non zero exit code: 1: OCI runtime error Feb 14 16:10:09 overcloud-database-1 systemd[1]: tripleo_clustercheck_healthcheck.service: Main process exited, code=exited, status=1/FAILURE Feb 14 16:10:09 overcloud-database-1 systemd[1]: tripleo_clustercheck_healthcheck.service: Failed with result 'exit-code'. Feb 14 16:10:09 overcloud-database-1 systemd[1]: Failed to start clustercheck healthcheck. Feb 14 16:10:10 overcloud-database-1 galera(galera)[6162]: INFO: MySQL stopped Feb 14 16:10:11 overcloud-database-1 galera(galera)[6162]: INFO: attempting to read safe_to_bootstrap flag from /var/lib/mysql/grastate.dat Feb 14 16:10:11 overcloud-database-1 galera(galera)[6162]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat Feb 14 16:10:11 overcloud-database-1 galera(galera)[6162]: INFO: Last commit version found: 405865 Feb 14 16:10:12 overcloud-database-1 galera(galera)[6262]: INFO: MySQL is not running Feb 14 16:10:13 overcloud-database-1 pacemaker-remoted[8]: notice: Cleaning up after remote client pacemaker-remote-overcloud-database-1:3123 disconnected Feb 14 16:10:13 overcloud-database-1 podman[575336]: 2021-02-14 16:10:13.701680872 +0000 UTC m=+0.325159980 container exec a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=galera-bundle-podman-0) Feb 14 16:10:13 overcloud-database-1 pacemaker-remoted[8]: notice: Caught 'Terminated' signal Feb 14 16:10:13 overcloud-database-1 systemd[1]: libpod-a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30.scope: Succeeded. Feb 14 16:10:13 overcloud-database-1 systemd[1]: libpod-a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30.scope: Consumed 36.655s CPU time Feb 14 16:10:13 overcloud-database-1 podman[575368]: 2021-02-14 16:10:13.961362881 +0000 UTC m=+0.195648448 container died a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=galera-bundle-podman-0) Feb 14 16:10:13 overcloud-database-1 podman[575368]: 2021-02-14 16:10:13.962396491 +0000 UTC m=+0.196682023 container stop a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=galera-bundle-podman-0) Feb 14 16:10:13 overcloud-database-1 systemd[563387]: var-lib-containers-storage-overlay\x2dcontainers-a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30-userdata-shm.mount: Succeeded. Feb 14 16:10:13 overcloud-database-1 systemd[1]: var-lib-containers-storage-overlay\x2dcontainers-a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30-userdata-shm.mount: Succeeded. Feb 14 16:10:13 overcloud-database-1 systemd[563387]: var-lib-containers-storage-overlay-0bf15e027048e02692d2dcf979b76acbd519c72900d9acaa905a2aea2803c156-merged.mount: Succeeded. Feb 14 16:10:13 overcloud-database-1 systemd[1]: var-lib-containers-storage-overlay-0bf15e027048e02692d2dcf979b76acbd519c72900d9acaa905a2aea2803c156-merged.mount: Succeeded. Feb 14 16:10:13 overcloud-database-1 podman(galera-bundle-podman-0)[575332]: INFO: a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 Feb 14 16:10:13 overcloud-database-1 podman[575384]: 2021-02-14 16:10:13.998132791 +0000 UTC m=+0.121910571 container cleanup a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=galera-bundle-podman-0) Feb 14 16:10:14 overcloud-database-1 systemd[1]: libpod-conmon-a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30.scope: Succeeded. Feb 14 16:10:14 overcloud-database-1 podman(galera-bundle-podman-0)[575332]: NOTICE: Cleaning up inactive container, galera-bundle-podman-0. Feb 14 16:10:14 overcloud-database-1 podman[575447]: 2021-02-14 16:10:14.15334016 +0000 UTC m=+0.065534991 container remove a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:16.2_20210129.1, name=galera-bundle-podman-0) Feb 14 16:10:14 overcloud-database-1 podman(galera-bundle-podman-0)[575332]: INFO: a3e7482a98954f3d197e4902e2744e2c3b3373f1357d94ce8c8a8ac183d43b30 Feb 14 16:10:14 overcloud-database-1 pacemaker-remoted[554340]: notice: Cleaning up after remote client pacemaker-remote-172.17.1.90:3121 disconnected Feb 14 16:10:14 overcloud-database-1 systemd[1]: pacemaker_remote.service: Succeeded. Feb 14 16:10:14 overcloud-database-1 systemd[1]: Stopped Pacemaker Remote executor daemon. #from the DC: pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-attrd [189032] (attrd_peer_update) notice: Setting shutdown[overcloud-database-1]: (unset) -> 1613319006 | from overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-attrd [189032] (write_attribute) info: Sent CIB request 77 with 1 change for shutdown (id n/a, set n/a) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_process_request) info: Forwarding cib_modify operation for section status to all (origin=local/attrd/77) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_perform_op) info: Diff: --- 0.134.357 2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_perform_op) info: Diff: +++ 0.134.358 (null) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_perform_op) info: + /cib: @num_updates=358 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_perform_op) info: ++ /cib/status/node_state[@id='overcloud-database-1']/transient_attributes[@id='overcloud-database-1']/instance_attributes[@id='status-overcloud-database-1']: <nvpair id="status-overcloud-database-1-shutdown" name="shutdown" value="1613319006"/> pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-based [189029] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=overcloud-controller-2/attrd/77, version=0.134.358) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-attrd [189032] (attrd_cib_callback) info: CIB update 77 result for shutdown: OK | rc=0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-attrd [189032] (attrd_cib_callback) info: * shutdown[overcloud-database-1]=1613319006 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-controld [189034] (abort_transition_graph) info: Transition 62 aborted by status-overcloud-database-1-shutdown doing create shutdown=1613319006: Transient attribute change | cib=0.134.358 source=abort_unless_down:313 path=/cib/status/node_state[@id='overcloud-database-1']/transient_attributes[@id='overcloud-datab ase-1']/instance_attributes[@id='status-overcloud-database-1'] complete=true pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-controld [189034] (do_state_transition) notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status_fencing) info: Node overcloud-controller-2 is active pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status) info: Node overcloud-controller-2 is online pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status_fencing) info: Node overcloud-controller-0 is active pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status) info: Node overcloud-controller-0 is online pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status_fencing) info: Node overcloud-controller-1 is active pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (determine_online_status) info: Node overcloud-controller-1 is online pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (unpack_handle_remote_attrs) info: Node overcloud-database-1 is shutting down pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-database-0 (ocf::pacemaker:remote): Started overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-database-1 (ocf::pacemaker:remote): Started overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-database-2 (ocf::pacemaker:remote): Started overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-messaging-0 (ocf::pacemaker:remote): Started overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-messaging-1 (ocf::pacemaker:remote): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: overcloud-messaging-2 (ocf::pacemaker:remote): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]: galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-database-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]: galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-database-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]: galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-database-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]: rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-messaging-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]: rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-messaging-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]: rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-messaging-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]: redis-bundle-0 (ocf::heartbeat:redis): Slave overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]: redis-bundle-1 (ocf::heartbeat:redis): Slave overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]: redis-bundle-2 (ocf::heartbeat:redis): Master overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-192.168.24.49 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-10.0.0.140 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-172.17.1.53 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-172.17.1.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-172.17.3.40 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-172.17.4.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]: haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]: haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]: haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]: ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]: ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]: ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: ip-172.17.1.99 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-5254006ca0e4 (stonith:fence_ipmilan): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-525400c359a9 (stonith:fence_ipmilan): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-5254002d7bae (stonith:fence_ipmilan): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-52540012bfbb (stonith:fence_ipmilan): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-525400fe2410 (stonith:fence_ipmilan): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-525400fab13b (stonith:fence_ipmilan): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-5254009f16db (stonith:fence_ipmilan): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-525400236d06 (stonith:fence_ipmilan): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: stonith-fence_ipmilan-5254000f4500 (stonith:fence_ipmilan): Started overcloud-controller-1 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (log_list_item) info: Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]: openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-2 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__native_allocate) info: Resource overcloud-database-1 cannot run anywhere pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__native_merge_weights) info: galera-bundle-podman-0: Rolling back optional scores from galera-bundle-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__native_allocate) info: Resource galera-bundle-podman-0 cannot run anywhere pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: Promoting galera:1 (Master galera-bundle-1) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: Promoting galera:2 (Master galera-bundle-2) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: galera-bundle-master: Promoted 2 instances of a possible 3 to master pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: Promoting redis:2 (Master redis-bundle-2) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: redis-bundle-master: Promoted 1 instances of a possible 1 to master pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: Promoting ovndb_servers:0 (Master ovn-dbs-bundle-0) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (pcmk__set_instance_roles) info: ovn-dbs-bundle-master: Promoted 1 instances of a possible 1 to master pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Start recurring monitor (20s) for galera:0 on galera-bundle-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Cancelling action galera:0_monitor_10000 (Master vs. Slave) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Start recurring monitor (30s) for galera:0 on galera-bundle-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Start recurring monitor (20s) for galera:0 on galera-bundle-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Cancelling action galera:0_monitor_10000 (Master vs. Slave) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (RecurringOp) info: Start recurring monitor (30s) for galera:0 on galera-bundle-0 pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (LogActions) info: Leave overcloud-database-0 (Started overcloud-controller-0) pacemaker.log:Feb 14 16:10:06 overcloud-controller-2 pacemaker-schedulerd[189033] (LogAction) notice: * Stop overcloud-database-1 ( overcloud-controller-0 ) due to node availability ~ Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1782 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1782 |