Bug 1310486
| Field | Value |
|---|---|
| Summary | Pacemaker loses shutdown requests under some conditions |
| Product | Red Hat Enterprise Linux 7 |
| Reporter | Andrew Beekhof <abeekhof> |
| Component | pacemaker |
| Assignee | Ken Gaillot <kgaillot> |
| Status | CLOSED ERRATA |
| QA Contact | cluster-qe <cluster-qe> |
| Severity | high |
| Docs Contact | |
| Priority | medium |
| Version | 7.2 |
| CC | 277937719, abeekhof, cluster-maint, ggillies, jkortus, jruemker, obenes, ochalups, phagara, sbradley |
| Target Milestone | rc |
| Target Release | 7.3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | pacemaker-1.1.15-10.el7 |
| Doc Type | No Doc Update |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2016-11-03 18:58:46 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | |
Description
Andrew Beekhof
2016-02-22 00:22:57 UTC
This will be investigated in the 7.4 timeframe. (In reply to Andrew Beekhof from comment #0) > Description of problem: > > A poorly timed (non-DC) shutdown request can be lost, resulting in a cluster > that appears hung for shutdown-escalation seconds. > After this time, pacemaker on the peer will exiting without having shut down > any of its services. > > The DC will eventually notice this and fence the node (assuming it is > enabled). > > Version-Release number of selected component (if applicable): > > Feb 18 18:23:48 [37360] stage-bare-ctrl-1.localdomain crmd: notice: > main: CRM Git Version: 1.1.13-10.el7 (44eb2dd) > > How reproducible: > > random > > Steps to Reproduce: > > 1. shut down a node at just the wrong time (after a new DC is elected) > > Actual results: > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ > input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ] > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > abort_transition_graph: Transition aborted: Peer Cancelled > (source=do_te_invoke:151, 1) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: warning: > match_down_event: No match for shutdown action on 1 > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > abort_transition_graph: Transition aborted by deletion of lrm[@id='1']: > Resource state removal (cib=0.171.16, source=abort_unless_down:333, > path=/cib/status/node_state[@id='1']/lrm[@id='1'], 1) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > abort_transition_graph: Transition aborted: LRM Refresh > (source=process_resource_updates:266, 1) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > crm_update_peer_expected: handle_request: Node stage-bare-ctrl-0[1] - > expected state is now down (was member) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > handle_shutdown_request: Creating shutdown request for stage-bare-ctrl-0 > (state=S_POLICY_ENGINE) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > abort_transition_graph: Transition aborted by status-1-shutdown, > shutdown=1455858801: Transient attribute change (modify cib=0.171.23, > source=abort_unless_down:319, > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/ > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > abort_transition_graph: Transition aborted by status-1-shutdown, shutdown=0: > Transient attribute change (modify cib=0.171.24, > source=abort_unless_down:319, > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/ > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1) > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE > [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > do_te_invoke: Processing graph 32 (ref=pe_calc-dc-1455858801-579) derived > from /var/lib/pacemaker/pengine/pe-input-13.bz2 > > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain crmd: error: > crm_timer_popped: Shutdown Escalation (I_STOP) just popped in state > S_NOT_DC! (1200000ms) > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain crmd: error: > lrm_state_verify_stopped: 41 resources were active at shutdown. 
> > Expected results: > > Node shuts down promptly. > > Additional info: > > It appears abort_transition_graph() is not correctly restarting the PE run > (invalidating the existing query that predated the handle_shutdown_request() > call). Hello, when I use pacemaker there are three nodes in the cluster. I set maintenance-mode=true, then I run systemctl stop pacemaker and systemctl stop corosync on all nodes in turn. But sometimes one node takes 20 minutes to shut down pacemaker, and that node is not the DC node. The DC node is the last one to shut down. Here is my corosync log; is this a bug? Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: + /cib: @num_updates=27 Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/191, version=0.16.27) Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/192, version=0.16.27) Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/193, version=0.16.27) Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: Diff: --- 0.16.27 2 Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: Diff: +++ 0.16.28 (null) Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: + /cib: @num_updates=28, @dc-uuid=1 Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: Completed cib_modify operation for section cib: OK (rc=0, origin=cvm193/crmd/58, version=0.16.28) Jul 29 08:54:09 [16231] cvm194 cib: info: cib_process_ping: Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for 0.16.28 (0x28f3f90 0) Jul 29 08:54:43 [16230] cvm194 pacemakerd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated Jul 29 08:54:43 [16230] cvm194 pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker Jul 29 08:54:43 [16230] cvm194 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 16236 Jul 29 08:54:43 [16236] cvm194 crmd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated Jul 29 08:54:43 [16236] cvm194 crmd: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms Jul 29 08:54:43 [16236] cvm194 crmd: info: do_shutdown_req: Sending shutdown request to cvm193 Jul 29 08:54:43 [16236] cvm194 crmd: info: crm_update_peer_expected: handle_request: Node cvm194[3] - expected state is now down (was (null)) Jul 29 08:54:43 [16234] cvm194 attrd: info: attrd_peer_update: Setting shutdown[cvm194]: 0 -> 1469753683 from cvm193 Jul 29 08:54:43 [16231] cvm194 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/194, version=0.16.28) Jul 29 08:54:48 [16231] cvm194 cib: info: cib_process_ping: Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for 0.16.28 (0x28f3f90 0) Jul 29 09:14:43 [16236] cvm194 crmd: error: crm_timer_popped: Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms) Jul 29 09:14:43 [16236] cvm194 crmd: notice: do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED origin=crm_timer_popped ] Jul 29 09:14:43 [16236] cvm194 crmd: info: do_shutdown: Disconnecting STONITH... 
Jul 29 09:14:43 [16236] cvm194 crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected Jul 29 09:14:43 [16236] cvm194 crmd: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown... waiting (0 ops remaining) Jul 29 09:14:43 [16236] cvm194 crmd: error: lrm_state_verify_stopped: 1 resources were active at shutdown. Jul 29 09:14:43 [16236] cvm194 crmd: info: do_lrm_control: Disconnecting from the LRM Jul 29 09:14:43 [16236] cvm194 crmd: info: lrmd_api_disconnect: Disconnecting from 1 lrmd service Jul 29 09:14:43 [16236] cvm194 crmd: info: lrmd_ipc_connection_destroy: IPC connection destroyed Jul 29 09:14:43 [16236] cvm194 crmd: info: lrm_connection_destroy: LRM Connection disconnected Jul 29 09:14:43 [16236] cvm194 crmd: info: lrmd_api_disconnect: Disconnecting from 1 lrmd service Jul 29 09:14:43 [16236] cvm194 crmd: notice: do_lrm_control: Disconnected from the LRM Jul 29 09:14:43 [16236] cvm194 crmd: info: crm_cluster_disconnect: Disconnecting from cluster infrastructure: corosync Jul 29 09:14:43 [16236] cvm194 crmd: notice: terminate_cs_connection: Disconnecting from Corosync Jul 29 09:14:43 [16236] cvm194 crmd: info: crm_cluster_disconnect: Disconnected from corosync Jul 29 09:14:43 [16236] cvm194 crmd: info: do_ha_control: Disconnected from the cluster Jul 29 09:14:43 [16236] cvm194 crmd: info: do_cib_control: Disconnecting CIB Jul 29 09:14:43 [16236] cvm194 crmd: info: crmd_cib_connection_destroy: Connection to the CIB terminated... Jul 29 09:14:43 [16236] cvm194 crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets Jul 29 09:14:43 [16236] cvm194 crmd: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd Jul 29 09:14:43 [16236] cvm194 crmd: info: do_exit: [crmd] stopped (0) Jul 29 09:14:43 [16236] cvm194 crmd: info: crmd_exit: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ] (In reply to zhu.rong from comment #3) > (In reply to Andrew Beekhof from comment #0) > > Description of problem: > > > > A poorly timed (non-DC) shutdown request can be lost, resulting in a cluster > > that appears hung for shutdown-escalation seconds. > > After this time, pacemaker on the peer will exiting without having shut down > > any of its services. > > > > The DC will eventually notice this and fence the node (assuming it is > > enabled). > > > > Version-Release number of selected component (if applicable): > > > > Feb 18 18:23:48 [37360] stage-bare-ctrl-1.localdomain crmd: notice: > > main: CRM Git Version: 1.1.13-10.el7 (44eb2dd) > > > > How reproducible: > > > > random > > > > Steps to Reproduce: > > > > 1. 
shut down a node at just the wrong time (after a new DC is elected) > > > > Actual results: > > > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ > > input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ] > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > abort_transition_graph: Transition aborted: Peer Cancelled > > (source=do_te_invoke:151, 1) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: warning: > > match_down_event: No match for shutdown action on 1 > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > abort_transition_graph: Transition aborted by deletion of lrm[@id='1']: > > Resource state removal (cib=0.171.16, source=abort_unless_down:333, > > path=/cib/status/node_state[@id='1']/lrm[@id='1'], 1) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > abort_transition_graph: Transition aborted: LRM Refresh > > (source=process_resource_updates:266, 1) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > crm_update_peer_expected: handle_request: Node stage-bare-ctrl-0[1] - > > expected state is now down (was member) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > handle_shutdown_request: Creating shutdown request for stage-bare-ctrl-0 > > (state=S_POLICY_ENGINE) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > abort_transition_graph: Transition aborted by status-1-shutdown, > > shutdown=1455858801: Transient attribute change (modify cib=0.171.23, > > source=abort_unless_down:319, > > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/ > > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > abort_transition_graph: Transition aborted by status-1-shutdown, shutdown=0: > > Transient attribute change (modify cib=0.171.24, > > source=abort_unless_down:319, > > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/ > > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1) > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE > > [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] > > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain crmd: info: > > do_te_invoke: Processing graph 32 (ref=pe_calc-dc-1455858801-579) derived > > from /var/lib/pacemaker/pengine/pe-input-13.bz2 > > > > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain crmd: error: > > crm_timer_popped: Shutdown Escalation (I_STOP) just popped in state > > S_NOT_DC! (1200000ms) > > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain crmd: error: > > lrm_state_verify_stopped: 41 resources were active at shutdown. > > > > Expected results: > > > > Node shuts down promptly. > > > > Additional info: > > > > It appears abort_transition_graph() is not correctly restarting the PE run > > (invalidating the existing query that predated the handle_shutdown_request() > > call). > > > hello,when I use pacemaker,there are three node and six resources in cluster.I set the > maintenance-mode=true,then I run systemctl stop pacemaker and systemctl stop > corosync on all node in turn. 
But sometimes there was one node take 20 > minutes to shutdown pacemaker and this node is not DC node.The DC node is > the last one to shutdown.Here is my corosync log, is this a bug? > > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: + > /cib: @num_updates=27 > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=cvm193/attrd/191, version=0.16.27) > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=cvm193/attrd/192, version=0.16.27) > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=cvm193/attrd/193, version=0.16.27) > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: > Diff: --- 0.16.27 2 > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: > Diff: +++ 0.16.28 (null) > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_perform_op: + > /cib: @num_updates=28, @dc-uuid=1 > Jul 29 08:54:04 [16231] cvm194 cib: info: cib_process_request: > Completed cib_modify operation for section cib: OK (rc=0, > origin=cvm193/crmd/58, version=0.16.28) > Jul 29 08:54:09 [16231] cvm194 cib: info: cib_process_ping: > Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for > 0.16.28 (0x28f3f90 0) > Jul 29 08:54:43 [16230] cvm194 pacemakerd: notice: crm_signal_dispatch: > Invoking handler for signal 15: Terminated > Jul 29 08:54:43 [16230] cvm194 pacemakerd: notice: pcmk_shutdown_worker: > Shuting down Pacemaker > Connection closed by foreign host.emakerd: notice: stop_child: > Stopping crmd: Sent -15 to process 16236 > Jul 29 08:54:43 [16236] cvm194 crmd: notice: crm_signal_dispatch: > Invoking handler for signal 15: Terminated > Type `help' to learn how to use Xshell prompt.otice: crm_shutdown: > Requesting shutdown, upper limit is 1200000ms > Xshell:\> 54:43 [16236] cvm194 crmd: info: do_shutdown_req: > Sending shutdown request to cvm193 > Jul 29 08:54:43 [16236] cvm194 crmd: info: > crm_update_peer_expected: handle_request: Node cvm194[3] - expected state > is now down (was (null)) > Jul 29 08:54:43 [16234] cvm194 attrd: info: attrd_peer_update: > Setting shutdown[cvm194]: 0 -> 1469753683 from cvm193 > Jul 29 08:54:43 [16231] cvm194 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=cvm193/attrd/194, version=0.16.28) > Jul 29 08:54:48 [16231] cvm194 cib: info: cib_process_ping: > Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for > 0.16.28 (0x28f3f90 0) > Jul 29 09:14:43 [16236] cvm194 crmd: error: crm_timer_popped: > Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms) > Jul 29 09:14:43 [16236] cvm194 crmd: notice: do_state_transition: > State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED > origin=crm_timer_popped ] > Jul 29 09:14:43 [16236] cvm194 crmd: info: do_shutdown: > Disconnecting STONITH... > Jul 29 09:14:43 [16236] cvm194 crmd: info: > tengine_stonith_connection_destroy: Fencing daemon disconnected > Jul 29 09:14:43 [16236] cvm194 crmd: notice: > lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown... > waiting (0 ops remaining) > Jul 29 09:14:43 [16236] cvm194 crmd: error: > lrm_state_verify_stopped: 1 resources were active at shutdown. 
> Jul 29 09:14:43 [16236] cvm194 crmd: info: do_lrm_control: > Disconnecting from the LRM > Jul 29 09:14:43 [16236] cvm194 crmd: info: lrmd_api_disconnect: > Disconnecting from 1 lrmd service > Jul 29 09:14:43 [16236] cvm194 crmd: info: > lrmd_ipc_connection_destroy: IPC connection destroyed > Jul 29 09:14:43 [16236] cvm194 crmd: info: lrm_connection_destroy: > LRM Connection disconnected > Jul 29 09:14:43 [16236] cvm194 crmd: info: lrmd_api_disconnect: > Disconnecting from 1 lrmd service > Jul 29 09:14:43 [16236] cvm194 crmd: notice: do_lrm_control: > Disconnected from the LRM > Jul 29 09:14:43 [16236] cvm194 crmd: info: crm_cluster_disconnect: > Disconnecting from cluster infrastructure: corosync > Jul 29 09:14:43 [16236] cvm194 crmd: notice: > terminate_cs_connection: Disconnecting from Corosync > Jul 29 09:14:43 [16236] cvm194 crmd: info: crm_cluster_disconnect: > Disconnected from corosync > Jul 29 09:14:43 [16236] cvm194 crmd: info: do_ha_control: > Disconnected from the cluster > Jul 29 09:14:43 [16236] cvm194 crmd: info: do_cib_control: > Disconnecting CIB > Jul 29 09:14:43 [16236] cvm194 crmd: info: > crmd_cib_connection_destroy: Connection to the CIB terminated... > Jul 29 09:14:43 [16236] cvm194 crmd: info: qb_ipcs_us_withdraw: > withdrawing server sockets > Jul 29 09:14:43 [16236] cvm194 crmd: info: do_exit: Performing > A_EXIT_0 - gracefully exiting the CRMd > Jul 29 09:14:43 [16236] cvm194 crmd: info: do_exit: [crmd] > stopped (0) > Jul 29 09:14:43 [16236] cvm194 crmd: info: crmd_exit: Dropping > I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ] my corosync and pacemaker is: corosync-2.3.4-7.el7.x86_64 pacemaker-1.1.13-10.el7.x86_64 Sounds like this bug. (In reply to Andrew Beekhof from comment #5) > Sounds like this bug. think you so much! Can you attach sos reports from your 3 nodes please? It would be useful to have some more recent logs (In reply to Andrew Beekhof from comment #7) > Can you attach sos reports from your 3 nodes please? > It would be useful to have some more recent logs Sorry,please tell me what sos reports is? And the environment is not exists at this time. (In reply to zhu.rong from comment #8) > (In reply to Andrew Beekhof from comment #7) > > Can you attach sos reports from your 3 nodes please? > > It would be useful to have some more recent logs > > Sorry,please tell me what sos reports is? And the environment is not exists > at this time. sosreport is a tool for collecting logs and configuration from a cluster for troubleshooting purposes. But if you don't have access to the environment any longer, it doesn't matter. Based on the logs here, I see one possible way this might happen, but I would expect it to be possible only if the affected node is rebooting (or at least restarting cluster services) at the same time the cluster is shutting down. Additional logs could help confirm whether that is the case here. If anyone sees this issue happen again, please attach the output of either sosreport or crm_report here. In the meantime, I will address the issue I've found, and if we receive no further information, I'll assume it take cares of this issue. (In reply to Ken Gaillot from comment #9) > (In reply to zhu.rong from comment #8) > > (In reply to Andrew Beekhof from comment #7) > > > Can you attach sos reports from your 3 nodes please? > > > It would be useful to have some more recent logs > > > > Sorry,please tell me what sos reports is? And the environment is not exists > > at this time. 
> > sosreport is a tool for collecting logs and configuration from a cluster for > troubleshooting purposes. But if you don't have access to the environment > any longer, it doesn't matter. > > Based on the logs here, I see one possible way this might happen, but I > would expect it to be possible only if the affected node is rebooting (or at > least restarting cluster services) at the same time the cluster is shutting > down. Additional logs could help confirm whether that is the case here. > > If anyone sees this issue happen again, please attach the output of either > sosreport or crm_report here. In the meantime, I will address the issue I've > found, and if we receive no further information, I'll assume it take cares > of this issue. Thank you! Sometimes it happens in my work recently,if it happens again,I'll attach the crm_report here. Created attachment 1188541 [details]
crm_report
This is the crm_report of my cluster. I have a three-node pacemaker cluster; I entered maintenance-mode and then used systemctl stop corosync, but it hung for 20 minutes. I checked the pacemaker status with systemctl status pacemaker and it showed deactivating, even though I had only run systemctl stop corosync.
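For reference, a minimal sketch of the shutdown sequence being described here, assuming a pcs-managed RHEL 7 cluster; the observation commands and the choice of node are illustrative, not part of the original report:

```sh
# Put the whole cluster in maintenance mode so resources are left alone.
pcs property set maintenance-mode=true

# On one non-DC node, stop the stack; stopping corosync also pulls down
# pacemaker through the systemd dependency.
systemctl stop corosync

# Watch the shutdown from another terminal. When the shutdown request is
# lost, pacemaker sits in "deactivating" until the shutdown-escalation
# timer (20 minutes by default) expires.
systemctl status pacemaker
journalctl -u pacemaker -f
```

Later comments in this report recommend "pcs cluster stop" instead of stopping corosync directly, which stops pacemaker and corosync in the proper order.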
(In reply to zhu.rong from comment #11) > Created attachment 1188541 [details] > crm_report > > This is the crm_report of my cluster. I have a three node pacemaker > cluster,when entered maintenance-mode and then use systemctl stop > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl > status pacemaker,it's show me deactivating.But I just use systemctl stop > corosync. Sorry,there is no /var/log/cluster/corosync.conf log, I'll collect again when next time hangs on. (In reply to zhu.rong from comment #12) > (In reply to zhu.rong from comment #11) > > Created attachment 1188541 [details] > > crm_report > > > > This is the crm_report of my cluster. I have a three node pacemaker > > cluster,when entered maintenance-mode and then use systemctl stop > > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl > > status pacemaker,it's show me deactivating.But I just use systemctl stop > > corosync. > > Sorry,there is no /var/log/cluster/corosync.conf log, I'll collect again > when next time hangs on. Often, the system log is sufficient, but here it's mostly empty, so it's difficult to see what's going on. From the live CIB, I can see that node cvm50 has not completed an election (join="down") even though its crmd peer membership is active (in_ccm="true" and expected="member"). Without the logs, I don't know why this happened, but it doesn't sound like the issue I found, since the node must have previously joined successfully. If you can get the system log and/or corosync.log next time it happens, that would help. Also, I would recommend configuring fencing, as it is required to recover from certain failure scenarios. (In reply to zhu.rong from comment #11) > Created attachment 1188541 [details] > crm_report > > This is the crm_report of my cluster. I have a three node pacemaker > cluster,when entered maintenance-mode and then use systemctl stop > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl > status pacemaker,it's show me deactivating.But I just use systemctl stop > corosync. Somehow I overlooked this before responding. Corosync should not be stopped while pacemaker is running, even in maintenance mode. In maintenance mode, it is safe to stop pacemaker -- it won't stop any running resources. The recommended command is "pcs cluster stop". (In reply to Ken Gaillot from comment #14) > (In reply to zhu.rong from comment #11) > > Created attachment 1188541 [details] > > crm_report > > > > This is the crm_report of my cluster. I have a three node pacemaker > > cluster,when entered maintenance-mode and then use systemctl stop > > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl > > status pacemaker,it's show me deactivating.But I just use systemctl stop > > corosync. > > Somehow I overlooked this before responding. Corosync should not be stopped > while pacemaker is running, even in maintenance mode. In maintenance mode, > it is safe to stop pacemaker -- it won't stop any running resources. The > recommended command is "pcs cluster stop". To correct myself: "systemctl stop corosync" *should* work, because systemd should stop pacemaker first as a dependency of corosync. More logs should help figure out what went wrong. Created attachment 1188961 [details]
crm_report
It appeared again, and this time I collected corosync.log into the zip file. My major resource is running on cvm51; I entered maintenance-mode and used systemctl stop corosync on cvm50 at about 17:21, and then it hung. Thanks for your reply!
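When the hang reoccurs, a log bundle scoped to the incident window is the most useful thing to attach. A sketch, assuming the timestamps from this comment; the exact crm_report options can vary between versions (see crm_report --help):

```sh
# Collect cluster logs and CIB history for the window around the hang
# (times and destination name are illustrative).
crm_report -f "2016-08-08 17:00:00" -t "2016-08-08 18:00:00" /tmp/shutdown-hang

# Alternatively, an sosreport from each node captures the same cluster
# logs plus general system configuration.
sosreport
```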
(In reply to Ken Gaillot from comment #14) > (In reply to zhu.rong from comment #11) > > Created attachment 1188541 [details] > > crm_report > > > > This is the crm_report of my cluster. I have a three node pacemaker > > cluster,when entered maintenance-mode and then use systemctl stop > > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl > > status pacemaker,it's show me deactivating.But I just use systemctl stop > > corosync. > > Somehow I overlooked this before responding. Corosync should not be stopped > while pacemaker is running, even in maintenance mode. In maintenance mode, > it is safe to stop pacemaker -- it won't stop any running resources. The > recommended command is "pcs cluster stop". I use the "pcs cluster stop" command,but it still hangs on 20 minutes. Created attachment 1189012 [details]
Reboot hangs on
Hello, there is another hang problem when I reboot a pacemaker cluster node. I rebooted at about 09:45 and it showed me the error "Failed to start engine.service: Transaction is destructive". The engine.service is the resource running on this node; I don't know why pacemaker always tries to start the engine service during shutdown. Thank you!
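One way to check whether systemd itself (rather than Pacemaker) is still trying to start the unit at boot or shutdown, as a sketch; engine.service is the unit named in this comment, and the commands assume a systemd-based node:

```sh
# Is the unit enabled in systemd's own boot configuration?
systemctl is-enabled engine.service

# Which targets or units still pull it in?
systemctl list-dependencies --reverse engine.service

# If the service should be managed only by Pacemaker, take it out of
# systemd's control (this is what the following reply recommends).
systemctl disable engine.service
```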
(In reply to zhu.rong from comment #18) > Created attachment 1189012 [details] > Reboot hangs on > > Hello,there is another hangs on problem when I reboot the pacemaker cluster > node.I reboot at about 09:45,and it show me the error "Failed to start > engine.service: Transaction is destructive". The engine.service is the > resource running on this node,I don't known why pacemaker always try to > start the engine service when shutdown.Thank you! That message is from systemd, rather than pacemaker. It indicates that engine.service is enabled in systemd. If engine is supposed to be managed by pacemaker, run "systemctl disable engine.service" to take it out of systemd's control. (In reply to zhu.rong from comment #16) > Created attachment 1188961 [details] > crm_report > > It's appear again,I collected the corosync.log to the zip file. My major > resource is running on cvm51,then I entered maintenance-mode,use systemctl > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! An important log here is: Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for 5964 ms, flushing membership messages. Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token timeout increase. which indicates that high load is preventing corosync from maintaining communication. I do not think your issue is related to this bug report; the issue can probably be resolved with some configuration changes and investigation of the load issue. I recommend opening a case with Red Hat's Global Support Services group, which you can initiate through one of the methods listed at the following link: https://access.redhat.com/start/how-to-engage-red-hat-support Through a support case, we can address the full range of issues involved, and recommend configuration improvements, beyond the scope of this bug report. (In reply to Ken Gaillot from comment #20) > (In reply to zhu.rong from comment #16) > > Created attachment 1188961 [details] > > crm_report > > > > It's appear again,I collected the corosync.log to the zip file. My major > > resource is running on cvm51,then I entered maintenance-mode,use systemctl > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! > > An important log here is: > > Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for > 5964 ms, flushing membership messages. > > Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token > timeout increase. > > which indicates that high load is preventing corosync from maintaining > communication. I do not think your issue is related to this bug report; the > issue can probably be resolved with some configuration changes and > investigation of the load issue. > > I recommend opening a case with Red Hat's Global Support Services group, > which > you can initiate through one of the methods listed at the following link: > > https://access.redhat.com/start/how-to-engage-red-hat-support > > Through a support case, we can address the full range of issues involved, > and recommend configuration improvements, beyond the scope of this bug > report. Thank you very muck! 
But it happened again,and there is no high load message!When I use systemctl stop corosync,it hangs on."systemctl status pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log shows: cvm193 pacemakerd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated cvm193 pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker cvm193 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 25909 cvm193 crmd: notice: crm_signal_dispatch: Invoking handler for signal 15: Terminated cvm193 crmd: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms cvm193 crmd: info: do_shutdown_req: Sending shutdown request to cvm194 cvm193 crmd: info: crm_update_peer_expected: handle_request: Node cvm193[1] - expected state is now down (was member) cvm193 attrd: info: attrd_peer_update: Setting shutdown[cvm193]: 0 -> 1470808578 from cvm194 What's the "Requesting shutdown, upper limit is 1200000ms" meanning? (In reply to zhu.rong from comment #21) > (In reply to Ken Gaillot from comment #20) > > (In reply to zhu.rong from comment #16) > > > Created attachment 1188961 [details] > > > crm_report > > > > > > It's appear again,I collected the corosync.log to the zip file. My major > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! > > > > An important log here is: > > > > Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for > > 5964 ms, flushing membership messages. > > > > Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token > > timeout increase. > > > > which indicates that high load is preventing corosync from maintaining > > communication. I do not think your issue is related to this bug report; the > > issue can probably be resolved with some configuration changes and > > investigation of the load issue. > > > > I recommend opening a case with Red Hat's Global Support Services group, > > which > > you can initiate through one of the methods listed at the following link: > > > > https://access.redhat.com/start/how-to-engage-red-hat-support > > > > Through a support case, we can address the full range of issues involved, > > and recommend configuration improvements, beyond the scope of this bug > > report. > > Thank you very muck! But it happened again,and there is no high load > message!When I use systemctl stop corosync,it hangs on."systemctl status > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log > shows: > > cvm193 pacemakerd: notice: crm_signal_dispatch: Invoking handler > for signal 15: Terminated > cvm193 pacemakerd: notice: pcmk_shutdown_worker: Shuting down > Pacemaker > cvm193 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 > to process 25909 > cvm193 crmd: notice: crm_signal_dispatch: Invoking handler > for signal 15: Terminated > cvm193 crmd: notice: crm_shutdown: Requesting shutdown, > upper limit is 1200000ms > cvm193 crmd: info: do_shutdown_req: Sending shutdown request > to cvm194 > cvm193 crmd: info: crm_update_peer_expected: handle_request: > Node cvm193[1] - expected state is now down (was member) > cvm193 attrd: info: attrd_peer_update: Setting shutdown[cvm193]: > 0 -> 1470808578 from cvm194 > > What's the "Requesting shutdown, upper limit is 1200000ms" meanning? 
It's the standard message indicating the beginning of a shutdown, showing the value of the shutdown-escalation cluster option (the default 20 minutes in this case). That is a timeout for the shutdown process -- if the DC does not stop all resources on the node and notify the node it is safe to shutdown within that time, the node will immediately shut down. The root of the issue you're seeing and in the original report here is that the DC is not telling the node it's OK to shut down, so the node waits until shutdown-escalation time has passed, then shuts down. For your case, I'd recommend disabling all cluster-managed services in systemd, configuring fencing, and tweaking the corosync configuration (ensuring it's running real-time, and increasing the token timeout if appropriate). Thanks.But I disabed all cluster-managed services in systemd. Can I change shutdown-escalation time to 10 minutes or 5 minutes? (In reply to Ken Gaillot from comment #19) > (In reply to zhu.rong from comment #18) > > Created attachment 1189012 [details] > > Reboot hangs on > > > > Hello,there is another hangs on problem when I reboot the pacemaker cluster > > node.I reboot at about 09:45,and it show me the error "Failed to start > > engine.service: Transaction is destructive". The engine.service is the > > resource running on this node,I don't known why pacemaker always try to > > start the engine service when shutdown.Thank you! > > That message is from systemd, rather than pacemaker. It indicates that > engine.service is enabled in systemd. If engine is supposed to be managed by > pacemaker, run "systemctl disable engine.service" to take it out of > systemd's control. Yes,I'm sure I disabled the engine.service befroe hangs on. (In reply to Ken Gaillot from comment #22) > (In reply to zhu.rong from comment #21) > > (In reply to Ken Gaillot from comment #20) > > > (In reply to zhu.rong from comment #16) > > > > Created attachment 1188961 [details] > > > > crm_report > > > > > > > > It's appear again,I collected the corosync.log to the zip file. My major > > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl > > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! > > > > > > An important log here is: > > > > > > Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for > > > 5964 ms, flushing membership messages. > > > > > > Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not > > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token > > > timeout increase. > > > > > > which indicates that high load is preventing corosync from maintaining > > > communication. I do not think your issue is related to this bug report; the > > > issue can probably be resolved with some configuration changes and > > > investigation of the load issue. > > > > > > I recommend opening a case with Red Hat's Global Support Services group, > > > which > > > you can initiate through one of the methods listed at the following link: > > > > > > https://access.redhat.com/start/how-to-engage-red-hat-support > > > > > > Through a support case, we can address the full range of issues involved, > > > and recommend configuration improvements, beyond the scope of this bug > > > report. > > > > Thank you very muck! 
But it happened again,and there is no high load > > message!When I use systemctl stop corosync,it hangs on."systemctl status > > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log > > shows: > > > > cvm193 pacemakerd: notice: crm_signal_dispatch: Invoking handler > > for signal 15: Terminated > > cvm193 pacemakerd: notice: pcmk_shutdown_worker: Shuting down > > Pacemaker > > cvm193 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 > > to process 25909 > > cvm193 crmd: notice: crm_signal_dispatch: Invoking handler > > for signal 15: Terminated > > cvm193 crmd: notice: crm_shutdown: Requesting shutdown, > > upper limit is 1200000ms > > cvm193 crmd: info: do_shutdown_req: Sending shutdown request > > to cvm194 > > cvm193 crmd: info: crm_update_peer_expected: handle_request: > > Node cvm193[1] - expected state is now down (was member) > > cvm193 attrd: info: attrd_peer_update: Setting shutdown[cvm193]: > > 0 -> 1470808578 from cvm194 > > > > What's the "Requesting shutdown, upper limit is 1200000ms" meanning? > > It's the standard message indicating the beginning of a shutdown, showing > the value of the shutdown-escalation cluster option (the default 20 minutes > in this case). That is a timeout for the shutdown process -- if the DC does > not stop all resources on the node and notify the node it is safe to > shutdown within that time, the node will immediately shut down. > > The root of the issue you're seeing and in the original report here is that > the DC is not telling the node it's OK to shut down, so the node waits until > shutdown-escalation time has passed, then shuts down. > > For your case, I'd recommend disabling all cluster-managed services in > systemd, configuring fencing, and tweaking the corosync configuration > (ensuring it's running real-time, and increasing the token timeout if > appropriate). "ensuring it's running real-time, and increasing the token timeout if appropriate". This mean set what property or someting else? (In reply to zhu.rong from comment #23) > Thanks.But I disabed all cluster-managed services in systemd. Can I change > shutdown-escalation time to 10 minutes or 5 minutes? You can set it to whatever you want, but consider the maximum time it might take to actually stop all your resources. If any resources are not finished stopping when the escalation is reached, the cluster exits immediately. (In reply to zhu.rong from comment #24) > (In reply to Ken Gaillot from comment #19) > > (In reply to zhu.rong from comment #18) > > > Created attachment 1189012 [details] > > > Reboot hangs on > > > > > > Hello,there is another hangs on problem when I reboot the pacemaker cluster > > > node.I reboot at about 09:45,and it show me the error "Failed to start > > > engine.service: Transaction is destructive". The engine.service is the > > > resource running on this node,I don't known why pacemaker always try to > > > start the engine service when shutdown.Thank you! > > > > That message is from systemd, rather than pacemaker. It indicates that > > engine.service is enabled in systemd. If engine is supposed to be managed by > > pacemaker, run "systemctl disable engine.service" to take it out of > > systemd's control. > > Yes,I'm sure I disabled the engine.service befroe hangs on. "Transaction is destructive" is a message from systemd. You may want to search for that online to figure out why that's happening. 
(In reply to zhu.rong from comment #25) > (In reply to Ken Gaillot from comment #22) > > (In reply to zhu.rong from comment #21) > > > (In reply to Ken Gaillot from comment #20) > > > > (In reply to zhu.rong from comment #16) > > > > > Created attachment 1188961 [details] > > > > > crm_report > > > > > > > > > > It's appear again,I collected the corosync.log to the zip file. My major > > > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl > > > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! > > > > > > > > An important log here is: > > > > > > > > Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for > > > > 5964 ms, flushing membership messages. > > > > > > > > Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not > > > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token > > > > timeout increase. > > > > > > > > which indicates that high load is preventing corosync from maintaining > > > > communication. I do not think your issue is related to this bug report; the > > > > issue can probably be resolved with some configuration changes and > > > > investigation of the load issue. > > > > > > > > I recommend opening a case with Red Hat's Global Support Services group, > > > > which > > > > you can initiate through one of the methods listed at the following link: > > > > > > > > https://access.redhat.com/start/how-to-engage-red-hat-support > > > > > > > > Through a support case, we can address the full range of issues involved, > > > > and recommend configuration improvements, beyond the scope of this bug > > > > report. > > > > > > Thank you very muck! But it happened again,and there is no high load > > > message!When I use systemctl stop corosync,it hangs on."systemctl status > > > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log > > > shows: > > > > > > cvm193 pacemakerd: notice: crm_signal_dispatch: Invoking handler > > > for signal 15: Terminated > > > cvm193 pacemakerd: notice: pcmk_shutdown_worker: Shuting down > > > Pacemaker > > > cvm193 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 > > > to process 25909 > > > cvm193 crmd: notice: crm_signal_dispatch: Invoking handler > > > for signal 15: Terminated > > > cvm193 crmd: notice: crm_shutdown: Requesting shutdown, > > > upper limit is 1200000ms > > > cvm193 crmd: info: do_shutdown_req: Sending shutdown request > > > to cvm194 > > > cvm193 crmd: info: crm_update_peer_expected: handle_request: > > > Node cvm193[1] - expected state is now down (was member) > > > cvm193 attrd: info: attrd_peer_update: Setting shutdown[cvm193]: > > > 0 -> 1470808578 from cvm194 > > > > > > What's the "Requesting shutdown, upper limit is 1200000ms" meanning? > > > > It's the standard message indicating the beginning of a shutdown, showing > > the value of the shutdown-escalation cluster option (the default 20 minutes > > in this case). That is a timeout for the shutdown process -- if the DC does > > not stop all resources on the node and notify the node it is safe to > > shutdown within that time, the node will immediately shut down. > > > > The root of the issue you're seeing and in the original report here is that > > the DC is not telling the node it's OK to shut down, so the node waits until > > shutdown-escalation time has passed, then shuts down. 
> > > > For your case, I'd recommend disabling all cluster-managed services in > > systemd, configuring fencing, and tweaking the corosync configuration > > (ensuring it's running real-time, and increasing the token timeout if > > appropriate). > "ensuring it's running real-time, and increasing the token timeout if > appropriate". This mean set what property or someting else? I believe if you're running corosync 2.3.4-4 or later, it's realtime. For the token timeout, see the corosync.conf man page. Here is the sequence for reproducing the original bug: 1. Start a node, wait for it to become the DC 2. Start a second node 3. Tell the second node to stop while it is in the process of negotiating with the DC. Specifically just after do_cl_join_finalize_respond() is called on the second node. 4. Cross your fingers that somehow the shutdown=0 update makes it to attrd _after_ the DC sets shutdown=${large int} Under these conditions, the request to shut down will be lost and the DC will feel free to start services on the second node. Fix in: https://github.com/beekhof/pacemaker/commit/19d273a (In reply to Ken Gaillot from comment #26) > (In reply to zhu.rong from comment #23) > > Thanks.But I disabed all cluster-managed services in systemd. Can I change > > shutdown-escalation time to 10 minutes or 5 minutes? > > You can set it to whatever you want, but consider the maximum time it might > take to actually stop all your resources. If any resources are not finished > stopping when the escalation is reached, the cluster exits immediately. Thank you! (In reply to Ken Gaillot from comment #28) > (In reply to zhu.rong from comment #25) > > (In reply to Ken Gaillot from comment #22) > > > (In reply to zhu.rong from comment #21) > > > > (In reply to Ken Gaillot from comment #20) > > > > > (In reply to zhu.rong from comment #16) > > > > > > Created attachment 1188961 [details] > > > > > > crm_report > > > > > > > > > > > > It's appear again,I collected the corosync.log to the zip file. My major > > > > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl > > > > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply! > > > > > > > > > > An important log here is: > > > > > > > > > > Aug 8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for > > > > > 5964 ms, flushing membership messages. > > > > > > > > > > Aug 8 17:17:05 cvm50 corosync[1719]: [MAIN ] Corosync main process was not > > > > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token > > > > > timeout increase. > > > > > > > > > > which indicates that high load is preventing corosync from maintaining > > > > > communication. I do not think your issue is related to this bug report; the > > > > > issue can probably be resolved with some configuration changes and > > > > > investigation of the load issue. > > > > > > > > > > I recommend opening a case with Red Hat's Global Support Services group, > > > > > which > > > > > you can initiate through one of the methods listed at the following link: > > > > > > > > > > https://access.redhat.com/start/how-to-engage-red-hat-support > > > > > > > > > > Through a support case, we can address the full range of issues involved, > > > > > and recommend configuration improvements, beyond the scope of this bug > > > > > report. > > > > > > > > Thank you very muck! 
But it happened again,and there is no high load > > > > message!When I use systemctl stop corosync,it hangs on."systemctl status > > > > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log > > > > shows: > > > > > > > > cvm193 pacemakerd: notice: crm_signal_dispatch: Invoking handler > > > > for signal 15: Terminated > > > > cvm193 pacemakerd: notice: pcmk_shutdown_worker: Shuting down > > > > Pacemaker > > > > cvm193 pacemakerd: notice: stop_child: Stopping crmd: Sent -15 > > > > to process 25909 > > > > cvm193 crmd: notice: crm_signal_dispatch: Invoking handler > > > > for signal 15: Terminated > > > > cvm193 crmd: notice: crm_shutdown: Requesting shutdown, > > > > upper limit is 1200000ms > > > > cvm193 crmd: info: do_shutdown_req: Sending shutdown request > > > > to cvm194 > > > > cvm193 crmd: info: crm_update_peer_expected: handle_request: > > > > Node cvm193[1] - expected state is now down (was member) > > > > cvm193 attrd: info: attrd_peer_update: Setting shutdown[cvm193]: > > > > 0 -> 1470808578 from cvm194 > > > > > > > > What's the "Requesting shutdown, upper limit is 1200000ms" meanning? > > > > > > It's the standard message indicating the beginning of a shutdown, showing > > > the value of the shutdown-escalation cluster option (the default 20 minutes > > > in this case). That is a timeout for the shutdown process -- if the DC does > > > not stop all resources on the node and notify the node it is safe to > > > shutdown within that time, the node will immediately shut down. > > > > > > The root of the issue you're seeing and in the original report here is that > > > the DC is not telling the node it's OK to shut down, so the node waits until > > > shutdown-escalation time has passed, then shuts down. > > > > > > For your case, I'd recommend disabling all cluster-managed services in > > > systemd, configuring fencing, and tweaking the corosync configuration > > > (ensuring it's running real-time, and increasing the token timeout if > > > appropriate). > > "ensuring it's running real-time, and increasing the token timeout if > > appropriate". This mean set what property or someting else? > > I believe if you're running corosync 2.3.4-4 or later, it's realtime. For > the token timeout, see the corosync.conf man page. Thanks for you help! Hello,Ken Gaillot,I have another problem to consult you! I have three network cards on every node, when I use host name to setup cluster(pcs cluster setup),how can I let corosync communicate with others throw which the network card I want? Mapping the ip specified on the nic I want to use with host name in /etc/hosts firstly? Sorry to trouble you again. (In reply to zhu.rong from comment #32) > Hello,Ken Gaillot,I have another problem to consult you! I have three > network cards on every node, when I use host name to setup cluster(pcs > cluster setup),how can I let corosync communicate with others throw which > the network card I want? Mapping the ip specified on the nic I want to use > with host name in /etc/hosts firstly? Sorry to trouble you again. Hi Zhu, We'd like to request that you engage with Red Hat Support to further explore this topic and any other questions you might have on usage of the product or recommended practices. We need to keep this bug report focused on the problem it was originally opened for so that we can stay on track in resolving that issue. If you do not have a subscription with Red Hat support, you can also engage with members of the community to learn more about corosync and pacemaker. 
http://clusterlabs.org/mailman/listinfo/users is a useful starting point. You can also check out the content in the Red Hat Customer Portal at https://access.redhat.com, the corosync wiki at https://github.com/corosync/corosync/wiki, and the clusterlabs site at http://clusterlabs.org/. Thanks, John Ruemker, Principal Software Maintenance Engineer, Red Hat Support

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html