Bug 1310486 - Pacemaker looses shutdown requests under some conditions
Pacemaker looses shutdown requests under some conditions
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker (Show other bugs)
7.2
Unspecified Unspecified
medium Severity high
: rc
: 7.3
Assigned To: Ken Gaillot
cluster-qe@redhat.com
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-02-21 19:22 EST by Andrew Beekhof
Modified: 2018-01-15 13:50 EST (History)
9 users (show)

See Also:
Fixed In Version: pacemaker-1.1.15-10.el7
Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 14:58:46 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
crm_report (339.87 KB, application/x-bzip)
2016-08-08 02:15 EDT, zhu.rong
no flags Details
crm_report (738.27 KB, application/zip)
2016-08-08 20:29 EDT, zhu.rong
no flags Details
Reboot hangs on (1.99 MB, application/zip)
2016-08-08 23:44 EDT, zhu.rong
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3252151 None None None 2018-01-15 13:50 EST

  None (edit)
Description Andrew Beekhof 2016-02-21 19:22:57 EST
Description of problem:

A poorly timed (non-DC) shutdown request can be lost, resulting in a cluster that appears hung for shutdown-escalation seconds.
After this time, pacemaker on the peer will exiting without having shut down any of its services.

The DC will eventually notice this and fence the node (assuming it is enabled).

Version-Release number of selected component (if applicable):

Feb 18 18:23:48 [37360] stage-bare-ctrl-1.localdomain       crmd:   notice: main:	CRM Git Version: 1.1.13-10.el7 (44eb2dd)

How reproducible:

random

Steps to Reproduce:

1. shut down a node at just the wrong time (after a new DC is elected) 

Actual results:

Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: do_state_transition:	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: abort_transition_graph:	Transition aborted: Peer Cancelled (source=do_te_invoke:151, 1)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:  warning: match_down_event:	No match for shutdown action on 1
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: abort_transition_graph:	Transition aborted by deletion of lrm[@id='1']: Resource state removal (cib=0.171.16, source=abort_unless_down:333, path=/cib/status/node_state[@id='1']/lrm[@id='1'], 1)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: abort_transition_graph:	Transition aborted: LRM Refresh (source=process_resource_updates:266, 1)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: crm_update_peer_expected:	handle_request: Node stage-bare-ctrl-0[1] - expected state is now down (was member)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: handle_shutdown_request:	Creating shutdown request for stage-bare-ctrl-0 (state=S_POLICY_ENGINE)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: abort_transition_graph:	Transition aborted by status-1-shutdown, shutdown=1455858801: Transient attribute change (modify cib=0.171.23, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: abort_transition_graph:	Transition aborted by status-1-shutdown, shutdown=0: Transient attribute change (modify cib=0.171.24, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: do_state_transition:	State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info: do_te_invoke:	Processing graph 32 (ref=pe_calc-dc-1455858801-579) derived from /var/lib/pacemaker/pengine/pe-input-13.bz2

Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error: crm_timer_popped:    Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error: lrm_state_verify_stopped:    41 resources were active at shutdown.

Expected results:

Node shuts down promptly.

Additional info:

It appears abort_transition_graph() is not correctly restarting the PE run (invalidating the existing query that predated the handle_shutdown_request() call).
Comment 2 Ken Gaillot 2016-05-16 12:32:45 EDT
This will be investigated in the 7.4 timeframe.
Comment 3 zhu.rong 2016-07-28 22:22:42 EDT
(In reply to Andrew Beekhof from comment #0)
> Description of problem:
> 
> A poorly timed (non-DC) shutdown request can be lost, resulting in a cluster
> that appears hung for shutdown-escalation seconds.
> After this time, pacemaker on the peer will exiting without having shut down
> any of its services.
> 
> The DC will eventually notice this and fence the node (assuming it is
> enabled).
> 
> Version-Release number of selected component (if applicable):
> 
> Feb 18 18:23:48 [37360] stage-bare-ctrl-1.localdomain       crmd:   notice:
> main:	CRM Git Version: 1.1.13-10.el7 (44eb2dd)
> 
> How reproducible:
> 
> random
> 
> Steps to Reproduce:
> 
> 1. shut down a node at just the wrong time (after a new DC is elected) 
> 
> Actual results:
> 
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> do_state_transition:	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [
> input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> abort_transition_graph:	Transition aborted: Peer Cancelled
> (source=do_te_invoke:151, 1)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:  warning:
> match_down_event:	No match for shutdown action on 1
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> abort_transition_graph:	Transition aborted by deletion of lrm[@id='1']:
> Resource state removal (cib=0.171.16, source=abort_unless_down:333,
> path=/cib/status/node_state[@id='1']/lrm[@id='1'], 1)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> abort_transition_graph:	Transition aborted: LRM Refresh
> (source=process_resource_updates:266, 1)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> crm_update_peer_expected:	handle_request: Node stage-bare-ctrl-0[1] -
> expected state is now down (was member)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> handle_shutdown_request:	Creating shutdown request for stage-bare-ctrl-0
> (state=S_POLICY_ENGINE)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> abort_transition_graph:	Transition aborted by status-1-shutdown,
> shutdown=1455858801: Transient attribute change (modify cib=0.171.23,
> source=abort_unless_down:319,
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/
> instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> abort_transition_graph:	Transition aborted by status-1-shutdown, shutdown=0:
> Transient attribute change (modify cib=0.171.24,
> source=abort_unless_down:319,
> path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/
> instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> do_state_transition:	State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE
> [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> do_te_invoke:	Processing graph 32 (ref=pe_calc-dc-1455858801-579) derived
> from /var/lib/pacemaker/pengine/pe-input-13.bz2
> 
> Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error:
> crm_timer_popped:    Shutdown Escalation (I_STOP) just popped in state
> S_NOT_DC! (1200000ms)
> Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error:
> lrm_state_verify_stopped:    41 resources were active at shutdown.
> 
> Expected results:
> 
> Node shuts down promptly.
> 
> Additional info:
> 
> It appears abort_transition_graph() is not correctly restarting the PE run
> (invalidating the existing query that predated the handle_shutdown_request()
> call).


hello,when I use pacemaker,there is three node in cluster.I set the maintenance-mode=true,then I run systemctl stop pacemaker and systemctl stop corosync on all node in turn. But sometimes there was one node take 20 minutes to shutdown pacemaker and this node is not DC node.The DC node is the last one to shutdown.Here is my corosync log, is this a bug?

Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    +  /cib:  @num_updates=27
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/191, version=0.16.27)
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/192, version=0.16.27)
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/193, version=0.16.27)
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    Diff: --- 0.16.27 2
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    Diff: +++ 0.16.28 (null)
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    +  /cib:  @num_updates=28, @dc-uuid=1
Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:       Completed cib_modify operation for section cib: OK (rc=0, origin=cvm193/crmd/58, version=0.16.28)
Jul 29 08:54:09 [16231] cvm194        cib:     info: cib_process_ping:  Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for 0.16.28 (0x28f3f90 0)
Jul 29 08:54:43 [16230] cvm194 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler for signal 15: Terminated
Jul 29 08:54:43 [16230] cvm194 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down Pacemaker
Connection closed by foreign host.emakerd:   notice: stop_child:        Stopping crmd: Sent -15 to process 16236
Jul 29 08:54:43 [16236] cvm194       crmd:   notice: crm_signal_dispatch:       Invoking handler for signal 15: Terminated
Type `help' to learn how to use Xshell prompt.otice: crm_shutdown:      Requesting shutdown, upper limit is 1200000ms
Xshell:\> 54:43 [16236] cvm194       crmd:     info: do_shutdown_req:   Sending shutdown request to cvm193
Jul 29 08:54:43 [16236] cvm194       crmd:     info: crm_update_peer_expected:  handle_request: Node cvm194[3] - expected state is now down (was (null))
Jul 29 08:54:43 [16234] cvm194      attrd:     info: attrd_peer_update: Setting shutdown[cvm194]: 0 -> 1469753683 from cvm193
Jul 29 08:54:43 [16231] cvm194        cib:     info: cib_process_request:       Completed cib_modify operation for section status: OK (rc=0, origin=cvm193/attrd/194, version=0.16.28)
Jul 29 08:54:48 [16231] cvm194        cib:     info: cib_process_ping:  Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for 0.16.28 (0x28f3f90 0)
Jul 29 09:14:43 [16236] cvm194       crmd:    error: crm_timer_popped:  Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
Jul 29 09:14:43 [16236] cvm194       crmd:   notice: do_state_transition:       State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_shutdown:       Disconnecting STONITH...
Jul 29 09:14:43 [16236] cvm194       crmd:     info: tengine_stonith_connection_destroy:        Fencing daemon disconnected
Jul 29 09:14:43 [16236] cvm194       crmd:   notice: lrm_state_verify_stopped:  Stopped 0 recurring operations at shutdown... waiting (0 ops remaining)
Jul 29 09:14:43 [16236] cvm194       crmd:    error: lrm_state_verify_stopped:  1 resources were active at shutdown.
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_lrm_control:    Disconnecting from the LRM
Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrmd_api_disconnect:       Disconnecting from 1 lrmd service
Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrmd_ipc_connection_destroy:       IPC connection destroyed
Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrm_connection_destroy:    LRM Connection disconnected
Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrmd_api_disconnect:       Disconnecting from 1 lrmd service
Jul 29 09:14:43 [16236] cvm194       crmd:   notice: do_lrm_control:    Disconnected from the LRM
Jul 29 09:14:43 [16236] cvm194       crmd:     info: crm_cluster_disconnect:    Disconnecting from cluster infrastructure: corosync
Jul 29 09:14:43 [16236] cvm194       crmd:   notice: terminate_cs_connection:   Disconnecting from Corosync
Jul 29 09:14:43 [16236] cvm194       crmd:     info: crm_cluster_disconnect:    Disconnected from corosync
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_ha_control:     Disconnected from the cluster
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_cib_control:    Disconnecting CIB
Jul 29 09:14:43 [16236] cvm194       crmd:     info: crmd_cib_connection_destroy:       Connection to the CIB terminated...
Jul 29 09:14:43 [16236] cvm194       crmd:     info: qb_ipcs_us_withdraw:       withdrawing server sockets
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_exit:   Performing A_EXIT_0 - gracefully exiting the CRMd
Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_exit:   [crmd] stopped (0)
Jul 29 09:14:43 [16236] cvm194       crmd:     info: crmd_exit: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Comment 4 zhu.rong 2016-07-28 22:29:09 EDT
(In reply to zhu.rong from comment #3)
> (In reply to Andrew Beekhof from comment #0)
> > Description of problem:
> > 
> > A poorly timed (non-DC) shutdown request can be lost, resulting in a cluster
> > that appears hung for shutdown-escalation seconds.
> > After this time, pacemaker on the peer will exiting without having shut down
> > any of its services.
> > 
> > The DC will eventually notice this and fence the node (assuming it is
> > enabled).
> > 
> > Version-Release number of selected component (if applicable):
> > 
> > Feb 18 18:23:48 [37360] stage-bare-ctrl-1.localdomain       crmd:   notice:
> > main:	CRM Git Version: 1.1.13-10.el7 (44eb2dd)
> > 
> > How reproducible:
> > 
> > random
> > 
> > Steps to Reproduce:
> > 
> > 1. shut down a node at just the wrong time (after a new DC is elected) 
> > 
> > Actual results:
> > 
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > do_state_transition:	State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [
> > input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > abort_transition_graph:	Transition aborted: Peer Cancelled
> > (source=do_te_invoke:151, 1)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:  warning:
> > match_down_event:	No match for shutdown action on 1
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > abort_transition_graph:	Transition aborted by deletion of lrm[@id='1']:
> > Resource state removal (cib=0.171.16, source=abort_unless_down:333,
> > path=/cib/status/node_state[@id='1']/lrm[@id='1'], 1)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > abort_transition_graph:	Transition aborted: LRM Refresh
> > (source=process_resource_updates:266, 1)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > crm_update_peer_expected:	handle_request: Node stage-bare-ctrl-0[1] -
> > expected state is now down (was member)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > handle_shutdown_request:	Creating shutdown request for stage-bare-ctrl-0
> > (state=S_POLICY_ENGINE)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > abort_transition_graph:	Transition aborted by status-1-shutdown,
> > shutdown=1455858801: Transient attribute change (modify cib=0.171.23,
> > source=abort_unless_down:319,
> > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/
> > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > abort_transition_graph:	Transition aborted by status-1-shutdown, shutdown=0:
> > Transient attribute change (modify cib=0.171.24,
> > source=abort_unless_down:319,
> > path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/
> > instance_attributes[@id='status-1']/nvpair[@id='status-1-shutdown'], 1)
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > do_state_transition:	State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE
> > [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > Feb 19 00:13:21 [4754] stage-bare-ctrl-1.localdomain       crmd:     info:
> > do_te_invoke:	Processing graph 32 (ref=pe_calc-dc-1455858801-579) derived
> > from /var/lib/pacemaker/pengine/pe-input-13.bz2
> > 
> > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error:
> > crm_timer_popped:    Shutdown Escalation (I_STOP) just popped in state
> > S_NOT_DC! (1200000ms)
> > Feb 19 00:33:21 [4959] stage-bare-ctrl-0.localdomain       crmd:    error:
> > lrm_state_verify_stopped:    41 resources were active at shutdown.
> > 
> > Expected results:
> > 
> > Node shuts down promptly.
> > 
> > Additional info:
> > 
> > It appears abort_transition_graph() is not correctly restarting the PE run
> > (invalidating the existing query that predated the handle_shutdown_request()
> > call).
> 
> 
> hello,when I use pacemaker,there are three node and six resources in cluster.I set the
> maintenance-mode=true,then I run systemctl stop pacemaker and systemctl stop
> corosync on all node in turn. But sometimes there was one node take 20
> minutes to shutdown pacemaker and this node is not DC node.The DC node is
> the last one to shutdown.Here is my corosync log, is this a bug?
> 
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    + 
> /cib:  @num_updates=27
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:   
> Completed cib_modify operation for section status: OK (rc=0,
> origin=cvm193/attrd/191, version=0.16.27)
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:   
> Completed cib_modify operation for section status: OK (rc=0,
> origin=cvm193/attrd/192, version=0.16.27)
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:   
> Completed cib_modify operation for section status: OK (rc=0,
> origin=cvm193/attrd/193, version=0.16.27)
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:   
> Diff: --- 0.16.27 2
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:   
> Diff: +++ 0.16.28 (null)
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_perform_op:    + 
> /cib:  @num_updates=28, @dc-uuid=1
> Jul 29 08:54:04 [16231] cvm194        cib:     info: cib_process_request:   
> Completed cib_modify operation for section cib: OK (rc=0,
> origin=cvm193/crmd/58, version=0.16.28)
> Jul 29 08:54:09 [16231] cvm194        cib:     info: cib_process_ping: 
> Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for
> 0.16.28 (0x28f3f90 0)
> Jul 29 08:54:43 [16230] cvm194 pacemakerd:   notice: crm_signal_dispatch:   
> Invoking handler for signal 15: Terminated
> Jul 29 08:54:43 [16230] cvm194 pacemakerd:   notice: pcmk_shutdown_worker:  
> Shuting down Pacemaker
> Connection closed by foreign host.emakerd:   notice: stop_child:       
> Stopping crmd: Sent -15 to process 16236
> Jul 29 08:54:43 [16236] cvm194       crmd:   notice: crm_signal_dispatch:   
> Invoking handler for signal 15: Terminated
> Type `help' to learn how to use Xshell prompt.otice: crm_shutdown:     
> Requesting shutdown, upper limit is 1200000ms
> Xshell:\> 54:43 [16236] cvm194       crmd:     info: do_shutdown_req:  
> Sending shutdown request to cvm193
> Jul 29 08:54:43 [16236] cvm194       crmd:     info:
> crm_update_peer_expected:  handle_request: Node cvm194[3] - expected state
> is now down (was (null))
> Jul 29 08:54:43 [16234] cvm194      attrd:     info: attrd_peer_update:
> Setting shutdown[cvm194]: 0 -> 1469753683 from cvm193
> Jul 29 08:54:43 [16231] cvm194        cib:     info: cib_process_request:   
> Completed cib_modify operation for section status: OK (rc=0,
> origin=cvm193/attrd/194, version=0.16.28)
> Jul 29 08:54:48 [16231] cvm194        cib:     info: cib_process_ping: 
> Reporting our current digest to cvm193: 0372dc59ea38d9a5635c8093bde14e61 for
> 0.16.28 (0x28f3f90 0)
> Jul 29 09:14:43 [16236] cvm194       crmd:    error: crm_timer_popped: 
> Shutdown Escalation (I_STOP) just popped in state S_NOT_DC! (1200000ms)
> Jul 29 09:14:43 [16236] cvm194       crmd:   notice: do_state_transition:   
> State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_shutdown:      
> Disconnecting STONITH...
> Jul 29 09:14:43 [16236] cvm194       crmd:     info:
> tengine_stonith_connection_destroy:        Fencing daemon disconnected
> Jul 29 09:14:43 [16236] cvm194       crmd:   notice:
> lrm_state_verify_stopped:  Stopped 0 recurring operations at shutdown...
> waiting (0 ops remaining)
> Jul 29 09:14:43 [16236] cvm194       crmd:    error:
> lrm_state_verify_stopped:  1 resources were active at shutdown.
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_lrm_control:   
> Disconnecting from the LRM
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrmd_api_disconnect:   
> Disconnecting from 1 lrmd service
> Jul 29 09:14:43 [16236] cvm194       crmd:     info:
> lrmd_ipc_connection_destroy:       IPC connection destroyed
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrm_connection_destroy:
> LRM Connection disconnected
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: lrmd_api_disconnect:   
> Disconnecting from 1 lrmd service
> Jul 29 09:14:43 [16236] cvm194       crmd:   notice: do_lrm_control:   
> Disconnected from the LRM
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: crm_cluster_disconnect:
> Disconnecting from cluster infrastructure: corosync
> Jul 29 09:14:43 [16236] cvm194       crmd:   notice:
> terminate_cs_connection:   Disconnecting from Corosync
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: crm_cluster_disconnect:
> Disconnected from corosync
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_ha_control:    
> Disconnected from the cluster
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_cib_control:   
> Disconnecting CIB
> Jul 29 09:14:43 [16236] cvm194       crmd:     info:
> crmd_cib_connection_destroy:       Connection to the CIB terminated...
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: qb_ipcs_us_withdraw:   
> withdrawing server sockets
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_exit:   Performing
> A_EXIT_0 - gracefully exiting the CRMd
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: do_exit:   [crmd]
> stopped (0)
> Jul 29 09:14:43 [16236] cvm194       crmd:     info: crmd_exit: Dropping
> I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]


my corosync and pacemaker is:
corosync-2.3.4-7.el7.x86_64
pacemaker-1.1.13-10.el7.x86_64
Comment 5 Andrew Beekhof 2016-07-28 22:59:36 EDT
Sounds like this bug.
Comment 6 zhu.rong 2016-07-28 23:15:04 EDT
(In reply to Andrew Beekhof from comment #5)
> Sounds like this bug.

think you so much!
Comment 7 Andrew Beekhof 2016-08-01 23:51:42 EDT
Can you attach sos reports from your 3 nodes please?
It would be useful to have some more recent logs
Comment 8 zhu.rong 2016-08-02 00:19:27 EDT
(In reply to Andrew Beekhof from comment #7)
> Can you attach sos reports from your 3 nodes please?
> It would be useful to have some more recent logs

Sorry,please tell me what sos reports is? And the environment is not exists at this time.
Comment 9 Ken Gaillot 2016-08-05 10:55:15 EDT
(In reply to zhu.rong from comment #8)
> (In reply to Andrew Beekhof from comment #7)
> > Can you attach sos reports from your 3 nodes please?
> > It would be useful to have some more recent logs
> 
> Sorry,please tell me what sos reports is? And the environment is not exists
> at this time.

sosreport is a tool for collecting logs and configuration from a cluster for troubleshooting purposes. But if you don't have access to the environment any longer, it doesn't matter.

Based on the logs here, I see one possible way this might happen, but I would expect it to be possible only if the affected node is rebooting (or at least restarting cluster services) at the same time the cluster is shutting down. Additional logs could help confirm whether that is the case here.

If anyone sees this issue happen again, please attach the output of either sosreport or crm_report here. In the meantime, I will address the issue I've found, and if we receive no further information, I'll assume it take cares of this issue.
Comment 10 zhu.rong 2016-08-06 02:14:38 EDT
(In reply to Ken Gaillot from comment #9)
> (In reply to zhu.rong from comment #8)
> > (In reply to Andrew Beekhof from comment #7)
> > > Can you attach sos reports from your 3 nodes please?
> > > It would be useful to have some more recent logs
> > 
> > Sorry,please tell me what sos reports is? And the environment is not exists
> > at this time.
> 
> sosreport is a tool for collecting logs and configuration from a cluster for
> troubleshooting purposes. But if you don't have access to the environment
> any longer, it doesn't matter.
> 
> Based on the logs here, I see one possible way this might happen, but I
> would expect it to be possible only if the affected node is rebooting (or at
> least restarting cluster services) at the same time the cluster is shutting
> down. Additional logs could help confirm whether that is the case here.
> 
> If anyone sees this issue happen again, please attach the output of either
> sosreport or crm_report here. In the meantime, I will address the issue I've
> found, and if we receive no further information, I'll assume it take cares
> of this issue.

Thank you! Sometimes it happens in my work recently,if it happens again,I'll attach the crm_report here.
Comment 11 zhu.rong 2016-08-08 02:15 EDT
Created attachment 1188541 [details]
crm_report

This is the crm_report of my cluster. I have a three node pacemaker cluster,when entered maintenance-mode and then use systemctl stop corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl status pacemaker,it's show me deactivating.But I just use systemctl stop corosync.
Comment 12 zhu.rong 2016-08-08 03:02:00 EDT
(In reply to zhu.rong from comment #11)
> Created attachment 1188541 [details]
> crm_report
> 
> This is the crm_report of my cluster. I have a three node pacemaker
> cluster,when entered maintenance-mode and then use systemctl stop
> corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl
> status pacemaker,it's show me deactivating.But I just use systemctl stop
> corosync.

Sorry,there is no /var/log/cluster/corosync.conf log, I'll collect again when next time hangs on.
Comment 13 Ken Gaillot 2016-08-08 11:32:20 EDT
(In reply to zhu.rong from comment #12)
> (In reply to zhu.rong from comment #11)
> > Created attachment 1188541 [details]
> > crm_report
> > 
> > This is the crm_report of my cluster. I have a three node pacemaker
> > cluster,when entered maintenance-mode and then use systemctl stop
> > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl
> > status pacemaker,it's show me deactivating.But I just use systemctl stop
> > corosync.
> 
> Sorry,there is no /var/log/cluster/corosync.conf log, I'll collect again
> when next time hangs on.

Often, the system log is sufficient, but here it's mostly empty, so it's difficult to see what's going on.

From the live CIB, I can see that node cvm50 has not completed an election (join="down") even though its crmd peer membership is active (in_ccm="true" and expected="member"). Without the logs, I don't know why this happened, but it doesn't sound like the issue I found, since the node must have previously joined successfully.

If you can get the system log and/or corosync.log next time it happens, that would help.

Also, I would recommend configuring fencing, as it is required to recover from certain failure scenarios.
Comment 14 Ken Gaillot 2016-08-08 11:35:21 EDT
(In reply to zhu.rong from comment #11)
> Created attachment 1188541 [details]
> crm_report
> 
> This is the crm_report of my cluster. I have a three node pacemaker
> cluster,when entered maintenance-mode and then use systemctl stop
> corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl
> status pacemaker,it's show me deactivating.But I just use systemctl stop
> corosync.

Somehow I overlooked this before responding. Corosync should not be stopped while pacemaker is running, even in maintenance mode. In maintenance mode, it is safe to stop pacemaker -- it won't stop any running resources. The recommended command is "pcs cluster stop".
Comment 15 Ken Gaillot 2016-08-08 11:51:34 EDT
(In reply to Ken Gaillot from comment #14)
> (In reply to zhu.rong from comment #11)
> > Created attachment 1188541 [details]
> > crm_report
> > 
> > This is the crm_report of my cluster. I have a three node pacemaker
> > cluster,when entered maintenance-mode and then use systemctl stop
> > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl
> > status pacemaker,it's show me deactivating.But I just use systemctl stop
> > corosync.
> 
> Somehow I overlooked this before responding. Corosync should not be stopped
> while pacemaker is running, even in maintenance mode. In maintenance mode,
> it is safe to stop pacemaker -- it won't stop any running resources. The
> recommended command is "pcs cluster stop".

To correct myself: "systemctl stop corosync" *should* work, because systemd should stop pacemaker first as a dependency of corosync.

More logs should help figure out what went wrong.
Comment 16 zhu.rong 2016-08-08 20:29 EDT
Created attachment 1188961 [details]
crm_report

It's appear again,I collected the corosync.log to the zip file. My major resource is running on cvm51,then I entered maintenance-mode,use systemctl stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
Comment 17 zhu.rong 2016-08-08 21:23:04 EDT
(In reply to Ken Gaillot from comment #14)
> (In reply to zhu.rong from comment #11)
> > Created attachment 1188541 [details]
> > crm_report
> > 
> > This is the crm_report of my cluster. I have a three node pacemaker
> > cluster,when entered maintenance-mode and then use systemctl stop
> > corosync,but it hangs 20 minutes.I check the pacemaker status use systemctl
> > status pacemaker,it's show me deactivating.But I just use systemctl stop
> > corosync.
> 
> Somehow I overlooked this before responding. Corosync should not be stopped
> while pacemaker is running, even in maintenance mode. In maintenance mode,
> it is safe to stop pacemaker -- it won't stop any running resources. The
> recommended command is "pcs cluster stop".

I use the "pcs cluster stop" command,but it still hangs on 20 minutes.
Comment 18 zhu.rong 2016-08-08 23:44 EDT
Created attachment 1189012 [details]
Reboot hangs on

Hello,there is another hangs on problem when I reboot the pacemaker cluster node.I reboot at about 09:45,and it show me the error "Failed to start engine.service: Transaction is destructive". The engine.service is the resource running on this node,I don't known why pacemaker always try to start the engine service when shutdown.Thank you!
Comment 19 Ken Gaillot 2016-08-09 10:16:53 EDT
(In reply to zhu.rong from comment #18)
> Created attachment 1189012 [details]
> Reboot hangs on
> 
> Hello,there is another hangs on problem when I reboot the pacemaker cluster
> node.I reboot at about 09:45,and it show me the error "Failed to start
> engine.service: Transaction is destructive". The engine.service is the
> resource running on this node,I don't known why pacemaker always try to
> start the engine service when shutdown.Thank you!

That message is from systemd, rather than pacemaker. It indicates that engine.service is enabled in systemd. If engine is supposed to be managed by pacemaker, run "systemctl disable engine.service" to take it out of systemd's control.
Comment 20 Ken Gaillot 2016-08-09 10:23:24 EDT
(In reply to zhu.rong from comment #16)
> Created attachment 1188961 [details]
> crm_report
> 
> It's appear again,I collected the corosync.log to the zip file. My major
> resource is running on cvm51,then I entered maintenance-mode,use systemctl
> stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!

An important log here is:

Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for 5964 ms, flushing membership messages.

Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token timeout increase.

which indicates that high load is preventing corosync from maintaining communication. I do not think your issue is related to this bug report; the issue can probably be resolved with some configuration changes and investigation of the load issue.

I recommend opening a case with Red Hat's Global Support Services group, which
you can initiate through one of the methods listed at the following link:

  https://access.redhat.com/start/how-to-engage-red-hat-support

Through a support case, we can address the full range of issues involved, and recommend configuration improvements, beyond the scope of this bug report.
Comment 21 zhu.rong 2016-08-10 02:31:42 EDT
(In reply to Ken Gaillot from comment #20)
> (In reply to zhu.rong from comment #16)
> > Created attachment 1188961 [details]
> > crm_report
> > 
> > It's appear again,I collected the corosync.log to the zip file. My major
> > resource is running on cvm51,then I entered maintenance-mode,use systemctl
> > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
> 
> An important log here is:
> 
> Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for
> 5964 ms, flushing membership messages.
> 
> Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not
> scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token
> timeout increase.
> 
> which indicates that high load is preventing corosync from maintaining
> communication. I do not think your issue is related to this bug report; the
> issue can probably be resolved with some configuration changes and
> investigation of the load issue.
> 
> I recommend opening a case with Red Hat's Global Support Services group,
> which
> you can initiate through one of the methods listed at the following link:
> 
>   https://access.redhat.com/start/how-to-engage-red-hat-support
> 
> Through a support case, we can address the full range of issues involved,
> and recommend configuration improvements, beyond the scope of this bug
> report.

Thank you very muck! But it happened again,and there is no high load message!When I use systemctl stop corosync,it hangs on."systemctl status pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log shows:

   cvm193 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler for signal 15: Terminated
   cvm193 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down Pacemaker
   cvm193 pacemakerd:   notice: stop_child:        Stopping crmd: Sent -15 to process 25909
   cvm193       crmd:   notice: crm_signal_dispatch:       Invoking handler for signal 15: Terminated
   cvm193       crmd:   notice: crm_shutdown:      Requesting shutdown, upper limit is 1200000ms
   cvm193       crmd:     info: do_shutdown_req:   Sending shutdown request to cvm194
   cvm193       crmd:     info: crm_update_peer_expected:  handle_request: Node cvm193[1] - expected state is now down (was member)
   cvm193      attrd:     info: attrd_peer_update: Setting shutdown[cvm193]: 0 -> 1470808578 from cvm194

    What's the "Requesting shutdown, upper limit is 1200000ms" meanning?
Comment 22 Ken Gaillot 2016-08-10 16:18:02 EDT
(In reply to zhu.rong from comment #21)
> (In reply to Ken Gaillot from comment #20)
> > (In reply to zhu.rong from comment #16)
> > > Created attachment 1188961 [details]
> > > crm_report
> > > 
> > > It's appear again,I collected the corosync.log to the zip file. My major
> > > resource is running on cvm51,then I entered maintenance-mode,use systemctl
> > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
> > 
> > An important log here is:
> > 
> > Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for
> > 5964 ms, flushing membership messages.
> > 
> > Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not
> > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token
> > timeout increase.
> > 
> > which indicates that high load is preventing corosync from maintaining
> > communication. I do not think your issue is related to this bug report; the
> > issue can probably be resolved with some configuration changes and
> > investigation of the load issue.
> > 
> > I recommend opening a case with Red Hat's Global Support Services group,
> > which
> > you can initiate through one of the methods listed at the following link:
> > 
> >   https://access.redhat.com/start/how-to-engage-red-hat-support
> > 
> > Through a support case, we can address the full range of issues involved,
> > and recommend configuration improvements, beyond the scope of this bug
> > report.
> 
> Thank you very muck! But it happened again,and there is no high load
> message!When I use systemctl stop corosync,it hangs on."systemctl status
> pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log
> shows:
> 
>    cvm193 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler
> for signal 15: Terminated
>    cvm193 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down
> Pacemaker
>    cvm193 pacemakerd:   notice: stop_child:        Stopping crmd: Sent -15
> to process 25909
>    cvm193       crmd:   notice: crm_signal_dispatch:       Invoking handler
> for signal 15: Terminated
>    cvm193       crmd:   notice: crm_shutdown:      Requesting shutdown,
> upper limit is 1200000ms
>    cvm193       crmd:     info: do_shutdown_req:   Sending shutdown request
> to cvm194
>    cvm193       crmd:     info: crm_update_peer_expected:  handle_request:
> Node cvm193[1] - expected state is now down (was member)
>    cvm193      attrd:     info: attrd_peer_update: Setting shutdown[cvm193]:
> 0 -> 1470808578 from cvm194
> 
>     What's the "Requesting shutdown, upper limit is 1200000ms" meanning?

It's the standard message indicating the beginning of a shutdown, showing the value of the shutdown-escalation cluster option (the default 20 minutes in this case). That is a timeout for the shutdown process -- if the DC does not stop all resources on the node and notify the node it is safe to shutdown within that time, the node will immediately shut down.

The root of the issue you're seeing and in the original report here is that the DC is not telling the node it's OK to shut down, so the node waits until shutdown-escalation time has passed, then shuts down.

For your case, I'd recommend disabling all cluster-managed services in systemd, configuring fencing, and tweaking the corosync configuration (ensuring it's running real-time, and increasing the token timeout if appropriate).
Comment 23 zhu.rong 2016-08-10 20:45:30 EDT
Thanks.But I disabed all cluster-managed services in systemd. Can I change shutdown-escalation time to 10 minutes or 5 minutes?
Comment 24 zhu.rong 2016-08-10 21:15:29 EDT
(In reply to Ken Gaillot from comment #19)
> (In reply to zhu.rong from comment #18)
> > Created attachment 1189012 [details]
> > Reboot hangs on
> > 
> > Hello,there is another hangs on problem when I reboot the pacemaker cluster
> > node.I reboot at about 09:45,and it show me the error "Failed to start
> > engine.service: Transaction is destructive". The engine.service is the
> > resource running on this node,I don't known why pacemaker always try to
> > start the engine service when shutdown.Thank you!
> 
> That message is from systemd, rather than pacemaker. It indicates that
> engine.service is enabled in systemd. If engine is supposed to be managed by
> pacemaker, run "systemctl disable engine.service" to take it out of
> systemd's control.

Yes,I'm sure I disabled the engine.service befroe hangs on.
Comment 25 zhu.rong 2016-08-10 21:17:39 EDT
(In reply to Ken Gaillot from comment #22)
> (In reply to zhu.rong from comment #21)
> > (In reply to Ken Gaillot from comment #20)
> > > (In reply to zhu.rong from comment #16)
> > > > Created attachment 1188961 [details]
> > > > crm_report
> > > > 
> > > > It's appear again,I collected the corosync.log to the zip file. My major
> > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl
> > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
> > > 
> > > An important log here is:
> > > 
> > > Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for
> > > 5964 ms, flushing membership messages.
> > > 
> > > Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not
> > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token
> > > timeout increase.
> > > 
> > > which indicates that high load is preventing corosync from maintaining
> > > communication. I do not think your issue is related to this bug report; the
> > > issue can probably be resolved with some configuration changes and
> > > investigation of the load issue.
> > > 
> > > I recommend opening a case with Red Hat's Global Support Services group,
> > > which
> > > you can initiate through one of the methods listed at the following link:
> > > 
> > >   https://access.redhat.com/start/how-to-engage-red-hat-support
> > > 
> > > Through a support case, we can address the full range of issues involved,
> > > and recommend configuration improvements, beyond the scope of this bug
> > > report.
> > 
> > Thank you very muck! But it happened again,and there is no high load
> > message!When I use systemctl stop corosync,it hangs on."systemctl status
> > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log
> > shows:
> > 
> >    cvm193 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler
> > for signal 15: Terminated
> >    cvm193 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down
> > Pacemaker
> >    cvm193 pacemakerd:   notice: stop_child:        Stopping crmd: Sent -15
> > to process 25909
> >    cvm193       crmd:   notice: crm_signal_dispatch:       Invoking handler
> > for signal 15: Terminated
> >    cvm193       crmd:   notice: crm_shutdown:      Requesting shutdown,
> > upper limit is 1200000ms
> >    cvm193       crmd:     info: do_shutdown_req:   Sending shutdown request
> > to cvm194
> >    cvm193       crmd:     info: crm_update_peer_expected:  handle_request:
> > Node cvm193[1] - expected state is now down (was member)
> >    cvm193      attrd:     info: attrd_peer_update: Setting shutdown[cvm193]:
> > 0 -> 1470808578 from cvm194
> > 
> >     What's the "Requesting shutdown, upper limit is 1200000ms" meanning?
> 
> It's the standard message indicating the beginning of a shutdown, showing
> the value of the shutdown-escalation cluster option (the default 20 minutes
> in this case). That is a timeout for the shutdown process -- if the DC does
> not stop all resources on the node and notify the node it is safe to
> shutdown within that time, the node will immediately shut down.
> 
> The root of the issue you're seeing and in the original report here is that
> the DC is not telling the node it's OK to shut down, so the node waits until
> shutdown-escalation time has passed, then shuts down.
> 
> For your case, I'd recommend disabling all cluster-managed services in
> systemd, configuring fencing, and tweaking the corosync configuration
> (ensuring it's running real-time, and increasing the token timeout if
> appropriate).
"ensuring it's running real-time, and increasing the token timeout if
appropriate". This mean set what property or someting else?
Comment 26 Ken Gaillot 2016-08-11 15:03:35 EDT
(In reply to zhu.rong from comment #23)
> Thanks.But I disabed all cluster-managed services in systemd. Can I change
> shutdown-escalation time to 10 minutes or 5 minutes?

You can set it to whatever you want, but consider the maximum time it might take to actually stop all your resources. If any resources are not finished stopping when the escalation is reached, the cluster exits immediately.
Comment 27 Ken Gaillot 2016-08-11 15:08:01 EDT
(In reply to zhu.rong from comment #24)
> (In reply to Ken Gaillot from comment #19)
> > (In reply to zhu.rong from comment #18)
> > > Created attachment 1189012 [details]
> > > Reboot hangs on
> > > 
> > > Hello,there is another hangs on problem when I reboot the pacemaker cluster
> > > node.I reboot at about 09:45,and it show me the error "Failed to start
> > > engine.service: Transaction is destructive". The engine.service is the
> > > resource running on this node,I don't known why pacemaker always try to
> > > start the engine service when shutdown.Thank you!
> > 
> > That message is from systemd, rather than pacemaker. It indicates that
> > engine.service is enabled in systemd. If engine is supposed to be managed by
> > pacemaker, run "systemctl disable engine.service" to take it out of
> > systemd's control.
> 
> Yes,I'm sure I disabled the engine.service befroe hangs on.

"Transaction is destructive" is a message from systemd. You may want to search for that online to figure out why that's happening.
Comment 28 Ken Gaillot 2016-08-11 15:11:14 EDT
(In reply to zhu.rong from comment #25)
> (In reply to Ken Gaillot from comment #22)
> > (In reply to zhu.rong from comment #21)
> > > (In reply to Ken Gaillot from comment #20)
> > > > (In reply to zhu.rong from comment #16)
> > > > > Created attachment 1188961 [details]
> > > > > crm_report
> > > > > 
> > > > > It's appear again,I collected the corosync.log to the zip file. My major
> > > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl
> > > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
> > > > 
> > > > An important log here is:
> > > > 
> > > > Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for
> > > > 5964 ms, flushing membership messages.
> > > > 
> > > > Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not
> > > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token
> > > > timeout increase.
> > > > 
> > > > which indicates that high load is preventing corosync from maintaining
> > > > communication. I do not think your issue is related to this bug report; the
> > > > issue can probably be resolved with some configuration changes and
> > > > investigation of the load issue.
> > > > 
> > > > I recommend opening a case with Red Hat's Global Support Services group,
> > > > which
> > > > you can initiate through one of the methods listed at the following link:
> > > > 
> > > >   https://access.redhat.com/start/how-to-engage-red-hat-support
> > > > 
> > > > Through a support case, we can address the full range of issues involved,
> > > > and recommend configuration improvements, beyond the scope of this bug
> > > > report.
> > > 
> > > Thank you very muck! But it happened again,and there is no high load
> > > message!When I use systemctl stop corosync,it hangs on."systemctl status
> > > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log
> > > shows:
> > > 
> > >    cvm193 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler
> > > for signal 15: Terminated
> > >    cvm193 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down
> > > Pacemaker
> > >    cvm193 pacemakerd:   notice: stop_child:        Stopping crmd: Sent -15
> > > to process 25909
> > >    cvm193       crmd:   notice: crm_signal_dispatch:       Invoking handler
> > > for signal 15: Terminated
> > >    cvm193       crmd:   notice: crm_shutdown:      Requesting shutdown,
> > > upper limit is 1200000ms
> > >    cvm193       crmd:     info: do_shutdown_req:   Sending shutdown request
> > > to cvm194
> > >    cvm193       crmd:     info: crm_update_peer_expected:  handle_request:
> > > Node cvm193[1] - expected state is now down (was member)
> > >    cvm193      attrd:     info: attrd_peer_update: Setting shutdown[cvm193]:
> > > 0 -> 1470808578 from cvm194
> > > 
> > >     What's the "Requesting shutdown, upper limit is 1200000ms" meanning?
> > 
> > It's the standard message indicating the beginning of a shutdown, showing
> > the value of the shutdown-escalation cluster option (the default 20 minutes
> > in this case). That is a timeout for the shutdown process -- if the DC does
> > not stop all resources on the node and notify the node it is safe to
> > shutdown within that time, the node will immediately shut down.
> > 
> > The root of the issue you're seeing and in the original report here is that
> > the DC is not telling the node it's OK to shut down, so the node waits until
> > shutdown-escalation time has passed, then shuts down.
> > 
> > For your case, I'd recommend disabling all cluster-managed services in
> > systemd, configuring fencing, and tweaking the corosync configuration
> > (ensuring it's running real-time, and increasing the token timeout if
> > appropriate).
> "ensuring it's running real-time, and increasing the token timeout if
> appropriate". This mean set what property or someting else?

I believe if you're running corosync 2.3.4-4 or later, it's realtime. For the token timeout, see the corosync.conf man page.
Comment 29 Andrew Beekhof 2016-08-11 20:06:33 EDT
Here is the sequence for reproducing the original bug:

1. Start a node, wait for it to become the DC
2. Start a second node
3. Tell the second node to stop while it is in the process of negotiating with the DC.
   Specifically just after do_cl_join_finalize_respond() is called on the second node.
4. Cross your fingers that somehow the shutdown=0 update makes it to attrd _after_ the DC sets shutdown=${large int}

Under these conditions, the request to shut down will be lost and the DC will feel free to start services on the second node.

Fix in:  
  https://github.com/beekhof/pacemaker/commit/19d273a
Comment 30 zhu.rong 2016-08-11 20:28:00 EDT
(In reply to Ken Gaillot from comment #26)
> (In reply to zhu.rong from comment #23)
> > Thanks.But I disabed all cluster-managed services in systemd. Can I change
> > shutdown-escalation time to 10 minutes or 5 minutes?
> 
> You can set it to whatever you want, but consider the maximum time it might
> take to actually stop all your resources. If any resources are not finished
> stopping when the escalation is reached, the cluster exits immediately.

Thank you!
Comment 31 zhu.rong 2016-08-11 20:29:33 EDT
(In reply to Ken Gaillot from comment #28)
> (In reply to zhu.rong from comment #25)
> > (In reply to Ken Gaillot from comment #22)
> > > (In reply to zhu.rong from comment #21)
> > > > (In reply to Ken Gaillot from comment #20)
> > > > > (In reply to zhu.rong from comment #16)
> > > > > > Created attachment 1188961 [details]
> > > > > > crm_report
> > > > > > 
> > > > > > It's appear again,I collected the corosync.log to the zip file. My major
> > > > > > resource is running on cvm51,then I entered maintenance-mode,use systemctl
> > > > > > stop corosync on cvm50 at about 17:21,then it hangs on.Thanks for your reply!
> > > > > 
> > > > > An important log here is:
> > > > > 
> > > > > Aug  8 17:17:03 cvm50 corosync[1719]: [TOTEM ] Process pause detected for
> > > > > 5964 ms, flushing membership messages.
> > > > > 
> > > > > Aug  8 17:17:05 cvm50 corosync[1719]: [MAIN  ] Corosync main process was not
> > > > > scheduled for 8418.6572 ms (threshold is 1320.0000 ms). Consider token
> > > > > timeout increase.
> > > > > 
> > > > > which indicates that high load is preventing corosync from maintaining
> > > > > communication. I do not think your issue is related to this bug report; the
> > > > > issue can probably be resolved with some configuration changes and
> > > > > investigation of the load issue.
> > > > > 
> > > > > I recommend opening a case with Red Hat's Global Support Services group,
> > > > > which
> > > > > you can initiate through one of the methods listed at the following link:
> > > > > 
> > > > >   https://access.redhat.com/start/how-to-engage-red-hat-support
> > > > > 
> > > > > Through a support case, we can address the full range of issues involved,
> > > > > and recommend configuration improvements, beyond the scope of this bug
> > > > > report.
> > > > 
> > > > Thank you very muck! But it happened again,and there is no high load
> > > > message!When I use systemctl stop corosync,it hangs on."systemctl status
> > > > pacemaker" show than pacemaker is deactiving,bug hangs on.The corosync.log
> > > > shows:
> > > > 
> > > >    cvm193 pacemakerd:   notice: crm_signal_dispatch:       Invoking handler
> > > > for signal 15: Terminated
> > > >    cvm193 pacemakerd:   notice: pcmk_shutdown_worker:      Shuting down
> > > > Pacemaker
> > > >    cvm193 pacemakerd:   notice: stop_child:        Stopping crmd: Sent -15
> > > > to process 25909
> > > >    cvm193       crmd:   notice: crm_signal_dispatch:       Invoking handler
> > > > for signal 15: Terminated
> > > >    cvm193       crmd:   notice: crm_shutdown:      Requesting shutdown,
> > > > upper limit is 1200000ms
> > > >    cvm193       crmd:     info: do_shutdown_req:   Sending shutdown request
> > > > to cvm194
> > > >    cvm193       crmd:     info: crm_update_peer_expected:  handle_request:
> > > > Node cvm193[1] - expected state is now down (was member)
> > > >    cvm193      attrd:     info: attrd_peer_update: Setting shutdown[cvm193]:
> > > > 0 -> 1470808578 from cvm194
> > > > 
> > > >     What's the "Requesting shutdown, upper limit is 1200000ms" meanning?
> > > 
> > > It's the standard message indicating the beginning of a shutdown, showing
> > > the value of the shutdown-escalation cluster option (the default 20 minutes
> > > in this case). That is a timeout for the shutdown process -- if the DC does
> > > not stop all resources on the node and notify the node it is safe to
> > > shutdown within that time, the node will immediately shut down.
> > > 
> > > The root of the issue you're seeing and in the original report here is that
> > > the DC is not telling the node it's OK to shut down, so the node waits until
> > > shutdown-escalation time has passed, then shuts down.
> > > 
> > > For your case, I'd recommend disabling all cluster-managed services in
> > > systemd, configuring fencing, and tweaking the corosync configuration
> > > (ensuring it's running real-time, and increasing the token timeout if
> > > appropriate).
> > "ensuring it's running real-time, and increasing the token timeout if
> > appropriate". This mean set what property or someting else?
> 
> I believe if you're running corosync 2.3.4-4 or later, it's realtime. For
> the token timeout, see the corosync.conf man page.

Thanks for you help!
Comment 32 zhu.rong 2016-08-12 04:09:38 EDT
Hello,Ken Gaillot,I have another problem to consult you! I have three network cards on every node, when I use host name to setup cluster(pcs cluster setup),how can I let corosync communicate with others throw which the network card I want? Mapping the ip specified on the nic I want to use with host name in /etc/hosts firstly? Sorry to trouble you again.
Comment 33 John Ruemker 2016-08-12 09:25:11 EDT
(In reply to zhu.rong from comment #32)
> Hello,Ken Gaillot,I have another problem to consult you! I have three
> network cards on every node, when I use host name to setup cluster(pcs
> cluster setup),how can I let corosync communicate with others throw which
> the network card I want? Mapping the ip specified on the nic I want to use
> with host name in /etc/hosts firstly? Sorry to trouble you again.

Hi Zhu,
We'd like to request that you engage with Red Hat Support to further explore this topic and any other questions you might have on usage of the product or recommended practices.  We need to keep this bug report focused on the problem it was originally opened for so that we can stay on track in resolving that issue.  

If you do not have a subscription with Red Hat support, you can also engage with members of the community to learn more about corosync and pacemaker.  http://clusterlabs.org/mailman/listinfo/users is a useful starting point.  You can also check out the content in the Red Hat Customer Portal at https://access.redhat.com, the corosync wiki at https://github.com/corosync/corosync/wiki, and the clusterlabs site at http://clusterlabs.org/.  

Thanks,
John Ruemker
Principal Software Maintenance Engineer
Red Hat Support
Comment 38 errata-xmlrpc 2016-11-03 14:58:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html

Note You need to log in before you can comment on or make changes to this bug.