Description of problem:

I'm trying to understand why this fence event occurred (Apr 13 16:31:22 -- log time). A Pacemaker node in standby mode was fenced during a cluster stop operation.

The fenced node was running:
  pacemaker-1.1.12-22.el7_1.2.x86_64

It was being stopped to apply patches, including an upgrade to:
  pacemaker-1.1.13-10.el7_2.2.x86_64

pcmk-ocd2c003 is the node that was fenced; it was also the active DC. pcmk-ocd2c003 was put into standby, and we then waited until all resources were in the Stopped state. A few seconds after a cluster stop was executed on pcmk-ocd2c003, it was fenced. pcmk-ocd2c002 became the active DC. In the log I see that it reports pcmk-ocd2c003 as unclean and then fences it. It also appears from the log that the new DC believes there are still active resources on pcmk-ocd2c003.

I will link Pacemaker debug logs from pcmk-ocd2c002 and pcmk-ocd2c003 for review. Thanks for your help.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.2.x86_64 (cluster being upgraded to pacemaker-1.1.13-10.el7_2.2.x86_64)

How reproducible:
We attempted to reproduce but were unable to.

Steps to Reproduce:
1. Put the node (the active DC) into standby and wait for all resources to reach the Stopped state.
2. Execute a cluster stop on that node.
3. Observe the fence event a few seconds later.

Actual results:
The cluster node is fenced.

Expected results:
The cluster node shuts down without being fenced.

Additional info:
Will provide a link to the debug logs.
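The "wait until all resources are Stopped" step above can be sketched as a pre-stop check. This is a minimal illustration only: the status text below is invented sample data (the real input would come from something like `crm_mon -1`), and the resource names are hypothetical.

```shell
# Hedged sketch: before stopping a standby node, confirm no resource is
# still reported as active on it. Sample data is illustrative, not from
# the actual cluster; live input would come from: crm_mon -1
sample_status=$(cat <<'EOF'
 dummy-ip       (ocf::heartbeat:IPaddr2):       Started pcmk-ocd2c002
 dummy-fs       (ocf::heartbeat:Filesystem):    Stopped
 dummy-db       (ocf::heartbeat:pgsql):         Started pcmk-ocd2c004
EOF
)
# If any line still names the standby node as hosting a resource,
# hold off on the cluster stop.
if echo "$sample_status" | grep -q 'pcmk-ocd2c003'; then
    echo "resources still active on pcmk-ocd2c003 - do not stop yet"
else
    echo "no resources on pcmk-ocd2c003 - safe to run the cluster stop"
fi
```

With the sample data above, nothing is hosted on pcmk-ocd2c003, so the check reports it safe to proceed.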
Could I get a copy of this file please?

Apr 13 16:31:22 [3927] ocd2c002.osinst.net pengine: warning: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-13.bz2
I have requested the file from the customer.
Ken: The old DC correctly indicates that it intends to leave the cluster (note the origin):

Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.451.105 2
Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.451.106 (null)
Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=106
Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @crm-debug-origin=do_dc_release, @expected=down
Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c003/crmd/3142, version=0.451.106)

And the new DC correctly notices and updates the crmd status:

Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.452.15 2
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.452.16 (null)
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=16
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @crmd=offline, @crm-debug-origin=peer_update_callback
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/620, version=0.452.16)

However, it then goes and overwrites the expected status:

Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.452.16 2
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.452.17 (null)
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=17
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @in_ccm=false, @crm-debug-origin=post_cache_update, @expected=member
Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/625, version=0.452.17)

This matches the crmd's internal view of the state:

Apr 13 16:31:17 [3928] ocd2c002.osinst.net crmd: info: crm_update_peer_expected: update_dc: Node pcmk-ocd2c002[1] - expected state is now member (was (null))
Apr 13 16:31:17 [3928] ocd2c002.osinst.net crmd: info: crm_update_peer_expected: do_dc_join_filter_offer: Node pcmk-ocd2c004[2] - expected state is now member (was (null))

However, you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and received on ocd2c003):

Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: do_shutdown_req: Sending shutdown request to pcmk-ocd2c003
Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: handle_shutdown_request: Creating shutdown request for pcmk-ocd2c003 (state=S_POLICY_ENGINE)

That should mean it was/will be received on the other nodes too. However, there is no equivalent of:

Apr 13 16:31:17 [3447] ocd2c003.osinst.net crmd: info: crm_update_peer_expected: do_dc_release: Node pcmk-ocd2c003[3] - expected state is now down (was member)

I suspect that corosync simply dropped the message (you'll want to talk about this with Chrissie); the question is: what should we do about it? Possibly we need something like what the cib has at shutdown, where all the peers need to ack it (which allows us to assume that all messages have arrived) before it will go away.
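The overwrite described above can be seen by filtering the cib diff lines for the `@expected` attribute of the node_state entry and reading the values in order; the last write wins. A minimal sketch, using abbreviated sample lines taken from the log excerpts above:

```shell
# Hedged sketch: extract the sequence of @expected values written for
# node_state id='3'. The lines below are shortened copies of the log
# entries quoted in this comment.
log_sample=$(cat <<'EOF'
Apr 13 16:31:17 cib: + /cib/status/node_state[@id='3']: @crm-debug-origin=do_dc_release, @expected=down
Apr 13 16:31:20 cib: + /cib/status/node_state[@id='3']: @crmd=offline, @crm-debug-origin=peer_update_callback
Apr 13 16:31:20 cib: + /cib/status/node_state[@id='3']: @in_ccm=false, @crm-debug-origin=post_cache_update, @expected=member
EOF
)
# Print each @expected value in the order it was written.
echo "$log_sample" \
  | grep -o '@expected=[a-z]*' \
  | cut -d= -f2
```

The sequence comes out as "down" followed by "member": the do_dc_release update is clobbered by the later post_cache_update write, which is why the new DC still treated the node as an expected member.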
Capacity constrained, moving to 7.4
(In reply to Andrew Beekhof from comment #7)

> However you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and
> received on ocd2c003):
>
> Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info:
> do_shutdown_req: Sending shutdown request to pcmk-ocd2c003
> Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info:
> handle_shutdown_request: Creating shutdown request for pcmk-ocd2c003
> (state=S_POLICY_ENGINE)
>
> That should mean it was/will be received on the other nodes too.
> However there is no equivalent of:
>
> Apr 13 16:31:17 [3447] ocd2c003.osinst.net crmd: info:
> crm_update_peer_expected: do_dc_release: Node pcmk-ocd2c003[3] - expected
> state is now down (was member)
>
> I suspect that corosync simply dropped the message (you'll want to talk
> about this with Chrissie), the question is - what should we do about it.

Nope, it's simpler than that: the old DC was running pacemaker 1.1.12-22.el7_1.2, which didn't send shutdown requests to all peers. The possibility of exactly this issue is why that change was made in pacemaker 1.1.13 and z-streamed to RHEL 7.1 as part of 1.1.12-22.el7_1.3.

I hope that solves the mystery, and reassures the customer that it won't happen again on future upgrades.
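Given the analysis above, a pre-upgrade sanity check could compare the installed build against 1.1.12-22.el7_1.3, the first z-stream build with the shutdown-request broadcast fix. This is only a sketch: the installed version string is hard-coded for illustration (on a real node it would come from something like `rpm -q --qf '%{VERSION}-%{RELEASE}' pacemaker`).

```shell
# Hedged sketch: flag nodes whose pacemaker build predates the fix.
# The installed version here is the one from this bug report, hard-coded
# as sample data rather than queried from rpm.
installed="1.1.12-22.el7_1.2"
fixed="1.1.12-22.el7_1.3"      # first build that broadcasts shutdown requests
# sort -V orders version strings; the older build sorts first.
lowest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n1)
if [ "$installed" != "$fixed" ] && [ "$lowest" = "$installed" ]; then
    echo "affected: $installed predates the shutdown-broadcast fix"
else
    echo "ok: $installed includes the fix"
fi
```

With the version from this report, the check flags the node as affected, matching the conclusion that the old DC's build was the cause.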