| Summary: | Pacemaker node in standby state is fenced during cluster stop operation | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Matt Flusche <mflusche> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.2 | CC: | abeekhof, cluster-maint |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-08-16 22:12:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Matt Flusche
2016-04-15 15:33:50 UTC
Could I get a copy of this file please? Apr 13 16:31:22 [3927] ocd2c002.osinst.net pengine: warning: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-13.bz2 I have requested the file from the customer. Ken: The old DC correctly indicates that it intends to leave the cluster (note the origin): Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.451.105 2 Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.451.106 (null) Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=106 Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @crm-debug-origin=do_dc_release, @expected=down Apr 13 16:31:17 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c003/crmd/3142, version=0.451.106) And the new DC correctly notices and updates the crmd status: Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.452.15 2 Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.452.16 (null) Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=16 Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @crmd=offline, @crm-debug-origin=peer_update_callback Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/620, version=0.452.16) However it then goes and overwrites the expected status: Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: --- 0.452.16 2 Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: Diff: +++ 0.452.17 (null) Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib: @num_updates=17 Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_perform_op: + /cib/status/node_state[@id='3']: @in_ccm=false, @crm-debug-origin=post_cache_update, @expected=member Apr 13 16:31:20 [3923] ocd2c002.osinst.net cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/625, version=0.452.17) This matches the crmd's internal view of the state: Apr 13 16:31:17 [3928] ocd2c002.osinst.net crmd: info: crm_update_peer_expected: update_dc: Node pcmk-ocd2c002[1] - expected state is now member (was (null)) Apr 13 16:31:17 [3928] ocd2c002.osinst.net crmd: info: crm_update_peer_expected: do_dc_join_filter_offer: Node pcmk-ocd2c004[2] - expected state is now member (was (null)) However you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and received on ocd2c003): Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: do_shutdown_req: Sending shutdown request to pcmk-ocd2c003 Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: handle_shutdown_request: Creating shutdown request for pcmk-ocd2c003 (state=S_POLICY_ENGINE) That should mean it was/will be received on the other nodes too. However there is no equivalent of: Apr 13 16:31:17 [3447] ocd2c003.osinst.net crmd: info: crm_update_peer_expected: do_dc_release: Node pcmk-ocd2c003[3] - expected state is now down (was member) I suspect that corosync simply dropped the message (you'll want to talk about this with Chrissie), the question is - what should we do about it. Possibly we need something like what the cib has at shutdown, where all the peers need to ack it (which allows us to assume that all messages have arrvied) before it will go away. Capacity constrained, moving to 7.4 (In reply to Andrew Beekhof from comment #7) > However you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and > received on ocd2c003): > > Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: > do_shutdown_req: Sending shutdown request to pcmk-ocd2c003 > Apr 13 16:31:16 [3447] ocd2c003.osinst.net crmd: info: > handle_shutdown_request: Creating shutdown request for pcmk-ocd2c003 > (state=S_POLICY_ENGINE) > > That should mean it was/will be received on the other nodes too. > However there is no equivalent of: > > Apr 13 16:31:17 [3447] ocd2c003.osinst.net crmd: info: > crm_update_peer_expected: do_dc_release: Node pcmk-ocd2c003[3] - expected > state is now down (was member) > > I suspect that corosync simply dropped the message (you'll want to talk > about this with Chrissie), the question is - what should we do about it. Nope, it's simpler than that: the old DC was running pacemaker 1.1.12-22.el7_1.2, which didn't send shutdown requests to all peers. The possibility of this issue here is why that change was made in pacemaker 1.1.13 and z-streamed to RHEL 7.1 as part of 1.1.12-22.el7_1.3. I hope that solves the mystery, and reassures the customer that it won't happen again on future upgrades. |