Bug 1327677 - Pacemaker node in standby state is fenced during cluster stop operation
Summary: Pacemaker node in standby state is fenced during cluster stop operation
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Target Milestone: rc
: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-15 15:33 UTC by Matt Flusche
Modified: 2019-11-14 07:48 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-16 22:12:53 UTC
Target Upstream Version:



Description Matt Flusche 2016-04-15 15:33:50 UTC
Description of problem:
I'm trying to understand why this fence event occurred (Apr 13 16:31:22 -- log time).

The pacemaker node is in standby mode and is fenced during cluster stop operation.

The fenced node is running: pacemaker-1.1.12-22.el7_1.2.x86_64

It was being stopped to apply patches including upgrading to: pacemaker-1.1.13-10.el7_2.2.x86_64

pcmk-ocd2c003 is the node that was fenced; this node was also the active DC.

pcmk-ocd2c003 was put into standby, and we then waited until all resources were in the stopped state.

A few seconds after a cluster stop was executed on pcmk-ocd2c003, it was fenced.

pcmk-ocd2c002 becomes the active DC.  In the log I see that it reports pcmk-ocd2c003 as unclean and then fences it.  It also appears from the log that the new DC believes there are still active resources on pcmk-ocd2c003.

I will link pacemaker debug logs from pcmk-ocd2c002 and pcmk-ocd2c003 for review.

Thanks for your help.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.2.x86_64
Cluster being upgraded to 1.1.13-10.el7_2.2.x86_64

How reproducible:
We attempted to reproduce the issue but were unable to.


Steps to Reproduce:
1.
2.
3.

Actual results:
cluster node is fenced


Expected results:
cluster node shuts down without being fenced

Additional info:

Will provide link to debug logs

Comment 4 Andrew Beekhof 2016-04-19 04:29:12 UTC
Could I get a copy of this file please?

Apr 13 16:31:22 [3927] ocd2c002.osinst.net    pengine:  warning: process_pe_message:	Calculated Transition 0: /var/lib/pacemaker/pengine/pe-warn-13.bz2

Comment 5 Matt Flusche 2016-04-19 13:08:45 UTC
I have requested the file from the customer.

Comment 7 Andrew Beekhof 2016-04-20 01:03:43 UTC
Ken:

The old DC correctly indicates that it intends to leave the cluster (note the origin):

Apr 13 16:31:17 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: --- 0.451.105 2
Apr 13 16:31:17 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: +++ 0.451.106 (null)
Apr 13 16:31:17 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib:  @num_updates=106
Apr 13 16:31:17 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib/status/node_state[@id='3']:  @crm-debug-origin=do_dc_release, @expected=down
Apr 13 16:31:17 [3923] ocd2c002.osinst.net        cib:     info: cib_process_request:	Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c003/crmd/3142, version=0.451.106)
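The Diff lines above carry the CIB's three-part version, admin_epoch.epoch.num_updates: configuration changes bump the epoch, while status-only changes like these bump only num_updates. A node decides whether an incoming diff is newer by comparing the tuple left to right. A minimal illustrative sketch of that comparison (not Pacemaker's actual code):

```python
def cib_version(s):
    """Parse a CIB version string like '0.451.106' into a comparable
    (admin_epoch, epoch, num_updates) tuple."""
    admin_epoch, epoch, num_updates = (int(part) for part in s.split("."))
    return (admin_epoch, epoch, num_updates)

# The diff above moves the CIB from 0.451.105 to 0.451.106: only
# num_updates changes, because only the status section was modified.
assert cib_version("0.451.106") > cib_version("0.451.105")

# A configuration change bumps the epoch and resets num_updates; the
# result still compares as newer because tuples compare left to right.
assert cib_version("0.452.15") > cib_version("0.451.106")
```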

And the new DC correctly notices and updates the crmd status:

Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: --- 0.452.15 2
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: +++ 0.452.16 (null)
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib:  @num_updates=16
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib/status/node_state[@id='3']:  @crmd=offline, @crm-debug-origin=peer_update_callback
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_process_request:	Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/620, version=0.452.16)


However it then goes and overwrites the expected status:

Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: --- 0.452.16 2
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	Diff: +++ 0.452.17 (null)
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib:  @num_updates=17
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_perform_op:	+  /cib/status/node_state[@id='3']:  @in_ccm=false, @crm-debug-origin=post_cache_update, @expected=member
Apr 13 16:31:20 [3923] ocd2c002.osinst.net        cib:     info: cib_process_request:	Completed cib_modify operation for section status: OK (rc=0, origin=pcmk-ocd2c002/crmd/625, version=0.452.17)
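Putting those status updates together: the new DC ends up seeing a node that has left the membership (in_ccm=false) with its crmd offline, but whose expected state is still member, and that combination reads as an unclean failure rather than a clean shutdown. A simplified, hypothetical sketch of that decision (not Pacemaker's actual determine_online_status logic):

```python
def node_needs_fencing(node_state):
    """Decide from node_state attributes whether a departed node looks
    like a clean shutdown or an unclean failure (simplified sketch)."""
    in_ccm = node_state.get("in_ccm") == "true"
    crmd_online = node_state.get("crmd") == "online"
    expected = node_state.get("expected")

    if in_ccm and crmd_online:
        return False   # node is still up
    if expected == "down":
        return False   # clean, announced departure
    return True        # left while still expected to be a member: unclean

# What the new DC saw after the expected=member overwrite above:
assert node_needs_fencing({"in_ccm": "false", "crmd": "offline",
                           "expected": "member"}) is True

# What it would have seen had the do_dc_release update been preserved:
assert node_needs_fencing({"in_ccm": "false", "crmd": "offline",
                           "expected": "down"}) is False
```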

This matches the crmd's internal view of the state:

Apr 13 16:31:17 [3928] ocd2c002.osinst.net       crmd:     info: crm_update_peer_expected:	update_dc: Node pcmk-ocd2c002[1] - expected state is now member (was (null))
Apr 13 16:31:17 [3928] ocd2c002.osinst.net       crmd:     info: crm_update_peer_expected:	do_dc_join_filter_offer: Node pcmk-ocd2c004[2] - expected state is now member (was (null))

However you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and received on ocd2c003):

Apr 13 16:31:16 [3447] ocd2c003.osinst.net       crmd:     info: do_shutdown_req: 	Sending shutdown request to pcmk-ocd2c003
Apr 13 16:31:16 [3447] ocd2c003.osinst.net       crmd:     info: handle_shutdown_request: 	Creating shutdown request for pcmk-ocd2c003 (state=S_POLICY_ENGINE)

That should mean it was/will be received on the other nodes too.
However there is no equivalent of:

Apr 13 16:31:17 [3447] ocd2c003.osinst.net       crmd:     info: crm_update_peer_expected: 	do_dc_release: Node pcmk-ocd2c003[3] - expected state is now down (was member)

I suspect that corosync simply dropped the message (you'll want to talk about this with Chrissie); the question is, what should we do about it?

Possibly we need something like what the cib has at shutdown, where all the peers need to ack it (which allows us to assume that all messages have arrived) before it will go away.
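The ack-based scheme suggested above could look like this: the departing node broadcasts its shutdown request and refuses to leave until every peer has acknowledged it, so a dropped message surfaces as a missing ack instead of a stale expected=member entry. A hypothetical sketch, not the cib's actual implementation:

```python
class ShutdownTracker:
    """Track shutdown acknowledgements from peers; the departing node
    only exits once every peer has confirmed it saw the request."""

    def __init__(self, peers):
        self.pending = set(peers)

    def ack(self, peer):
        self.pending.discard(peer)

    def may_exit(self):
        return not self.pending

tracker = ShutdownTracker(["pcmk-ocd2c002", "pcmk-ocd2c004"])
tracker.ack("pcmk-ocd2c002")
assert not tracker.may_exit()   # pcmk-ocd2c004 has not acked: keep waiting

tracker.ack("pcmk-ocd2c004")
assert tracker.may_exit()       # all peers saw the request; safe to leave
```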

Comment 8 Ken Gaillot 2016-07-05 21:01:19 UTC
Capacity constrained, moving to 7.4

Comment 9 Ken Gaillot 2016-08-16 22:12:53 UTC
(In reply to Andrew Beekhof from comment #7)
> However you can see that the CRM_OP_SHUTDOWN_REQ message was sent (and
> received on ocd2c003):
> 
> Apr 13 16:31:16 [3447] ocd2c003.osinst.net       crmd:     info:
> do_shutdown_req: 	Sending shutdown request to pcmk-ocd2c003
> Apr 13 16:31:16 [3447] ocd2c003.osinst.net       crmd:     info:
> handle_shutdown_request: 	Creating shutdown request for pcmk-ocd2c003
> (state=S_POLICY_ENGINE)
> 
> That should mean it was/will be received on the other nodes too.
> However there is no equivalent of:
> 
> Apr 13 16:31:17 [3447] ocd2c003.osinst.net       crmd:     info:
> crm_update_peer_expected: 	do_dc_release: Node pcmk-ocd2c003[3] - expected
> state is now down (was member)
> 
> I suspect that corosync simply dropped the message (you'll want to talk
> about this with Chrissie), the question is - what should we do about it.

Nope, it's simpler than that: the old DC was running pacemaker 1.1.12-22.el7_1.2, which didn't send shutdown requests to all peers. The possibility of exactly this issue is why that change was made in pacemaker 1.1.13 and z-streamed to RHEL 7.1 as part of 1.1.12-22.el7_1.3.

I hope that solves the mystery, and reassures the customer that it won't happen again on future upgrades.

