Bug 1986998 - On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes
Summary: On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.5
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1859971 1887606 (view as bug list)
Depends On:
Blocks: 1983952 1989292 1989622
 
Reported: 2021-07-28 16:51 UTC by Michele Baldessari
Modified: 2021-11-10 01:05 UTC (History)
CC List: 8 users

Fixed In Version: pacemaker-2.1.0-6.el8
Doc Type: Bug Fix
Doc Text:
Cause: If the DC and another node leave the cluster at the same time, either node might be listed first in the notification from Corosync, and Pacemaker will process them in order.
Consequence: If the non-DC node is listed and processed first, its transient node attributes will not be cleared, leading to potential problems with resource agents or unfencing.
Fix: Pacemaker now sorts the Corosync notification so that the DC node is always first.
Result: Transient attributes are properly cleared when a node leaves the cluster, even if the DC leaves at the same time.
Clone Of:
Clones: 1989292 1989622 (view as bug list)
Environment:
Last Closed: 2021-11-09 18:44:54 UTC
Type: Bug
Target Upstream Version: 2.1.2
Embargoed:


Attachments: none


Links:
Red Hat Product Errata RHEA-2021:4267 (last updated 2021-11-09 18:45:16 UTC)

Comment 1 Michele Baldessari 2021-07-29 07:39:48 UTC
*** Bug 1859971 has been marked as a duplicate of this bug. ***

Comment 2 Ken Gaillot 2021-08-02 15:57:17 UTC
FYI, crm_attribute with transient attributes, and attrd_updater, return success as long as the request can be submitted to the attribute manager (pacemaker-attrd). There is an outstanding RFE (Bug 1463033) to support an option to wait for a particular synchronization point before returning success. These points would be something like: the request was accepted (current behavior); the request has completed in local memory; the request has been recorded in the local CIB; the request has been relayed to all nodes; all nodes have reported the change completed in local memory; all nodes have reported the change completed in the local CIB.

That is not the problem here (which still needs investigation) but it does complicate trying to confirm changes immediately after submitting them.
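The return-status behavior described above can be seen from the command line. This is only an illustrative sketch (the attribute name is an example, and the commands are guarded so they run only where the Pacemaker tools exist):

```shell
# Both crm_attribute and attrd_updater return 0 once pacemaker-attrd has
# accepted the request, not once the value has been written to the CIB on
# all nodes (the synchronization-point RFE is Bug 1463033).
if command -v attrd_updater >/dev/null 2>&1; then
    attrd_updater -n demo-attr -U 1 \
        && echo "accepted by pacemaker-attrd (CIB write may still be pending)"
    attrd_updater -n demo-attr -D      # clean up the demo attribute
else
    echo "attrd_updater not available; run on a cluster node"
fi
```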

Comment 3 Ken Gaillot 2021-08-02 17:34:52 UTC
You're a little too quick. This is an occurrence of an issue I ran into myself a couple months ago and was hoping to knock out for 8.6.

When a non-DC node leaves the cluster, the DC clears its transient attributes. If the DC leaves the cluster, all nodes clear the DC's transient attributes.

The problem can occur if both the DC and another node leave at the same time. Pacemaker processes the node exit notification list from Corosync one by one. If a non-DC node happens to be listed before the DC node, Pacemaker on the surviving node(s) will process the non-DC node exit first, and won't be aware yet that the DC has left, so it will assume the DC is handling the clearing for that node.

The fix should be straightforward: we need to sort the exit list so the DC is always first if present.
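The actual fix lives in Pacemaker's C code, but the idea of "sort the exit list so the DC is first" can be sketched in shell (function and node names are hypothetical, purely for illustration):

```shell
# Illustrative sketch only: reorder a membership-exit list so the DC, if it
# is in the list, is processed first, and the remaining nodes keep their
# original order.
sort_dc_first() {
    dc="$1"; shift
    found=""; rest=""
    for n in "$@"; do
        if [ "$n" = "$dc" ]; then
            found="$dc"
        else
            rest="$rest $n"
        fi
    done
    # Print the DC first (when it left too), then everyone else.
    echo "${found}${rest}" | sed 's/^ //'
}

sort_dc_first controller-2 controller-0 controller-2   # prints: controller-2 controller-0
```

With this ordering, the surviving nodes always learn of the DC's exit before handling any other departed node, so they know to clear that node's transient attributes themselves.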

However, I'd expect "crm_attribute --verbose -N $NODENAME -l reboot -D --name rmq-node-attr-rabbitmq" (whether by the agent or manually) to be a sufficient workaround. I'm not sure why that didn't work. I don't see any attempt to clear it in the logs, though. I also don't know why changing the bundle configuration would help. If you knew the time you attempted the manual crm_attribute, I could check the logs around then.

Will you need an 8.4.z? If so, we'll probably need to get an exception to get it into 8.5.

Comment 4 Michele Baldessari 2021-08-02 18:00:15 UTC
(In reply to Ken Gaillot from comment #3)
> You're a little too quick. This is an occurrence of an issue I ran into
> myself a couple months ago and was hoping to knock out for 8.6.

Aha, good to know!
 
> When a non-DC node leaves the cluster, the DC clears its transient
> attributes. If the DC leaves the cluster, all nodes clear the DC's transient
> attributes.
> 
> The problem can occur if both the DC and another node leave at the same
> time. Pacemaker processes the node exit notification list from Corosync one
> by one. If a non-DC node happens to be listed before the DC node, Pacemaker
> on the surviving node(s) will process the non-DC node exit first, and won't
> be aware yet that the DC has left, so it will assume the DC is handling the
> clearing for that node.
> 
> The fix should be straightforward, we need to sort the exit list so the DC
> is always first if present.
> 
> However, I'd expect "crm_attribute --verbose -N $NODENAME -l reboot -D
> --name rmq-node-attr-rabbitmq" (whether by the agent or manually) to be a
> sufficient workaround. I'm not sure why that didn't work. I don't see any
> attempt to clear it in the logs, though. I also don't know why changing the
> bundle configuration would help. If you knew the time you attempted the
> manual crm_attribute, I could check the logs around then.

We do not have the exact timing related to these sosreports anymore. But I think we can try and
infer it a bit. Hear me out:
- We have the following lines:

Jul 27 20:03:39 controller-0 rabbitmq-cluster(rabbitmq)[34486]: INFO: RabbitMQ server could not get cluster status from mnesia
Jul 27 20:03:52 controller-0 rabbitmq-cluster(rabbitmq)[37019]: INFO: RabbitMQ server could not get cluster status from mnesia 


Which are here https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster.in#L287 and correspond to:
rmq_monitor() {
	local rc

	status=$($RMQ_EVAL 'rabbit_mnesia:cluster_status_from_mnesia().' 2>&1)
	if echo "${status}" | grep -q '^{ok'; then
		pcs_running=$(rmq_join_list | wc -w)
		ocf_log debug "Pacemaker thinks ${pcs_running} RabbitMQ nodes are running"
		rmq_running=$($RMQ_EVAL 'length(mnesia:system_info(running_db_nodes)).')
		ocf_log debug "RabbitMQ thinks ${rmq_running} RabbitMQ nodes are running"

		if [ $(( $rmq_running * 2 )) -lt $pcs_running ]; then
			ocf_log info "RabbitMQ is a minority partition, failing monitor"
			rmq_delete_nodename
			return $OCF_ERR_GENERIC
		fi

		ocf_log debug "RabbitMQ server is running normally"
		rmq_write_nodename

		return $OCF_SUCCESS
	else
		ocf_log info "RabbitMQ server could not get cluster status from mnesia"
		ocf_log debug "${status}"
		rmq_delete_nodename
		return $OCF_NOT_RUNNING
	fi
}

So right after we print 'RabbitMQ server could not get cluster status from mnesia' we call rmq_delete_nodename() which does:
${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -D

So I'd say that at 20:03:39 and 20:03:52 we did call the above and it did not remove anything (as we subsequently rechecked by adding some logging in the RA).

Does that help?


> Will you need an 8.4.z? If so, we'll probably need to get an exception to
> get it into 8.5.

Yeah. OSP 16.2 is stuck on rhel 8.4 forever :/

Comment 5 Ken Gaillot 2021-08-02 20:29:55 UTC
(In reply to Michele Baldessari from comment #4)
> So right after we print 'RabbitMQ server could not get cluster status from
> mnesia' we call rmq_delete_nodename() which does:
> ${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name
> "$RMQ_CRM_ATTR_COOKIE" -D
> 
> So I'd say at 20:03:39 and 20:03:52 we did call the above and it did not
> remove anything (as we subsequently rechecked by adding some logging in the
> RA)

I suspect this is what happened:

1. The cluster failed to clear the attributes from the CIB when the nodes left. However, the attributes were successfully cleared from the surviving node's memory.

2. When the agent or manual command tries to delete the attribute, the attribute manager doesn't see anything in memory to delete, and so does nothing.

If that's correct, a manual workaround should be to *set* the attribute to any value, *then* delete it. Another possibility might be to manually run "attrd_updater --refresh" to force the attribute manager to write out all attributes, but I'm not sure that would fix the issue.
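The set-then-delete workaround can be sketched as follows (the node name is an example and the attribute name comes from the RA in comment 4; the commands are guarded so they run only on a cluster node):

```shell
NODENAME="${NODENAME:-controller-0}"      # example node; adjust for your cluster
ATTR="rmq-node-attr-rabbitmq"             # attribute name used by the rabbitmq RA
if command -v crm_attribute >/dev/null 2>&1; then
    # Re-set the attribute so pacemaker-attrd has it in memory again...
    crm_attribute -N "$NODENAME" -l reboot --name "$ATTR" -v 1
    # ...then delete it, which should now also clear the stale CIB entry.
    crm_attribute -N "$NODENAME" -l reboot --name "$ATTR" -D
else
    echo "crm_attribute not available; run on a cluster node"
fi
```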

> 
> Does that help?
> 
> 
> > Will you need an 8.4.z? If so, we'll probably need to get an exception to
> > get it into 8.5.
> 
> Yeah. OSP 16.2 is stuck on rhel 8.4 forever :/

Comment 8 Ken Gaillot 2021-08-10 16:34:02 UTC
Fixed upstream as of commit ee7eba6

Comment 9 Ken Gaillot 2021-08-11 16:26:47 UTC
QA: RHOSP QA will test the 8.4.z equivalent of this bz, so this bz can be tested for regressions only.

If you do want to reproduce it, it's straightforward:
1. Configure a cluster with at least 5 nodes (so quorum is retained if 2 are lost).
2. Choose two nodes: the DC node and a node with a lower Corosync node ID (if the DC has the lowest ID, just restart the cluster on that node, and another node will be elected DC).
3. Set a transient attribute on the non-DC node that was selected.
4. Kill both the nodes, and wait for the cluster to fence them.

Before this change, the transient attribute on the non-DC node will persist across the reboot of that node. After this change, it will not.
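The steps above can be sketched as commands (node and attribute names are examples; this needs a live cluster of at least 5 nodes, so the commands are guarded):

```shell
if command -v pcs >/dev/null 2>&1; then
    pcs status | grep "Current DC"        # step 2a: identify the DC
    corosync-cmapctl | grep "nodelist"    # step 2b: find a node with a lower nodeid
    # step 3: set a transient attribute on the chosen non-DC node
    crm_attribute -N node2 -l reboot --name test-attr -v 1
    # step 4: hard-reset both chosen nodes (e.g. 'echo b > /proc/sysrq-trigger'
    # on each) and wait for fencing. Afterwards, with the fix, this query
    # should report the attribute as not found:
    crm_attribute -N node2 -l reboot --name test-attr -G
else
    echo "cluster tools not available; run on a cluster node"
fi
```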

Comment 13 Michele Baldessari 2021-08-19 08:24:06 UTC
*** Bug 1887606 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2021-11-09 18:44:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267

