Bug 1986998
Summary: | On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | high | Priority: | high |
Version: | 8.4 | CC: | cfeist, cluster-maint, eolivare, jeckersb, lmiccini, msmazova, phagara, pkomarov |
Target Milestone: | rc | Keywords: | Triaged, ZStream |
Target Release: | 8.5 | Target Upstream Version: | 2.1.2 |
Hardware: | Unspecified | OS: | Unspecified |
Fixed In Version: | pacemaker-2.1.0-6.el8 | Doc Type: | Bug Fix |
Clone Of: | | : | 1989292, 1989622 |
Last Closed: | 2021-11-09 18:44:54 UTC | Type: | Bug |
Bug Blocks: | 1983952, 1989292, 1989622 | | |

Doc Text:

Cause: If the DC and another node leave the cluster at the same time, either node might be listed first in the notification from Corosync, and Pacemaker will process them in order.
Consequence: If the non-DC node is listed and processed first, its transient node attributes will not be cleared, leading to potential problems with resource agents or unfencing.
Fix: Pacemaker now sorts the Corosync notification so that the DC node is always first.
Result: Transient attributes are properly cleared when a node leaves the cluster, even if the DC leaves at the same time.
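For context on the Doc Text above: transient ("reboot") node attributes are kept both in pacemaker-attrd's memory and in the status section of the CIB. A hedged sketch of how one might look at them on a live cluster (node and attribute names here are examples only, not taken from this bug):

    # Which node is currently DC?
    crm_mon -1 | grep -i "current dc"

    # A transient attribute as recorded in the CIB (status section);
    # node1 and my-test-attr are placeholder names
    crm_attribute --query -N node1 -l reboot --name my-test-attr

    # The same attribute as seen by the attribute manager (pacemaker-attrd)
    attrd_updater --query --all --name my-test-attr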
Comment 1
Michele Baldessari
2021-07-29 07:39:48 UTC
FYI, crm_attribute with transient attributes, and attrd_updater, return success as long as the request can be submitted to the attribute manager (pacemaker-attrd). There is an outstanding RFE (Bug 1463033) to support an option to wait for a particular synchronization point before returning success. These points would be something like: the request was accepted (current behavior); the request has completed in local memory; the request has been recorded in the local CIB; the request has been relayed to all nodes; all nodes have reported the change completed in local memory; all nodes have reported the change completed in the local CIB. That is not the problem here (which still needs investigation), but it does complicate trying to confirm changes immediately after submitting them.

Comment 3
Ken Gaillot

You're a little too quick. This is an occurrence of an issue I ran into myself a couple months ago and was hoping to knock out for 8.6.

When a non-DC node leaves the cluster, the DC clears its transient attributes. If the DC leaves the cluster, all nodes clear the DC's transient attributes.

The problem can occur if both the DC and another node leave at the same time. Pacemaker processes the node exit notification list from Corosync one by one. If a non-DC node happens to be listed before the DC node, Pacemaker on the surviving node(s) will process the non-DC node exit first, and won't be aware yet that the DC has left, so it will assume the DC is handling the clearing for that node.

The fix should be straightforward: we need to sort the exit list so the DC is always first if present.

However, I'd expect "crm_attribute --verbose -N $NODENAME -l reboot -D --name rmq-node-attr-rabbitmq" (whether run by the agent or manually) to be a sufficient workaround. I'm not sure why that didn't work; I don't see any attempt to clear it in the logs. I also don't know why changing the bundle configuration would help. If you knew the time you attempted the manual crm_attribute, I could check the logs around then.

Will you need an 8.4.z? If so, we'll probably need to get an exception to get it into 8.5.
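A rough sketch of trying that workaround by hand (the attribute name is the one from this report; the node name is a placeholder):

    # Placeholder node name; rmq-node-attr-rabbitmq is the attribute discussed above
    NODENAME=controller-1

    # Attempt the same delete the resource agent performs
    crm_attribute --verbose -N "$NODENAME" -l reboot -D --name rmq-node-attr-rabbitmq

    # Then check whether it is really gone from the CIB
    crm_attribute --query -N "$NODENAME" -l reboot --name rmq-node-attr-rabbitmq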
Comment 4
Michele Baldessari

(In reply to Ken Gaillot from comment #3)
> You're a little too quick. This is an occurrence of an issue I ran into
> myself a couple months ago and was hoping to knock out for 8.6.

Aha, good to know!

> When a non-DC node leaves the cluster, the DC clears its transient
> attributes. If the DC leaves the cluster, all nodes clear the DC's transient
> attributes.
>
> The problem can occur if both the DC and another node leave at the same
> time. Pacemaker processes the node exit notification list from Corosync one
> by one. If a non-DC node happens to be listed before the DC node, Pacemaker
> on the surviving node(s) will process the non-DC node exit first, and won't
> be aware yet that the DC has left, so it will assume the DC is handling the
> clearing for that node.
>
> The fix should be straightforward, we need to sort the exit list so the DC
> is always first if present.
>
> However, I'd expect "crm_attribute --verbose -N $NODENAME -l reboot -D
> --name rmq-node-attr-rabbitmq" (whether by the agent or manually) to be a
> sufficient workaround. I'm not sure why that didn't work. I don't see any
> attempt to clear it in the logs, though. I also don't know why changing the
> bundle configuration would help. If you knew the time you attempted the
> manual crm_attribute, I could check the logs around then.

We do not have the exact timing related to these sosreports anymore, but I think we can try to infer it a bit.

Hear me out. We have the following lines:

    Jul 27 20:03:39 controller-0 rabbitmq-cluster(rabbitmq)[34486]: INFO: RabbitMQ server could not get cluster status from mnesia
    Jul 27 20:03:52 controller-0 rabbitmq-cluster(rabbitmq)[37019]: INFO: RabbitMQ server could not get cluster status from mnesia

These come from https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster.in#L287 and correspond to:

    rmq_monitor() {
        local rc

        status=$($RMQ_EVAL 'rabbit_mnesia:cluster_status_from_mnesia().' 2>&1)
        if echo "${status}" | grep -q '^{ok'; then
            pcs_running=$(rmq_join_list | wc -w)
            ocf_log debug "Pacemaker thinks ${pcs_running} RabbitMQ nodes are running"
            rmq_running=$($RMQ_EVAL 'length(mnesia:system_info(running_db_nodes)).')
            ocf_log debug "RabbitMQ thinks ${rmq_running} RabbitMQ nodes are running"

            if [ $(( $rmq_running * 2 )) -lt $pcs_running ]; then
                ocf_log info "RabbitMQ is a minority partition, failing monitor"
                rmq_delete_nodename
                return $OCF_ERR_GENERIC
            fi

            ocf_log debug "RabbitMQ server is running normally"
            rmq_write_nodename

            return $OCF_SUCCESS
        else
            ocf_log info "RabbitMQ server could not get cluster status from mnesia"
            ocf_log debug "${status}"
            rmq_delete_nodename
            return $OCF_NOT_RUNNING
        fi
    }

So right after we print 'RabbitMQ server could not get cluster status from mnesia' we call rmq_delete_nodename(), which does:

    ${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -D

So I'd say that at 20:03:39 and 20:03:52 we did call the above and it did not remove anything (as we subsequently rechecked by adding some logging in the RA).

Does that help?

> Will you need an 8.4.z? If so, we'll probably need to get an exception to
> get it into 8.5.

Yeah. OSP 16.2 is stuck on RHEL 8.4 forever :/

Ken Gaillot

(In reply to Michele Baldessari from comment #4)
> So right after we print 'RabbitMQ server could not get cluster status from
> mnesia' we call rmq_delete_nodename() which does:
> ${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name
> "$RMQ_CRM_ATTR_COOKIE" -D
>
> So I'd say at 20:03:39 and 20:03:52 we did call the above and it did not
> remove anything (as we subsequently rechecked by adding some logging in the
> RA)

I suspect this is what happened:

1. The cluster failed to clear the attributes from the CIB when the nodes left. However, the attributes were successfully cleared from the surviving node's memory.
2. When the agent or manual command tries to delete the attribute, the attribute manager doesn't see anything in memory to delete, and so does nothing.

If that's correct, a manual workaround should be to *set* the attribute to any value, *then* delete it. Another possibility might be to manually run "attrd_updater --refresh" to force the attribute manager to write out all attributes, but I'm not sure that would fix the issue.

> Does that help?
>
> > Will you need an 8.4.z? If so, we'll probably need to get an exception to
> > get it into 8.5.
>
> Yeah. OSP 16.2 is stuck on rhel 8.4 forever :/

Fixed upstream as of commit ee7eba6

QA: RHOSP QA will test the 8.4.z equivalent of this bz, so this bz can be tested for regressions only. If you do want to reproduce it, it's straightforward (a command-level sketch follows the list):

1. Configure a cluster with at least 5 nodes (so quorum is retained if 2 are lost).
2. Choose two nodes: the DC node and a node with a lower Corosync node ID (if the DC has the lowest ID, just restart the cluster on that node, and another node will be elected DC).
3. Set a transient attribute on the non-DC node that was selected.
4. Kill both nodes and wait for the cluster to fence them.
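A sketch of those steps as commands, assuming hypothetical node names node1..node5 with node3 as DC and a made-up attribute name:

    # 1. Identify the DC and the Corosync node IDs
    crm_mon -1 | grep -i "current dc"
    corosync-cmapctl | grep nodelist

    # 2. Pick a non-DC node whose Corosync node ID is lower than the DC's (here: node1)

    # 3. Set a transient (reboot) attribute on that node
    crm_attribute -N node1 -l reboot --name test-attr --update 1

    # 4. Hard-reset node1 and the DC at roughly the same time, e.g. on each node:
    #      echo b > /proc/sysrq-trigger
    #    then wait for the cluster to fence them.

    # After both nodes rejoin, check whether the attribute survived
    attrd_updater --query --all --name test-attr
    crm_attribute --query -N node1 -l reboot --name test-attr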
Before this change, the transient attribute on the non-DC node will persist across the reboot of that node. After this change, it will not.

*** Bug 1887606 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267