Bug 1464068 - [GANESHA] pcs status shows all nodes in started state for ~15 mins even when hit "partition WITHOUT quorum" with IO's still resuming
Summary: [GANESHA] pcs status shows all nodes in started state for ~15 mins even when hit "partition WITHOUT quorum" with IO's still resuming
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.5
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1463992 1481140
 
Reported: 2017-06-22 11:24 UTC by Manisha Saini
Modified: 2018-04-10 15:32 UTC
CC: 16 users

Fixed In Version: pacemaker-1.1.18-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Previously, quorum loss did not trigger Pacemaker to recheck resource placement. As a consequence, in certain situations Pacemaker required a long time, up to the cluster recheck interval, before stopping resources after quorum loss. This happened only when several conditions were met: a node that was correctly shutting down dropped the cluster below the quorum; that node was not running any resources at the time; and a cluster transition was already in progress. With this update, Pacemaker always cancels the current transition when quorum is lost and recalculates resource placement immediately. As a result, the long delay no longer occurs.
Clone Of: 1463992
: 1481140
Environment:
Last Closed: 2018-04-10 15:30:29 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHEA-2018:0860 (last updated 2018-04-10 15:32:11 UTC)

Comment 4 Ken Gaillot 2017-06-23 17:07:42 UTC
This does appear to be a bug. Investigating whether it's a regression and how to fix it.

What appears to be happening is that quorum is lost while a transition is in progress (that is, the cluster is in the middle of executing a set of actions based on the previous state), and a new transition is not immediately triggered, as I would expect it to be.

Comment 5 Manisha Saini 2017-06-27 11:27:03 UTC

The issue is also observed on RHEL 7.3 and RHGS 3.2.

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.3 (Maipo)


# rpm -qa | grep ganesha
nfs-ganesha-2.4.1-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-11.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-18.4.el7rhgs.x86_64

Comment 6 Ken Gaillot 2017-06-27 14:32:12 UTC
Preliminary investigation suggests that quorum loss due to a clean node shutdown has never triggered an immediate recheck of resource placement, though it clearly should. I'm not sure why this hasn't been more of an issue before, so I'm still investigating whether anything changed recently to make this more likely to have an effect.

We're past the deadlines to make it into 7.4 GA, but I will ask for a z-stream.

Comment 7 Ken Gaillot 2017-06-27 21:33:45 UTC
Fix is upstream as of commit 0b68905

The issue only occurs under a fairly narrow set of circumstances:
- A node cleanly shutting down drops the cluster below quorum
- The node was not running any resources at the time (e.g. it was in standby mode)
- A transition was in progress

Comment 18 Ken Gaillot 2017-08-15 15:19:24 UTC
Testing procedure:

1. Configure a cluster of at least three nodes, one dummy resource that takes a long time to stop, and at least one other resource.

2. Stop enough nodes so that the cluster is one node away from losing quorum.

3. Put one of the remaining nodes in standby, and wait until it has no resources running on it.

4. Disable the dummy resource so that it initiates a stop, and before it completes the stop, shut down the standby node.

Before the change, the cluster will not stop the remaining resource(s) on the active node(s) until the next cluster-recheck-interval. After the change, the cluster will immediately stop all remaining resources.
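
A minimal sketch of this procedure as pcs commands, assuming a three-node cluster with hypothetical node names node1-node3 and resource names slow-dummy/other-dummy (none of these names are taken from this bug):

# pcs property set cluster-recheck-interval=3600s
# pcs resource create slow-dummy ocf:pacemaker:Dummy op_sleep=60 op start timeout=90s op stop timeout=90s
# pcs resource create other-dummy ocf:pacemaker:Dummy
# pcs cluster stop node3          # leaves 2 of 3 nodes: one more loss drops quorum
# pcs cluster standby node1       # then wait until node1 runs no resources
# pcs resource disable slow-dummy # starts the slow stop, so a transition is in progress
# pcs cluster stop node1          # clean shutdown of the standby node drops quorum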

Comment 20 Patrik Hagara 2017-12-21 12:22:26 UTC
Unable to reproduce the issue using provided procedure (1.1.16-12.el7). I've set the cluster-recheck-interval attribute to 3600 seconds and created two ocf:pacemaker:Dummy resources, one of them has op_sleep attribute set to 60 seconds (and all operation timeouts adjusted to 90s). Both resources stop immediately upon quorum loss no matter how the standby node gets shut down (panic, clean system shutdown, pcs cluster stop). Is there any other condition required to trigger this bug?

Comment 21 Ken Gaillot 2017-12-21 21:59:52 UTC
(In reply to Patrik Hagara from comment #20)
> Unable to reproduce the issue using provided procedure (1.1.16-12.el7). I've
> set the cluster-recheck-interval attribute to 3600 seconds and created two
> ocf:pacemaker:Dummy resources, one of them has op_sleep attribute set to 60
> seconds (and all operation timeouts adjusted to 90s). Both resources stop
> immediately upon quorum loss no matter how the standby node gets shut down
> (panic, clean system shutdown, pcs cluster stop). Is there any other
> condition required to trigger this bug?

That's surprising, I thought this one was pretty reliable.

I'm guessing something else must be happening in your cluster at the same time as quorum loss. Can you attach logs? The only thing I can think of is to make sure record-pending=false (the default).

BTW the proper behavior is expected if the standby node is panicked and fenced. It's only a clean quorum loss that triggers the behavior.
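
For reference, the operation default mentioned above can be checked or set with pcs (a sketch using standard pcs commands, not taken from this bug):

# pcs resource op defaults                       # list current operation defaults
# pcs resource op defaults record-pending=false  # explicitly set the default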

Comment 23 Ken Gaillot 2018-01-05 15:35:16 UTC
I think I may have confused two bugs when describing the reproducer. Try it again without the slow resource.

Comment 24 Patrik Hagara 2018-01-15 14:43:23 UTC
Managed to reproduce with a 7-node cluster on 1.1.16-8.el7-94ff4df (stonith disabled, cluster-recheck-interval set to a high value and a single ocf:heartbeat:Dummy resource).

The first clean node shutdown did not trigger the bug; the trick was to:
1) "pcs cluster stop" 3 out of 7 nodes
2) put another one into standby
3) cleanly shut down the standby node
4) "pcs cluster start" one of the stopped nodes
5) put that one into standby
6) and then cleanly shut it down

The dummy resource will stay in the "Started" role until the next cluster recheck.
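
For reference, a sketch of the sequence above as pcs commands, assuming hypothetical node names node1-node7 (node5-node7 being the nodes stopped first) and using "pcs cluster stop" for the clean shutdowns:

# pcs cluster stop node5 node6 node7   # 1) 4 of 7 nodes remain, exactly at quorum
# pcs cluster standby node4            # 2)
# pcs cluster stop node4               # 3) first clean quorum loss, bug not triggered
# pcs cluster start node5              # 4) quorum regained
# pcs cluster standby node5            # 5)
# pcs cluster stop node5               # 6) second clean quorum loss leaves the dummy resource started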

Same steps on 1.1.18-9.el7-2b07d5c5a9 result in the dummy resource getting stopped immediately after quorum loss. Marking verified.

Comment 27 errata-xmlrpc 2018-04-10 15:30:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860

