1419548 – premature standby status reported by pcs while standby is in progress


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	7.7
Assignee:	Klaus Wenninger
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1420851

Reported:	2017-02-06 13:23 UTC by Josef Zimek
Modified:	2020-03-05 11:32 UTC
CC List:	9 users
Fixed In Version:	pacemaker-1.1.20-1.el7
Doc Type:	No Doc Update
Doc Text:	Minor change
Clone Of:
Clones:	1619253
Environment:
Last Closed:	2019-08-06 12:53:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3336721	0	None	None	None	2018-08-06 17:34:48 UTC
Red Hat Product Errata	RHBA-2019:2129	0	None	None	None	2019-08-06 12:54:08 UTC

Description Josef Zimek 2017-02-06 13:23:42 UTC

Description of problem:


After `pcs cluster standby <node>`, `pcs status` reports the node as standby even though some resources are still stopping. This can lead to a situation where the node is reported as standby and is later fenced, e.g. due to a failed filesystem unmount with on-fail=fence. In that case the `pcs status` output is misleading: the user may start other activities, such as a system update, because the reported state did not reflect what the cluster was still doing.

Having a separate status for standby in progress (pending-standby) would help to identify the actual state of the cluster.


Version-Release number of selected component (if applicable):

pacemaker-1.1.15-11.el7_3.2.x86_64




Steps to Reproduce:

1. Set up a cluster with a filesystem resource configured with on-fail=fence.
2. Make the filesystem busy, e.g. with a dd copy.
3. Put the node in standby and check the status with `pcs status`.
4. `pcs status` reports the node as standby, but the filesystem unmount fails and the node is subsequently fenced.




Actual results:

`pcs status` reports that the node is in standby mode even though some of its resources are still stopping.



Expected results:

`pcs status` should return pending-standby or similar if not all actions to complete standby are finished. Otherwise an operator will assume standby is achieved and commence work incompatible with a fencing situation (such as patching).


Additional info:

Comment 1 Ken Gaillot 2017-02-06 16:01:14 UTC

From the cluster's point of view, the node *is* in standby (not in the process of going to standby), and moving the resources is the appropriate *reaction* to the node being in standby (not a direct part of being in standby).

Have you tried "pcs node standby --wait"? I believe it should already do what you want.

Comment 4 Ken Gaillot 2017-02-09 16:44:28 UTC

I think the closest we could come to this would be to list a node as "standby with active resources".

I know from a human perspective, going into standby seems like a process that completes when resources are no longer running on the node. But that's too fuzzy for the code. For example, it is possible to configure recurring monitors with role=Stopped. Such a monitor allows the cluster to detect a service that is running somewhere it isn't supposed to be, and stop it. If a node is put into standby, normal monitors will no longer run, but role=Stopped monitors will. So to the cluster, moving resources off the node isn't something that just happens once, when standby is initiated. It is something that is enforced the entire time the node is in standby. At no point does the cluster consider that process "done". It simply sees the node in standby state, and adjusts services appropriately.

To clarify, the default wait time for pcs resource --wait is 60 minutes. That is just to prevent an infinite wait if the cluster is in some buggy loop. In practice, it converges quickly. The --wait option does not look for anything in particular; it simply waits until the cluster no longer has actions to perform, so it is really useful in this kind of "fuzzy" situation.

If "standby with active resources" would be useful, we can look into doing that.
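The "wait until the cluster no longer has actions to perform" behavior described above can be illustrated with a small polling loop. This is only a sketch under stated assumptions, not how pcs or pacemaker implement --wait: `wait_until_settled` and the `is_settled` callback are hypothetical names, and the caller is assumed to supply whatever check tells it that the cluster is idle. The 60-minute default mirrors the timeout mentioned in this comment.

```python
import time
from typing import Callable


def wait_until_settled(
    is_settled: Callable[[], bool],
    timeout_s: float = 3600.0,  # 60 minutes, the default mentioned above
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_settled() until it returns True or the timeout expires.

    is_settled is a hypothetical callback supplied by the caller; it
    should return True once the cluster has no more actions pending.
    Returns True if the cluster settled, False if the timeout was hit.
    """
    deadline = clock() + timeout_s
    while True:
        if is_settled():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The timeout exists only as a safety net against an endless loop; the function does not look for any particular resource state, it just keeps asking whether anything is left to do.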

Comment 5 John Ruemker 2017-07-12 20:20:53 UTC

I'm having trouble understanding what the actual impact is here, as I'm trying to consider whether our customers need a high priority in RHEL 7 Update 5 to have this fixed.

Josef: Your description - as well as Mohammad's - highlights the situation where the node gets fenced as the context in which there is concern over this behavior. Why does the node getting fenced create a problem? If it was fenced, shouldn't its resources be in an inaccessible state, and thus the standby state you see in 'pcs status' is safe to rely on?

Comment 6 Ken Gaillot 2017-07-14 15:58:15 UTC

John: I believe the concern is that when an administrator sees the node status as "standby", they may incorrectly assume that all resources are already stopped, and begin doing maintenance that then leads to the node being fenced (which would not happen if they waited until the resources were stopped).

Something to indicate "standby with active resources" might make it more obvious that they have to wait.

Comment 7 michal novacek 2017-08-04 10:27:11 UTC

qa-ack+: a node put into standby with resources still running will be marked as 'standby with active resources'; reproducer in the initial comment

Comment 8 Ken Gaillot 2017-08-29 21:20:15 UTC

Due to a short time frame and limited capacity, this will not make 7.5.

Comment 9 Klaus Wenninger 2018-08-20 13:07:50 UTC

This is cloned as bz1619253 against pcs.
Because the preconditions for pcs to implement the fix are already met on the pacemaker side, this bz doesn't block bz1619253.
From here on, this bz only covers crm_mon distinguishing between standby nodes that still have resources running on them and those that have already managed to stop all resources.
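The distinction described here can be sketched as a small classifier over crm_mon's XML output. This is a hedged illustration, not the code from the fix: it assumes the XML has a `nodes` element whose `node` children carry `name`, `online`, `standby` and `resources_running` attributes, and the function name `describe_nodes` and the exact label strings are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def describe_nodes(crm_mon_xml: str) -> Dict[str, str]:
    """Map each node name to a human-readable state label.

    Assumes (not confirmed by this bug report) that each
    <nodes>/<node> element has 'name', 'online', 'standby' and
    'resources_running' attributes.
    """
    root = ET.fromstring(crm_mon_xml)
    states: Dict[str, str] = {}
    # Only look at <nodes>/<node>; other sections may reuse the tag name.
    for node in root.findall("./nodes/node"):
        name = node.get("name", "")
        online = node.get("online") == "true"
        standby = node.get("standby") == "true"
        running = int(node.get("resources_running", "0"))
        if not online:
            states[name] = "offline"
        elif standby and running > 0:
            states[name] = "standby with active resources"
        elif standby:
            states[name] = "standby"
        else:
            states[name] = "online"
    return states
```

An operator script could refuse to start maintenance on a node until its label is plain "standby" rather than "standby with active resources".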

Comment 11 Klaus Wenninger 2018-12-20 14:34:18 UTC

The fix is already available upstream.

pacemaker-1.1:

commit 1d8cadce89c8c023b003b75a88aca061bdf2d7df
Author: Klaus Wenninger <klaus.wenninger>
Date:   Fri Aug 24 14:14:54 2018 +0200

    Fix: crm_mon: rhbz1419548: show standby-node with active resources


pacemaker-2:

commit d967942723776dbdf79184f12a688028990bfeaf
Author: Klaus Wenninger <klaus.wenninger>
Date:   Fri Aug 24 14:14:54 2018 +0200

    Fix: crm_mon: rhbz1419548: show standby-node with active resources

Comment 13 Patrik Hagara 2019-03-20 12:57:54 UTC

qa-ack+

reproducer in comment#0 and comment#7

Comment 17 errata-xmlrpc 2019-08-06 12:53:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129
