Bug 1419548
Summary: premature standby status reported by pcs while standby is in progress

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | pacemaker |
| Version | 7.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Target Milestone | rc |
| Target Release | 7.7 |
| Fixed In Version | pacemaker-1.1.20-1.el7 |
| Doc Type | No Doc Update |
| Doc Text | Minor change |
| Reporter | Josef Zimek <pzimek> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | abeekhof, cluster-maint, jruemker, kgaillot, mnovacek, nhostako, phagara, pzimek, sbradley |
| Clones | 1619253 (view as bug list) |
| Bug Blocks | 1420851 |
| Last Closed | 2019-08-06 12:53:38 UTC |
| Type | Bug |
Description
Josef Zimek, 2017-02-06 13:23:42 UTC
From the cluster's point of view, the node *is* in standby (not in the process of going to standby), and moving the resources is the appropriate *reaction* to the node being in standby (not a direct part of being in standby). Have you tried `pcs node standby --wait`? I believe it should already do what you want.

I think the closest we could come to this would be to list a node as "standby with active resources". I know that from a human perspective, going into standby seems like a process that completes when resources are no longer running on the node. But that's too fuzzy for the code. For example, it is possible to configure recurring monitors with role=Stopped. Such a monitor allows the cluster to detect a service that is running somewhere it isn't supposed to be, and stop it. If a node is put into standby, normal monitors will no longer run, but role=Stopped monitors will. So to the cluster, moving resources off the node isn't something that happens just once, when standby is initiated; it is enforced the entire time the node is in standby. At no point does the cluster consider that process "done". It simply sees the node in standby state and adjusts services appropriately.

To clarify, the default wait time for `pcs resource --wait` is 60 minutes. That is just to prevent an infinite wait if the cluster is in some buggy loop; in practice, it converges quickly. The --wait option does not look for anything in particular; it simply waits until the cluster no longer has actions to perform, which makes it well suited to this kind of "fuzzy" situation.

If "standby with active resources" would be useful, we can look into doing that.

I'm having trouble understanding what the actual impact is here, as I'm trying to determine whether our customers need this fixed with high priority in RHEL 7 Update 5.
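The workflow suggested above can be sketched as follows. This is an illustrative CLI fragment, not output from a real cluster; the node name `node1` is a placeholder:

```shell
# Request standby and block until the cluster has no further actions
# to perform (the default --wait timeout only guards against an
# infinite wait; in practice the cluster converges quickly):
pcs node standby node1 --wait

# Only now is it safe to start maintenance on node1.
# Afterwards, bring the node back into service:
pcs node unstandby node1
```

Because `--wait` returns only once the cluster has nothing left to do, it covers resource stops and migrations without the caller having to poll for any particular state.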
Josef: Your description, as well as Mohammad's, highlights the situation where the node gets fenced as the context in which there is concern over this behavior. Why does the node getting fenced create a problem? If it was fenced, shouldn't its resources be in an inaccessible state, and thus the standby state you see in `pcs status` is safe to rely on?

John: I believe the concern is that when an administrator sees the node status as "standby", they may incorrectly assume that all resources are already stopped and begin doing maintenance that then leads to the node being fenced (which would not happen if they waited until the resources were stopped). Something indicating "standby with active resources" might make it more obvious that they have to wait.

qa-ack+: a node put into standby with resources still running will be marked as "standby with active resources"; reproducer in the initial comment.

Due to a short time frame and limited capacity, this will not make 7.5.

This is cloned as bz1619253 against pcs. As the preconditions for pcs to implement the fix are already met on the pacemaker side, this bz does not block bz1619253. Further, this bz deals only with crm_mon distinguishing between nodes that are in standby with resources still running on them and those which have already managed to stop all resources.

The fix is already available upstream:

- pacemaker-1.1: commit 1d8cadce89c8c023b003b75a88aca061bdf2d7df, Author: Klaus Wenninger <klaus.wenninger>, Date: Fri Aug 24 14:14:54 2018 +0200, "Fix: crm_mon: rhbz1419548: show standby-node with active resources"
- pacemaker-2: commit d967942723776dbdf79184f12a688028990bfeaf, Author: Klaus Wenninger <klaus.wenninger>, Date: Fri Aug 24 14:14:54 2018 +0200, "Fix: crm_mon: rhbz1419548: show standby-node with active resources"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
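With the fix in place, an administrator can check whether a standby node still has active resources before starting maintenance. The exact output wording below is illustrative (inferred from the commit subject "show standby-node with active resources"), and `node1` is a placeholder:

```shell
# One-shot cluster status; a standby node that is still running
# resources is flagged distinctly from a fully drained one:
crm_mon -1 | grep -i standby
# illustrative: "Node node1: standby (with active resources)"
# versus, once all resources have stopped: "Node node1: standby"
```

Seeing the "with active resources" qualifier is the cue to keep waiting (or to use `pcs node standby --wait`) before touching the node.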
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129