Bug 1419548 - premature standby status reported by pcs while standby is in progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 7.7
Assignee: Klaus Wenninger
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1420851
Reported: 2017-02-06 13:23 UTC by Josef Zimek
Modified: 2020-03-05 11:32 UTC (History)

Fixed In Version: pacemaker-1.1.20-1.el7
Doc Type: No Doc Update
Doc Text:
Minor change
Clone Of:
Clones: 1619253
Environment:
Last Closed: 2019-08-06 12:53:38 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3336721 | 0 | None | None | None | 2018-08-06 17:34:48 UTC
Red Hat Product Errata | RHBA-2019:2129 | 0 | None | None | None | 2019-08-06 12:54:08 UTC

Description Josef Zimek 2017-02-06 13:23:42 UTC
Description of problem:


After `pcs cluster standby <node>`, `pcs status` reports the node as standby even though some resources are still stopping. This can lead to a situation where the node is reported as standby and is later fenced, e.g. because a filesystem unmount fails with on-fail=fence. In that case the `pcs status` output is unreliable: it misreports the current state, so a user may start other activities such as a system update while resources are still stopping on the node.

Having a separate status for standby in progress (pending-standby) would help to better identify the actual state of the cluster.


Version-Release number of selected component (if applicable):

pacemaker-1.1.15-11.el7_3.2.x86_64




Steps to Reproduce:

1. Set up a cluster with a filesystem resource configured with on-fail=fence
2. Make the filesystem busy, e.g. with a dd copy
3. Put the node into standby and check the status with `pcs status`
4. `pcs status` reports the node as standby, but the filesystem unmount fails and the node is subsequently fenced




Actual results:

`pcs status` reports that the node is in standby mode even though some resources are still stopping



Expected results:

`pcs status` should return pending-standby or similar if not all actions to complete standby are finished. Otherwise an operator will assume standby is achieved and commence work incompatible with a fencing situation (such as patching).


Additional info:

Comment 1 Ken Gaillot 2017-02-06 16:01:14 UTC
From the cluster's point of view, the node *is* in standby (not in the process of going to standby), and moving the resources is the appropriate *reaction* to the node being in standby (not a direct part of being in standby).

Have you tried "pcs node standby --wait"? I believe it should already do what you want.

Comment 4 Ken Gaillot 2017-02-09 16:44:28 UTC
I think the closest we could come to this would be to list a node as "standby with active resources".

I know from a human perspective, going into standby seems like a process that completes when resources are no longer running on the node. But that's too fuzzy for the code. For example, it is possible to configure recurring monitors with role=Stopped. Such a monitor allows the cluster to detect a service that is running somewhere it isn't supposed to be, and stop it. If a node is put into standby, normal monitors will no longer run, but role=Stopped monitors will. So to the cluster, moving resources off the node isn't something that just happens once, when standby is initiated. It is something that is enforced the entire time the node is in standby. At no point does the cluster consider that process "done". It simply sees the node in standby state, and adjusts services appropriately.

To clarify, the default wait time for pcs resource --wait is 60 minutes. That is just to prevent an infinite wait if the cluster is in some buggy loop. In practice, it converges quickly. The --wait option does not look for anything in particular; it simply waits until the cluster no longer has actions to perform, so it is really useful in this kind of "fuzzy" situation.
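
The waiting behaviour described above (block until nothing is left to do, with a generous timeout as a safety net) can be sketched roughly as follows. This is only an illustration, not how pcs or pacemaker implement --wait: the function name `wait_until_idle` and the `has_pending_actions` callback are hypothetical, and the 60-minute default simply mirrors the figure quoted above.

```python
import time


def wait_until_idle(has_pending_actions, timeout=60 * 60, interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until has_pending_actions() returns False or the timeout expires.

    has_pending_actions is a hypothetical callable supplied by the caller;
    it stands in for "the cluster still has actions to perform".
    Returns True if the cluster became idle, False if the timeout was hit.
    """
    deadline = clock() + timeout
    while has_pending_actions():
        if clock() >= deadline:
            # Safety net only: prevents an infinite wait in a buggy loop.
            return False
        sleep(interval)
    return True


if __name__ == "__main__":
    # Simulated cluster that still has work on the first two polls.
    polls = iter([True, True, False])
    print(wait_until_idle(lambda: next(polls), timeout=5, interval=0))
```

Note that the loop checks no particular resource or node; it only asks whether anything is still pending, which is why this style of wait fits the "fuzzy" case.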

If "standby with active resources" would be useful, we can look into doing that.

Comment 5 John Ruemker 2017-07-12 20:20:53 UTC
I'm having trouble understanding what the actual impact is here, as I'm trying to consider whether our customers need a high priority in RHEL 7 Update 5 to have this fixed.

Josef: Your description - as well as Mohammad's - highlights the situation where the node gets fenced as the context in which there is concern over this behavior. Why does the node getting fenced create a problem? If it was fenced, shouldn't its resources be in an inaccessible state, and thus the standby state you see in 'pcs status' is safe to rely on?

Comment 6 Ken Gaillot 2017-07-14 15:58:15 UTC
John: I believe the concern is that when an administrator sees the node status as "standby", they may incorrectly assume that all resources are already stopped, and begin doing maintenance that then leads to the node being fenced (which would not happen if they waited until the resources were stopped).

Something to indicate "standby with active resources" might make it more obvious that they have to wait.

Comment 7 michal novacek 2017-08-04 10:27:11 UTC
qa-ack+: node put to standby with resources still running will be marked as
'standby with active resources', reproducer in the initial comment

Comment 8 Ken Gaillot 2017-08-29 21:20:15 UTC
Due to a short time frame and limited capacity, this will not make 7.5.

Comment 9 Klaus Wenninger 2018-08-20 13:07:50 UTC
This is cloned as bz1619253 against pcs.
As the preconditions for pcs to implement the fix are already met on the pacemaker side, this bz doesn't block bz1619253.
From here on, this bz deals only with crm_mon distinguishing between nodes that are in standby with resources still running on them and nodes that have already stopped all resources.
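
To illustrate the distinction, here is a minimal Python sketch that classifies nodes from crm_mon's XML status output. It assumes the output (`crm_mon --as-xml` in pacemaker 1.1) has `<node>` elements under `<nodes>` carrying `online`, `standby` and `resources_running` attributes. The sample document and the function name `classify_nodes` are invented for this example and are not taken from the actual fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample of crm_mon XML output. Assumption: each
# <node> under <nodes> carries "standby" and "resources_running" attributes.
SAMPLE = """\
<crm_mon>
  <nodes>
    <node name="node1" online="true" standby="true" resources_running="2"/>
    <node name="node2" online="true" standby="true" resources_running="0"/>
    <node name="node3" online="true" standby="false" resources_running="3"/>
  </nodes>
</crm_mon>
"""


def classify_nodes(xml_text):
    """Map node name -> coarse state, separating the two standby cases."""
    states = {}
    root = ET.fromstring(xml_text)
    # Only look at <nodes>/<node>; <node> elements nested elsewhere
    # (e.g. under resources) are deliberately ignored.
    for node in root.findall("nodes/node"):
        name = node.get("name")
        standby = node.get("standby") == "true"
        running = int(node.get("resources_running", "0"))
        if standby and running > 0:
            states[name] = "standby with active resources"
        elif standby:
            states[name] = "standby"
        elif node.get("online") == "true":
            states[name] = "online"
        else:
            states[name] = "offline"
    return states


if __name__ == "__main__":
    for name, state in classify_nodes(SAMPLE).items():
        print(name, "->", state)
```

An operator script could use a check like this to hold off maintenance until no node is reported as "standby with active resources".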

Comment 11 Klaus Wenninger 2018-12-20 14:34:18 UTC
The fix is already available upstream:

pacemaker-1.1:

commit 1d8cadce89c8c023b003b75a88aca061bdf2d7df
Author: Klaus Wenninger <klaus.wenninger@aon.at>
Date:   Fri Aug 24 14:14:54 2018 +0200

    Fix: crm_mon: rhbz1419548: show standby-node with active resources


pacemaker-2:

commit d967942723776dbdf79184f12a688028990bfeaf
Author: Klaus Wenninger <klaus.wenninger@aon.at>
Date:   Fri Aug 24 14:14:54 2018 +0200

    Fix: crm_mon: rhbz1419548: show standby-node with active resources

Comment 13 Patrik Hagara 2019-03-20 12:57:54 UTC
qa-ack+

reproducer in comment#0 and comment#7

Comment 17 errata-xmlrpc 2019-08-06 12:53:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129

