Bug 1419548
Summary: premature standby status reported by pcs while standby is in progress

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | pacemaker |
| Version | 7.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Target Milestone | rc |
| Target Release | 7.7 |
| Fixed In Version | pacemaker-1.1.20-1.el7 |
| Doc Type | No Doc Update |
| Doc Text | Minor change |
| Reporter | Josef Zimek <pzimek> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | abeekhof, cluster-maint, jruemker, kgaillot, mnovacek, nhostako, phagara, pzimek, sbradley |
| Clones | 1619253 (view as bug list) |
| Bug Blocks | 1420851 |
| Last Closed | 2019-08-06 12:53:38 UTC |
| Type | Bug |
Description
Josef Zimek, 2017-02-06 13:23:42 UTC
From the cluster's point of view, the node *is* in standby (not in the process of going to standby), and moving the resources is the appropriate *reaction* to the node being in standby (not a direct part of being in standby). Have you tried `pcs node standby --wait`? I believe it should already do what you want.

I think the closest we could come to this would be to list a node as "standby with active resources". I know that from a human perspective, going into standby seems like a process that completes when resources are no longer running on the node. But that's too fuzzy for the code. For example, it is possible to configure recurring monitors with role=Stopped. Such a monitor allows the cluster to detect a service that is running somewhere it isn't supposed to be, and stop it. If a node is put into standby, normal monitors will no longer run, but role=Stopped monitors will. So to the cluster, moving resources off the node isn't something that happens just once, when standby is initiated; it is enforced the entire time the node is in standby. At no point does the cluster consider that process "done". It simply sees the node in standby state and adjusts services appropriately.

To clarify, the default wait time for `pcs resource --wait` is 60 minutes. That is just to prevent an infinite wait if the cluster is in some buggy loop; in practice, it converges quickly. The --wait option does not look for anything in particular; it simply waits until the cluster no longer has actions to perform, which makes it well suited to this kind of "fuzzy" situation.

If "standby with active resources" would be useful, we can look into doing that.

I'm having trouble understanding what the actual impact is here, as I'm trying to determine whether our customers need this fixed with high priority in RHEL 7 Update 5.
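The workflow suggested above can be sketched as follows. This is an illustrative CLI fragment, not output from a real cluster; the node name `node1` is a placeholder:

```shell
# Request standby and block until the cluster has no further actions
# to perform (the default --wait timeout only guards against an
# infinite wait; in practice the cluster converges quickly):
pcs node standby node1 --wait

# Only now is it safe to start maintenance on node1.
# Afterwards, bring the node back into service:
pcs node unstandby node1
```

Because `--wait` returns only once the cluster has nothing left to do, it covers resource stops and migrations without the caller having to poll for any particular state.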
Josef: Your description, as well as Mohammad's, highlights the situation where the node gets fenced as the context in which there is concern over this behavior. Why does the node getting fenced create a problem? If it was fenced, shouldn't its resources be in an inaccessible state, and thus the standby state you see in `pcs status` is safe to rely on?

John: I believe the concern is that when an administrator sees the node status as "standby", they may incorrectly assume that all resources are already stopped and begin doing maintenance that then leads to the node being fenced (which would not happen if they waited until the resources were stopped). Something indicating "standby with active resources" might make it more obvious that they have to wait.

qa-ack+: a node put into standby with resources still running will be marked as "standby with active resources"; reproducer in the initial comment.

Due to a short time frame and limited capacity, this will not make 7.5.

This is cloned as bz1619253 against pcs. As the preconditions for pcs to implement the fix are already met on the pacemaker side, this bz does not block bz1619253. Further, this bz deals only with crm_mon distinguishing between nodes that are in standby with resources still running on them and those which have already managed to stop all resources.

The fix is already available upstream:

- pacemaker-1.1: commit 1d8cadce89c8c023b003b75a88aca061bdf2d7df, Author: Klaus Wenninger <klaus.wenninger>, Date: Fri Aug 24 14:14:54 2018 +0200, "Fix: crm_mon: rhbz1419548: show standby-node with active resources"
- pacemaker-2: commit d967942723776dbdf79184f12a688028990bfeaf, Author: Klaus Wenninger <klaus.wenninger>, Date: Fri Aug 24 14:14:54 2018 +0200, "Fix: crm_mon: rhbz1419548: show standby-node with active resources"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
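With the fix in place, an administrator can check whether a standby node still has active resources before starting maintenance. The exact output wording below is illustrative (inferred from the commit subject "show standby-node with active resources"), and `node1` is a placeholder:

```shell
# One-shot cluster status; a standby node that is still running
# resources is flagged distinctly from a fully drained one:
crm_mon -1 | grep -i standby
# illustrative: "Node node1: standby (with active resources)"
# versus, once all resources have stopped: "Node node1: standby"
```

Seeing the "with active resources" qualifier is the cue to keep waiting (or to use `pcs node standby --wait`) before touching the node.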
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129