Bug 1569615
Summary: Difference in behavior between the latest pacemaker package > 1.1.18-11.el7 compared to the previous version 1.1.16-12.el7_4.8

| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jamie Reding <jamiere> |
|---|---|---|---|
| Component: | doc-High_Availability_Add-On_Reference | Assignee: | Steven J. Levine <slevine> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Priority: | unspecified |
| Version: | 7.4 | Target Milestone: | rc |
| Target Release: | --- | Keywords: | Documentation |
| Hardware: | Unspecified | OS: | Linux |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| CC: | abeekhof, amit.banerjee, arsing, cluster-maint, dkinkead, kgaillot, rhel-docs | Last Closed: | 2018-08-27 20:58:56 UTC |
Description (Jamie Reding, 2018-04-19 15:10:08 UTC)
The new behavior is actually correct. Pacemaker always schedules promote actions after *all* pending start actions for the clone have completed (regardless of whether they are on the same node or not).

The 7.4 behavior arises from Bug 1519379, in which Pacemaker mistakenly schedules a stop action instead of full recovery when a demote action fails. (The test agent used here fails the demote as well as the master monitor in this incident.) Because there is no start action immediately scheduled on the original master, the promote is allowed to proceed.

Interestingly, the fix for Bug 1519379 is planned for 7.6 and is not in the 7.5 packages, so 7.5 is coincidentally arriving at the correct behavior for other reasons. In 7.5, it does initially schedule a stop instead of full recovery -- but after the stop action completes and before the promote action can be initiated, the transition is interrupted by the result of clone notifications from the previous transition. Thus, a new transition is calculated which does schedule a start and a promote, and the promote is unable to proceed due to the failed start.

While the observed behavior is not a bug, I am keeping this bz open for two unresolved questions:

* Why are late clone notifications interrupting the transition in 7.5 and not 7.4?
* The promote-after-all-starts behavior is intentional because it is the safest course of action. However, the situation is very similar to the purpose of the "interleave" clone meta-attribute, which controls whether another clone that depends on this clone must wait for all instances of the original clone to be up, or can start after just the local instance on the same node is up. It may make sense to make interleave apply to this situation as well, allowing promotion to occur after the local instance is started, regardless of pending starts of other instances.

To give a bit more information on the sequence of events here:

In both 7.4 and 7.5, when the master monitor fails, these actions are scheduled:

* demote test1-master on the original master
* promote test1-master on another node

The bug is that the original master is scheduled for a demotion and a full stop, rather than demote-stop-start. Since there is no start, the promote is allowed to proceed in parallel with the demote.

In both versions, the demote fails before the promote can be attempted. At this point, due to the bug, both versions schedule:

* stop test1-master on the original master
* promote test1-master on another node

In 7.4, the stop succeeds, and the promote succeeds. test1-master will not be started (in slave mode) on the original master until the next natural transition (due to some unrelated event or the cluster recheck interval).

In 7.5, the stop succeeds, but before the promote can be initiated, late notification results abort the transition. Since the demote failure has been handled, the bug no longer applies, and the correct sequence is now scheduled:

* start test1-master on the original master
* promote test1-master on another node

Since the start never succeeds, the promote never happens. This is the intended behavior, as Pacemaker knows nothing about how the cloned service functions and cannot assume that promotion is safe while another node is starting in slave mode. (As mentioned before, the interleave option is probably a good way of allowing the user to say that it is safe, to avoid situations like this.)
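For context, a minimal sketch of how a clone's interleave meta-attribute is set with pcs today, using the test1-master name from the example above. Note that, per this discussion, interleave currently governs ordering between clones that depend on each other; applying it to promote-after-start ordering within a single clone is only a proposal here.

```sh
# Set the interleave meta-attribute on the existing master/slave clone.
# Today this only affects ordering relative to other clones that depend
# on this one; it does not (yet) change promotion ordering within the clone.
pcs resource meta test1-master interleave=true

# Inspect the clone's configuration, including its meta-attributes
pcs resource show test1-master
```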
(In reply to Ken Gaillot from comment #2)
> * Why are late clone notifications interrupting the transition in 7.5 and
> not 7.4?

This was a tricky one: a long-existing bug looks like a regression because a new feature triggers it.

As of 7.5, the fix for Bug 1461976 switched the default for the record-pending property to true. This allows status displays to show actions that are in progress, rather than just their last known state.

The newly found bug treats the record indicating a pending notify action as indicating its successful completion. Thus, when the notify action actually completes, it is treated as a "late" result that interrupts the transition.

I will file a separate bz for that issue, and this bz can focus on the question of whether interleave should affect whether promotion waits for starts on other nodes.

(In reply to Ken Gaillot from comment #4)
> The newly found bug treats the record indicating a pending notify action as
> indicating its successful completion. Thus, when the notify action actually
> completes, it is treated as a "late" result that interrupts the transition.
>
> I will file a separate bz for that issue

Created Bug 1570130

After discussing the issue with Andrew Beekhof, we will continue to order promotes after all starts. A start action is likely to change master scores, requiring a new calculation of which node will be master. Waiting for all pending starts to complete avoids unnecessarily moving the master repeatedly, and avoids resource agents having to carry intelligence for handling simultaneous promotion on one node and starting on another node, which could have race conditions.

The behavior that arose here is a known risk of setting start-failure-is-fatal to false. When that is the case, it allows one faulty node (unable to start a resource) to hold up all dependent actions. That is why start-failure-is-fatal defaults to true. The risk of start-failure-is-fatal=false can be mitigated by setting a low migration-threshold so that other actions can proceed after that many failures. (Currently, migration-threshold is per resource, so unfortunately you can't set one threshold for start failures and another for monitor failures, but that capability is planned as the feature Bug 1371576.)

The separate Bug 1570130 that created the timing issue that exposed this behavior has been fixed and is planned for the next z-stream batch as Bug 1570618. Be aware that this will restore the 7.4 behavior, but that is the result of Bug 1519379, which will be fixed in 7.6, which will bring us back to the behavior described here.

I'm reassigning this bz as a documentation issue.

Docs: Where we document the behavior of start-failure-is-fatal, we should discuss the risk described in this comment.
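As a reference sketch of the configuration under discussion: start-failure-is-fatal=false as the customer sets it, combined with the migration-threshold mitigation described above. The resource name test1 and the threshold value are illustrative, not taken from the actual cluster configuration.

```sh
# Cluster-wide property; defaults to true, shown here set to false as in
# the customer configuration being discussed
pcs property set start-failure-is-fatal=false

# Per-resource mitigation: move the resource off a node after this many
# failures (currently one threshold covers start and monitor failures alike)
pcs resource meta test1 migration-threshold=3
```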
As I had said earlier, we intentionally ask customers to set start-failure-is-fatal to false. The resource may have failed to start now, but it might also succeed in a few minutes. We don't want start failures to prevent the replica from ever starting up on that node again until manual intervention. This is especially important since many customers run with the bare minimum number of cluster nodes (three), so they can only tolerate one replica being permanently removed in this fashion. So the recommendation to set start-failure-is-fatal to true does not work for our customers. Similarly, migration-threshold=N also permanently blocks the resource after N start failures, so it does not work for our customers.

The only alternative I can see is that we change our resource agent to make `start` a no-op success and *actually* start the resource in `monitor`, or tell our customers to run a cron job with `pcs resource cleanup` every minute, or some other similarly hacky workaround.

I don't disagree that promotes should be ordered after starts, since starts can change the promotion score, but if a start has failed I don't see why Pacemaker needs to block the promote on it. So while the behavior of 7.4 and earlier was the result of a bug, it was also one that made more intuitive sense.

Alternatively, perhaps you can add a new exit code for the start action that opts in to the "I failed, but I promise I'm fine with you promoting someone else" behavior.

(In reply to Arnav Singh from comment #7)
> As I had said earlier, we intentionally ask customers to set
> start-failure-is-fatal to false. The resource may have failed to start now,
> but it might also succeed in a few minutes. We don't want start failures to
> prevent the replica from ever starting up on that node again until manual
> intervention.

One does not necessarily imply the other. We encourage start-failure-is-fatal=true _in_combination_with_ setting a failure timeout. This allows the cluster to attempt restarting the resource after some suitable period of time (during which the cluster is able to make progress promoting or demoting other copies). We use this combination extensively in other deployments.

Thanks. I need to confirm this on the older versions of pacemaker that we support, but at least for 1.1.16 and 1.1.18 on RHEL the combination of start-failure-is-fatal=true + failure-timeout=30s does work as you say.

I know we looked at failure-timeout during development last year, but I'm not sure why we discounted it. It *might* have been because the doc at https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html talks about it in the context of migration-threshold, then explains how migration-threshold does not apply if the resource fails the start action.

Updated info is on the Portal in table 12.1: https://doc-stage.usersys.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/#tb-clusterprops-HAAR
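For reference, a minimal sketch of the start-failure-is-fatal=true plus failure-timeout combination confirmed above on 1.1.16 and 1.1.18. The resource name test1 is illustrative; the 30s value mirrors the failure-timeout mentioned in the comment.

```sh
# Keep the default of treating a start failure as fatal on that node...
pcs property set start-failure-is-fatal=true

# ...but let the failure expire so the cluster retries the start after 30s,
# while promotion and demotion of other copies can proceed in the meantime
pcs resource meta test1 failure-timeout=30s
```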