Bug 1360768
| Summary: | when doing a pcs disable and then quickly an enable of a multi-master galera resource, the resource can end up in stopped/unmanaged state | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Michele Baldessari <michele> |
| Component: | resource-agents | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 7.3 | CC: | abeekhof, agk, cfeist, cluster-maint, dciabrin, fdinitto, kgaillot, mnovacek, oalbrigt, rscarazz, ushkalim |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-3.9.5-84.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 14:55:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michele Baldessari
2016-07-27 12:41:19 UTC
Fabio suggested that maybe the best way is to just make pcs resource {disable,enable} --wait the default and be done with it?

The logic implemented in the Galera resource agent assumes that when a "promote" is triggered by pacemaker, one of the conditions below holds:
. at least one galera node is running ("master" in pacemaker)
. or no galera node is running and a bootstrap node has been selected (attribute stored in the CIB)

If neither condition is true, the resource agent errors out, because:
. it cannot bootstrap a new cluster, because it doesn't know whether it can act as a bootstrap node (it needs to know the seqno on all the nodes to find out)
. it cannot determine a new bootstrap node during the "promote" op, because it has to wait for all nodes to report their seqno status

Since the resource is configured with "op promote on-fail=block", no retry is attempted to recover.

Now looking at the logs, when pacemaker is asked to "disable resource galera", it first "demotes" all 3 nodes. As soon as the 3 nodes are demoted, pacemaker honours the "pcs resource enable" command by scheduling a "promote" on all the nodes.

Andrew, I would have expected that a promote command would only be triggered based on the outcome of the resource agent's logic. It seems like at this step the "demote" op has succeeded, but the "master" status is still set on the node and thus it's eligible for promotion. Is the actual behaviour the expected semantics?

I can see a few options to avoid this pitfall:
1. make the client use --wait to avoid triggering any race
2. as pointed out in #c2, make --wait the default for "pcs resource disable" (which probably has many side effects)
3. change the behaviour of pcs to prevent promotion if a demotion is in progress, like "abort, due to demote in progress, use --force to override".
4. ensure that when a "demote" is ok, the node cannot be promoted back to master without the agent requesting so (like it's done today with $CRM_MASTER -v 100).

At this point, I think either 3 or 4 would make sense.

I would advise against the RA using --wait or pcs making --wait the default. It waits for the cluster (crmd/pengine) to have no scheduled activity. Having the RA wait for cluster inactivity could be a nasty feedback loop. Also, it is possible for the cluster to get into a state where it repeatedly schedules an action that can't complete, causing --wait to never return (an example is that the cluster will continuously try to reconnect to a gracefully stopped remote node until the remote node comes back or it fails enough times).

The better place to use --wait would be whatever is making the enable/disable calls.

I haven't investigated the logs yet, but I would guess your point 4 is relevant: the RA can call crm_master to influence whether a node can be promoted.

(In reply to Ken Gaillot from comment #5)
> I would advise against the RA using --wait or pcs making --wait the default.
> It waits for the cluster (crmd/pengine) to have no scheduled activity.
> Having the RA wait for cluster inactivity could be a nasty feedback loop.

It's worse than that: it's a 100% guaranteed deadlock unless called from a _recurring_ monitor operation. For every other operation, the command waits for the cluster to be idle, but the cluster cannot be idle until the operation completes.
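To make option 1 concrete, the caller-side pattern recommended here boils down to something like the following sketch. The resource name and the 300-second settle timeout are illustrative values only, and the commands are meant to be run by the external caller (CI script, installer, admin shell), never from inside the resource agent, to avoid the deadlock just described:

```bash
# Disable and re-enable from the caller; each pcs call returns only
# once the cluster has settled (and reports an error if the requested
# state is not reached in time), so the enable cannot race an
# in-flight demote.
pcs resource disable galera --wait=300
pcs resource enable galera --wait=300
```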
> Also, it is possible for the cluster to get into a state where it repeatedly
> schedules an action that can't complete, causing --wait to never return (an
> example is that the cluster will continuously try to reconnect to a
> gracefully stopped remote node until the remote node comes back or it fails
> enough times).
>
> The better place to use --wait would be whatever is making the
> enable/disable calls.

Yes, and hopefully that isn't the RA.

> I haven't investigated the logs yet, but I would guess your point 4 is
> relevant: the RA can call crm_master to influence whether a node can be
> promoted.

(In reply to Damien Ciabrini from comment #4)
> The logic implemented in the Galera resource agent assumes that when a
> "promote" is triggered by pacemaker, one of the conditions below holds:
> . at least one galera node is running ("master" in pacemaker)
> . or no galera node is running and a bootstrap node has been selected
> (attribute stored in the CIB)
>
> If neither condition is true, the resource agent errors out, because:
> . it cannot bootstrap a new cluster, because it doesn't know whether it can
> act as a bootstrap node (it needs to know the seqno on all the nodes to find out)
> . it cannot determine a new bootstrap node during the "promote" op, because
> it has to wait for all nodes to report their seqno status
>
> Since the resource is configured with "op promote on-fail=block", no retry
> is attempted to recover.
>
> Now looking at the logs, when pacemaker is asked to "disable resource
> galera", it first "demotes" all 3 nodes.

By an admin or installer? Or by the agent?

> As soon as the 3 nodes are demoted, pacemaker honours the "pcs resource
> enable" command by scheduling a "promote" on all the nodes.
>
> Andrew, I would have expected that a promote command would only be
> triggered based on the outcome of the resource agent's logic.

This is not the case. It is up to the external entity to ensure a command has been sufficiently enacted before issuing subsequent commands.

> It seems like at this step the "demote" op has succeeded, but the "master"
> status is still set on the node and thus it's eligible for promotion. Is
> the actual behaviour the expected semantics?

Yes.

> I can see a few options to avoid this pitfall:
> 1. make the client use --wait to avoid triggering any race

This ^^^

Like every API call in OSP, they are asynchronous by default. At least pacemaker gives the caller an easy way to wait for them to take effect.

> 2. as pointed out in #c2, make --wait the default for "pcs resource
> disable" (which probably has many side effects)

Dangerous for use in agents.

> 3. change the behaviour of pcs to prevent promotion if a demotion is in
> progress, like "abort, due to demote in progress, use --force to override".

I guess it could infer this by checking for target-role plus the same "is the cluster busy" check that --wait uses. But really, just go with #1.

> 4. ensure that when a "demote" is ok, the node cannot be promoted back to
> master without the agent requesting so (like it's done today with
> $CRM_MASTER -v 100).
>
> At this point, I think either 3 or 4 would make sense.

I don't see why we should push the complexity into pcs or the agent. It's far more reasonable for the caller to be more patient/careful about what instructions they're sending, especially given how easy it is to do so.

Just to add more context on this, the way we got into the first error was while testing a deployed tripleo/HA/Newton env.
Those tests were done using our CI, specifically this [1] project, and if you look at this [2] function you will see that each action played on a resource has a corresponding wait time to reach the desired state. If the status is not reached, then we consider the test a failure.

Given that we hit this just one time and it can be considered a race, and given that maybe something is missing in the way the script handles the resource status (we are treating a multi-state resource), maybe the wait is not the issue we need to check here.

Other info:
1) Before the promotion failure that we see in the output, the galera resource was correctly stopped (look at Jul 26 17:53:59 in the logs, status was Stopped on all the nodes) and after a while (so when the start operation was launched by the tests) some errors are present in the logs.
2) Maybe it's worth having a specific bug for this, but while creating the resource there's also this message in the log:

Jul 26 17:36:41 [38651] overcloud-controller-2 lrmd: notice: operation_finished: galera_start_0:45771:stderr [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]

[1] https://github.com/rscarazz/tripleo-director-ha-test-suite
[2] https://github.com/rscarazz/tripleo-director-ha-test-suite/blob/master/include/functions#L125-L148

Another important piece of additional info is that we were testing the NG architecture for Newton.

We ended up reproducing the error in today's new deployment. The test stops every systemd resource, stops Galera and Rabbitmq, and then starts every systemd resource. The objective is to have every systemd resource started without any problem, even if core services are stopped.

This is the output of the test command:

Thu Jul 28 10:41:22 UTC 2016 - Populationg overcloud elements...OK
Thu Jul 28 10:41:22 UTC 2016 - Test: Stop every systemd resource, stop Galera and Rabbitmq, Start every systemd resource
Thu Jul 28 10:41:22 UTC 2016 * Step 1: disable all the systemd resources
Thu Jul 28 10:41:22 UTC 2016 - Performing action disable on resource mongod .............OK
Thu Jul 28 10:41:37 UTC 2016 - Performing action disable on resource openstack-cinder-volume OK
Thu Jul 28 10:41:37 UTC 2016 - Performing action disable on resource httpd OK
Thu Jul 28 10:41:38 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:38 UTC 2016 * Step 2: disable core services
Thu Jul 28 10:41:38 UTC 2016 - Performing action disable on resource galera OK
Thu Jul 28 10:41:39 UTC 2016 - Performing action disable on resource rabbitmq-clone OK
Thu Jul 28 10:41:39 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:40 UTC 2016 * Step 3: enable each resource one by one and check the status
Thu Jul 28 10:41:40 UTC 2016 - Performing action enable on resource mongod .......OK
Thu Jul 28 10:41:48 UTC 2016 - Performing action enable on resource openstack-cinder-volume OK
Thu Jul 28 10:41:48 UTC 2016 - Performing action enable on resource httpd OK
Thu Jul 28 10:41:48 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:49 UTC 2016 - Waiting 10 seconds to recover environment
Thu Jul 28 10:41:59 UTC 2016 - Recovery: Enable all systemd and core resources, cleanup failed actions
Thu Jul 28 10:41:59 UTC 2016 * Step 1: enable core resources
Thu Jul 28 10:41:59 UTC 2016 - Performing action enable on resource galera OK
Thu Jul 28 10:41:59 UTC 2016 - Performing action enable on resource rabbitmq-clone OK
Thu Jul 28 10:42:00 UTC 2016 * Step 2: enable all the systemd resources
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource mongod OK
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource openstack-cinder-volume OK
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource httpd OK
Thu Jul 28 10:42:01 UTC 2016 * Step 3: Waiting all resources to start ...........................................................Problems during recovery!
Thu Jul 28 10:43:38 UTC 2016 - List of cluster's failed actions: Cluster has failed actions: galera
Recovery /home/heat-admin/tripleo-director-ha-test-suite/recovery/recovery_pacemaker-light FAILED!

So basically, the systemd resources stop fine (step 1), the core resources too (step 2), and the systemd resources start again fine (step 3). Up to here we're OK and the test part is done; what breaks is the recovery part, in which the enablement of the core resources fails because of galera.

But we need to add a consideration here: step 1 of the recovery in theory WAITS for the resource to reach the desired status, so in this case we have two options:
1) There's something wrong with the way the script calculates galera status;
2) Galera breaks after starting correctly.
We are doing tests to choose between these two.

(In reply to Raoul Scarazzini from comment #8)
> Just to add more context on this, the way we got into the first error was
> while testing a deployed tripleo/HA/Newton env.
> Those tests were done using our CI, specifically this [1] project, and if
> you look at this [2] function you will see that each action played on a
> resource has a corresponding wait time to reach the desired state.

wait_resource_status() is wrong.

It will incorrectly return as soon as the service is Stopped/Started on _any_ node, not _all_ nodes.

Depending on the setup, this could be immediately, before _any_ action has been performed... all you need is a compute node under control of pacemaker remote (which never runs the service) or a control node with 'pcs resource ban' for that resource.

Please use --wait when enabling and disabling resources.

(In reply to Andrew Beekhof from comment #10)
> (In reply to Raoul Scarazzini from comment #8)
> > Just to add more context on this, the way we got into the first error was
> > while testing a deployed tripleo/HA/Newton env.
> > Those tests were done using our CI, specifically this [1] project, and if
> > you look at this [2] function you will see that each action played on a
> > resource has a corresponding wait time to reach the desired state.
>
> wait_resource_status() is wrong.
> It will incorrectly return as soon as the service is Stopped/Started on
> _any_ node, not _all_ nodes.

I agree that using --wait would be better, but wait_resource_status is doing the right thing in my opinion. It waits until either:
1) the timeout is reached; or
2) there is no output, which means that the desired status is reached on *all* the nodes, not just one of them. If there is at least one node with a different status, the output will not be empty and the check becomes a FAILURE (once the timeout is reached).

Do you agree? Am I missing something?
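For context, the style of polling wait being debated here looks roughly like the sketch below. The function name, the use of "pcs status resources", and the string matching are illustrative assumptions, not the real wait_resource_status; the caveat comment also shows the kind of false positive discussed in the thread.

```bash
#!/bin/bash
# Illustrative polling wait: succeed only when no line for the resource
# reports a state other than the desired one.
wait_resource_state() {
    local resource="$1" desired="$2" timeout="${3:-120}"
    local elapsed=0 pending
    while [ "$elapsed" -lt "$timeout" ]; do
        # Lines mentioning the resource but not the desired state mean
        # at least one node has not converged yet.
        pending=$(pcs status resources | grep -w "$resource" | grep -v "$desired")
        if [ -z "$pending" ]; then
            # Caveat: if the resource produces no matching lines at all
            # (e.g. a master/slave set printed under a different name),
            # this passes vacuously -- the kind of false positive that
            # made master/slave resources need special handling.
            return 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}

# Example: wait_resource_state galera Started 300
```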
In addition: the only problem with wait_resource_status was that Master/Slave resources were not considered (now I have patched the functions, see here https://github.com/rscarazz/tripleo-director-ha-test-suite/blob/master/include/functions#L78), and this was causing the exit status to always be 0 for this kind of resource (galera), causing the problem that started this bug.

> Depending on the setup, this could be immediately, before _any_ action has
> been performed... all you need is a compute node under control of pacemaker
> remote (which never runs the service) or a control node with 'pcs resource
> ban' for that resource.
>
> Please use --wait when enabling and disabling resources.

What concerns me about using --wait (even if I agree that it seems a more reliable approach) is that it also includes all the dependent resources in the wait time. The function wait_resource_status looks just at the desired resource and exits once it reaches the status; but in any case the overall cluster status is checked (if you look at the tests there's always a check_failed_actions after every action), so if a dependent resource has problems the test will fail anyhow.

(In reply to Raoul Scarazzini from comment #11)
> (In reply to Andrew Beekhof from comment #10)
> > (In reply to Raoul Scarazzini from comment #8)
> > > Just to add more context on this, the way we got into the first error was
> > > while testing a deployed tripleo/HA/Newton env.
> > > Those tests were done using our CI, specifically this [1] project, and if
> > > you look at this [2] function you will see that each action played on a
> > > resource has a corresponding wait time to reach the desired state.
> >
> > wait_resource_status() is wrong.
> > It will incorrectly return as soon as the service is Stopped/Started on
> > _any_ node, not _all_ nodes.
>
> I agree that using --wait would be better, but wait_resource_status is
> doing the right thing in my opinion,

It clearly wasn't, otherwise this bug wouldn't exist. You've already had to update it this weekend to support master/slave. However, the race conditions remain, so the function cannot be considered reliable.

> What concerns me about using --wait (even if I agree that it seems a more
> reliable approach) is that it also includes all the dependent resources in
> the wait time.
> The function wait_resource_status looks just at the desired resource and
> exits once it reaches the status; but in any case the overall cluster
> status is checked (if you look at the tests there's always a
> check_failed_actions after every action), so if a dependent resource has
> problems the test will fail anyhow.

A cluster that can't reach a stable state has a problem that should be investigated. Failing the test is the right thing to do.

So heads up: I think what confused the resource agent is that pacemaker is allowed to promote the resource right after it has been demoted, without going through a start or monitor op first.

The agent wrongly assumes that it is the only one which can promote a resource by raising its master score. During a demote, the master score is not cleared; the agent lets pacemaker clear it after the demote op finishes. However, pacemaker can be asked to promote the resource right after a demote, which is likely what was triggered in this bz. In such a situation, pacemaker is allowed to perform the promotion since the master score is still set.

Changing the resource agent so that it clears the master score should fix the issue.
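As an illustration of that kind of change, here is a minimal sketch of a demote-path helper that drops the node's master score, assuming the agent defines CRM_MASTER along the lines of the "$CRM_MASTER -v 100" convention quoted earlier; the authoritative change is the upstream pull request referenced in the next comment, not this snippet.

```bash
# Assumed helper, mirroring how the agent sets the master score elsewhere.
CRM_MASTER="crm_master -l reboot"

clear_master_score() {
    # Delete this node's master score instead of leaving a stale value,
    # so pacemaker cannot schedule a promote right after the demote
    # until the agent explicitly re-grants eligibility.
    $CRM_MASTER -D
}

# e.g. invoked at the end of a successful demote, before returning
# $OCF_SUCCESS.
```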
Proposed fix in https://github.com/ClusterLabs/resource-agents/pull/885 has been merged upstream.

Verified on: resource-agents-3.9.5-84.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844

*** Bug 1380451 has been marked as a duplicate of this bug. ***