Bug 1360768
| Summary: | when doing a pcs disable and then quickly an enable of a multi-master galera resource, the resource can end up in stopped/unmanaged state | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Michele Baldessari <michele> |
| Component: | resource-agents | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 7.3 | CC: | abeekhof, agk, cfeist, cluster-maint, dciabrin, fdinitto, kgaillot, mnovacek, oalbrigt, rscarazz, ushkalim |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-3.9.5-84.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 14:55:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michele Baldessari
2016-07-27 12:41:19 UTC
Fabio suggested that maybe the best way is to just make pcs resource {disable,enable} --wait the default and be done with it?

The logic implemented in the Galera resource agent assumes that when a "promote" is triggered by pacemaker, one of the conditions below holds:
. at least one galera node is running ("master" in pacemaker)
. or no galera node is running and a bootstrap node has been selected (attribute stored in the CIB)

If neither condition is true, the resource agent errors out, because:
. it cannot bootstrap a new cluster, because it doesn't know whether it can act as a bootstrap node (it needs to know the seqno on all the nodes to find out)
. it cannot determine a new bootstrap node during the "promote" op, because it has to wait for all nodes to report their seqno status

Since the resource is configured with "op promote on-fail=block", no retry is attempted to recover.

Now looking at the logs, when pacemaker is asked to "disable resource galera", it first "demotes" all 3 nodes. As soon as the 3 nodes are demoted, pacemaker honours the "pcs resource enable" command by scheduling a "promote" on all the nodes.

Andrew, I would have expected that a promote command would only be triggered based on the outcome of the resource agent's logic. It seems like at this step the "demote" op has succeeded, but the "master" status is still set on the node and thus it's eligible for promotion. Is the actual behaviour the expected semantics?

I can see a few options to avoid this pitfall:
1. make the client use --wait to avoid triggering any race
2. as pointed out in #c2, make --wait the default for "pcs resource disable" (which probably has many side effects)
3. change the behaviour of pcs to prevent promotion if a demotion is in progress, like "abort, due to demote in progress, use --force to override".
4. ensure that when a "demote" is ok, the node cannot be promoted back to master without the agent requesting so (like it's done today with $CRM_MASTER -v 100).

At this point, I think either 3 or 4 would make sense.

I would advise against the RA using --wait or pcs making --wait the default. It waits for the cluster (crmd/pengine) to have no scheduled activity. Having the RA wait for cluster inactivity could be a nasty feedback loop. Also, it is possible for the cluster to get into a state where it repeatedly schedules an action that can't complete, causing --wait to never return (an example is that the cluster will continuously try to reconnect to a gracefully stopped remote node until the remote node comes back or it fails enough times).

The better place to use --wait would be whatever is making the enable/disable calls.

I haven't investigated the logs yet, but I would guess your point 4 is relevant: the RA can call crm_master to influence whether a node can be promoted.

(In reply to Ken Gaillot from comment #5)
> I would advise against the RA using --wait or pcs making --wait the default.
> It waits for the cluster (crmd/pengine) to have no scheduled activity.
> Having the RA wait for cluster inactivity could be a nasty feedback loop.

It's worse than that: it's a 100% guaranteed deadlock unless called from a _recurring_ monitor operation. For every other operation, the command waits for the cluster to be idle, but the cluster cannot be idle until the operation completes.
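To make option 1 concrete, the caller-side pattern recommended here boils down to something like the following sketch. The resource name and the 300-second settle timeout are illustrative values only, and the commands are meant to be run by the external caller (CI script, installer, admin shell), never from inside the resource agent, to avoid the deadlock just described:

```bash
# Disable and re-enable from the caller; each pcs call returns only
# once the cluster has settled (and reports an error if the requested
# state is not reached in time), so the enable cannot race an
# in-flight demote.
pcs resource disable galera --wait=300
pcs resource enable galera --wait=300
```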
> Also, it is possible for the cluster to get into a state where it repeatedly
> schedules an action that can't complete, causing --wait to never return (an
> example is that the cluster will continuously try to reconnect to a
> gracefully stopped remote node until the remote node comes back or it fails
> enough times).
>
> The better place to use --wait would be whatever is making the
> enable/disable calls.

Yes, and hopefully that isn't the RA.

> I haven't investigated the logs yet, but I would guess your point 4 is
> relevant: the RA can call crm_master to influence whether a node can be
> promoted.

(In reply to Damien Ciabrini from comment #4)
> The logic implemented in the Galera resource agent assumes that when a
> "promote" is triggered by pacemaker, one of the conditions below holds:
> . at least one galera node is running ("master" in pacemaker)
> . or no galera node is running and a bootstrap node has been selected
> (attribute stored in the CIB)
>
> If neither condition is true, the resource agent errors out, because:
> . it cannot bootstrap a new cluster, because it doesn't know whether it can
> act as a bootstrap node (it needs to know the seqno on all the nodes to find out)
> . it cannot determine a new bootstrap node during the "promote" op, because
> it has to wait for all nodes to report their seqno status
>
> Since the resource is configured with "op promote on-fail=block", no retry
> is attempted to recover.
>
> Now looking at the logs, when pacemaker is asked to "disable resource
> galera", it first "demotes" all 3 nodes.

By an admin or installer? Or by the agent?

> As soon as the 3 nodes are demoted, pacemaker honours the "pcs resource
> enable" command by scheduling a "promote" on all the nodes.
>
> Andrew, I would have expected that a promote command would only be
> triggered based on the outcome of the resource agent's logic.

This is not the case. It is up to the external entity to ensure a command has been sufficiently enacted before issuing subsequent commands.

> It seems like at this step the "demote" op has succeeded, but the "master"
> status is still set on the node and thus it's eligible for promotion. Is
> the actual behaviour the expected semantics?

Yes.

> I can see a few options to avoid this pitfall:
> 1. make the client use --wait to avoid triggering any race

This ^^^

Like every API call in OSP, they are asynchronous by default. At least pacemaker gives the caller an easy way to wait for them to take effect.

> 2. as pointed out in #c2, make --wait the default for "pcs resource
> disable" (which probably has many side effects)

Dangerous for use in agents.

> 3. change the behaviour of pcs to prevent promotion if a demotion is in
> progress, like "abort, due to demote in progress, use --force to override".

I guess it could infer this by checking for target-role plus the same "is the cluster busy" check that --wait uses. But really, just go with #1.

> 4. ensure that when a "demote" is ok, the node cannot be promoted back to
> master without the agent requesting so (like it's done today with
> $CRM_MASTER -v 100).
>
> At this point, I think either 3 or 4 would make sense.

I don't see why we should push the complexity into pcs or the agent. It's far more reasonable for the caller to be more patient/careful about what instructions they're sending, especially given how easy it is to do so.

Just to add more context on this, the way we got into the first error was while testing a deployed tripleo/HA/Newton env.
Those tests were done using our CI, specifically this [1] project, and if you look at this [2] function you will see that each action played on a resource has a corresponding wait time to reach the desired state. If the status is not reached, then we consider the test a failure.

Given that we hit this just one time and it can be considered a race, and given that maybe something is missing in the way the script handles the resource status (we are treating a multi-state resource), maybe the wait is not the issue we need to check here.

Other info:
1) Before the promotion failure that we see in the output, the galera resource was correctly stopped (look at Jul 26 17:53:59 in the logs, status was Stopped on all the nodes) and after a while (so when the start operation was launched by the tests) some errors are present in the logs.
2) Maybe it's worth having a specific bug for this, but while creating the resource there's also this message in the log:

Jul 26 17:36:41 [38651] overcloud-controller-2 lrmd: notice: operation_finished: galera_start_0:45771:stderr [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]

[1] https://github.com/rscarazz/tripleo-director-ha-test-suite
[2] https://github.com/rscarazz/tripleo-director-ha-test-suite/blob/master/include/functions#L125-L148

Another important piece of additional info is that we were testing the NG architecture for Newton.

We ended up reproducing the error in today's new deployment. The test stops every systemd resource, stops Galera and Rabbitmq, and then starts every systemd resource. The objective is to have every systemd resource started without any problem, even if core services are stopped.

This is the output of the test command:

Thu Jul 28 10:41:22 UTC 2016 - Populationg overcloud elements...OK
Thu Jul 28 10:41:22 UTC 2016 - Test: Stop every systemd resource, stop Galera and Rabbitmq, Start every systemd resource
Thu Jul 28 10:41:22 UTC 2016 * Step 1: disable all the systemd resources
Thu Jul 28 10:41:22 UTC 2016 - Performing action disable on resource mongod .............OK
Thu Jul 28 10:41:37 UTC 2016 - Performing action disable on resource openstack-cinder-volume OK
Thu Jul 28 10:41:37 UTC 2016 - Performing action disable on resource httpd OK
Thu Jul 28 10:41:38 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:38 UTC 2016 * Step 2: disable core services
Thu Jul 28 10:41:38 UTC 2016 - Performing action disable on resource galera OK
Thu Jul 28 10:41:39 UTC 2016 - Performing action disable on resource rabbitmq-clone OK
Thu Jul 28 10:41:39 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:40 UTC 2016 * Step 3: enable each resource one by one and check the status
Thu Jul 28 10:41:40 UTC 2016 - Performing action enable on resource mongod .......OK
Thu Jul 28 10:41:48 UTC 2016 - Performing action enable on resource openstack-cinder-volume OK
Thu Jul 28 10:41:48 UTC 2016 - Performing action enable on resource httpd OK
Thu Jul 28 10:41:48 UTC 2016 - List of cluster's failed actions: Cluster is OK.
Thu Jul 28 10:41:49 UTC 2016 - Waiting 10 seconds to recover environment
Thu Jul 28 10:41:59 UTC 2016 - Recovery: Enable all systemd and core resources, cleanup failed actions
Thu Jul 28 10:41:59 UTC 2016 * Step 1: enable core resources
Thu Jul 28 10:41:59 UTC 2016 - Performing action enable on resource galera OK
Thu Jul 28 10:41:59 UTC 2016 - Performing action enable on resource rabbitmq-clone OK
Thu Jul 28 10:42:00 UTC 2016 * Step 2: enable all the systemd resources
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource mongod OK
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource openstack-cinder-volume OK
Thu Jul 28 10:42:00 UTC 2016 - Performing action enable on resource httpd OK
Thu Jul 28 10:42:01 UTC 2016 * Step 3: Waiting all resources to start ...........................................................Problems during recovery!
Thu Jul 28 10:43:38 UTC 2016 - List of cluster's failed actions: Cluster has failed actions: galera
Recovery /home/heat-admin/tripleo-director-ha-test-suite/recovery/recovery_pacemaker-light FAILED!

So basically, the systemd resources stop fine (step 1), the core resources too (step 2), and the systemd resources start again fine (step 3). Up to here we're OK and the test part is done; what breaks is the recovery part, in which the enablement of the core resources fails because of galera.

But we need to add a consideration here: step 1 of the recovery in theory WAITS for the resource to reach the desired status, so in this case we have two options:
1) There's something wrong with the way the script calculates galera status;
2) Galera breaks after starting correctly.
We are doing tests to choose between these two.

(In reply to Raoul Scarazzini from comment #8)
> Just to add more context on this, the way we got into the first error was
> while testing a deployed tripleo/HA/Newton env.
> Those tests were done using our CI, specifically this [1] project, and if
> you look at this [2] function you will see that each action played on a
> resource has a corresponding wait time to reach the desired state.

wait_resource_status() is wrong.

It will incorrectly return as soon as the service is Stopped/Started on _any_ node, not _all_ nodes.

Depending on the setup, this could be immediately, before _any_ action has been performed... all you need is a compute node under control of pacemaker remote (which never runs the service) or a control node with 'pcs resource ban' for that resource.

Please use --wait when enabling and disabling resources.

(In reply to Andrew Beekhof from comment #10)
> (In reply to Raoul Scarazzini from comment #8)
> > Just to add more context on this, the way we got into the first error was
> > while testing a deployed tripleo/HA/Newton env.
> > Those tests were done using our CI, specifically this [1] project, and if
> > you look at this [2] function you will see that each action played on a
> > resource has a corresponding wait time to reach the desired state.
>
> wait_resource_status() is wrong.
> It will incorrectly return as soon as the service is Stopped/Started on
> _any_ node, not _all_ nodes.

I agree that using --wait would be better, but wait_resource_status is doing the right thing in my opinion. It waits until either:
1) the timeout is reached; or
2) there is no output, which means that the desired status is reached on *all* the nodes, not just one of them. If there is at least one node with a different status, the output will not be empty and the check becomes a FAILURE (once the timeout is reached).

Do you agree? Am I missing something?
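For context, the style of polling wait being debated here looks roughly like the sketch below. The function name, the use of "pcs status resources", and the string matching are illustrative assumptions, not the real wait_resource_status; the caveat comment also shows the kind of false positive discussed in the thread.

```bash
#!/bin/bash
# Illustrative polling wait: succeed only when no line for the resource
# reports a state other than the desired one.
wait_resource_state() {
    local resource="$1" desired="$2" timeout="${3:-120}"
    local elapsed=0 pending
    while [ "$elapsed" -lt "$timeout" ]; do
        # Lines mentioning the resource but not the desired state mean
        # at least one node has not converged yet.
        pending=$(pcs status resources | grep -w "$resource" | grep -v "$desired")
        if [ -z "$pending" ]; then
            # Caveat: if the resource produces no matching lines at all
            # (e.g. a master/slave set printed under a different name),
            # this passes vacuously -- the kind of false positive that
            # made master/slave resources need special handling.
            return 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}

# Example: wait_resource_state galera Started 300
```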
In addition: the only problem with wait_resource_status was that Master/Slave resources were not considered (now I have patched the functions, see here https://github.com/rscarazz/tripleo-director-ha-test-suite/blob/master/include/functions#L78), and this was causing the exit status to always be 0 for this kind of resource (galera), causing the problem that started this bug.

> Depending on the setup, this could be immediately, before _any_ action has
> been performed... all you need is a compute node under control of pacemaker
> remote (which never runs the service) or a control node with 'pcs resource
> ban' for that resource.
>
> Please use --wait when enabling and disabling resources.

What concerns me about using --wait (even if I agree that it seems a more reliable approach) is that it also includes all the dependent resources in the wait time. The function wait_resource_status looks just at the desired resource and exits once it reaches the status; but in any case the overall cluster status is checked (if you look at the tests there's always a check_failed_actions after every action), so if a dependent resource has problems the test will fail anyhow.

(In reply to Raoul Scarazzini from comment #11)
> (In reply to Andrew Beekhof from comment #10)
> > (In reply to Raoul Scarazzini from comment #8)
> > > Just to add more context on this, the way we got into the first error was
> > > while testing a deployed tripleo/HA/Newton env.
> > > Those tests were done using our CI, specifically this [1] project, and if
> > > you look at this [2] function you will see that each action played on a
> > > resource has a corresponding wait time to reach the desired state.
> >
> > wait_resource_status() is wrong.
> > It will incorrectly return as soon as the service is Stopped/Started on
> > _any_ node, not _all_ nodes.
>
> I agree that using --wait would be better, but wait_resource_status is
> doing the right thing in my opinion,

It clearly wasn't, otherwise this bug wouldn't exist. You've already had to update it this weekend to support master/slave. However, the race conditions remain, so the function cannot be considered reliable.

> What concerns me about using --wait (even if I agree that it seems a more
> reliable approach) is that it also includes all the dependent resources in
> the wait time.
> The function wait_resource_status looks just at the desired resource and
> exits once it reaches the status; but in any case the overall cluster
> status is checked (if you look at the tests there's always a
> check_failed_actions after every action), so if a dependent resource has
> problems the test will fail anyhow.

A cluster that can't reach a stable state has a problem that should be investigated. Failing the test is the right thing to do.

So heads up: I think what confused the resource agent is that pacemaker is allowed to promote the resource right after it has been demoted, without going through a start or monitor op first.

The agent wrongly assumes that it is the only one which can promote a resource by raising its master score. During a demote, the master score is not cleared; the agent lets pacemaker clear it after the demote op finishes. However, pacemaker can be asked to promote the resource right after a demote, which is likely what was triggered in this bz. In such a situation, pacemaker is allowed to perform the promotion since the master score is still set.

Changing the resource agent so that it clears the master score should fix the issue.
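As an illustration of that kind of change, here is a minimal sketch of a demote-path helper that drops the node's master score, assuming the agent defines CRM_MASTER along the lines of the "$CRM_MASTER -v 100" convention quoted earlier; the authoritative change is the upstream pull request referenced in the next comment, not this snippet.

```bash
# Assumed helper, mirroring how the agent sets the master score elsewhere.
CRM_MASTER="crm_master -l reboot"

clear_master_score() {
    # Delete this node's master score instead of leaving a stale value,
    # so pacemaker cannot schedule a promote right after the demote
    # until the agent explicitly re-grants eligibility.
    $CRM_MASTER -D
}

# e.g. invoked at the end of a successful demote, before returning
# $OCF_SUCCESS.
```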
Proposed fix in https://github.com/ClusterLabs/resource-agents/pull/885 has been merged upstream.

Verified on: resource-agents-3.9.5-84.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844

*** Bug 1380451 has been marked as a duplicate of this bug. ***