Bug 1868113 - concurrent pcs resource restarts sometimes fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Michele Baldessari
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1855070
Depends On:
Blocks:
Reported: 2020-08-11 17:44 UTC by Michele Baldessari
Modified: 2020-10-28 15:39 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170155.29a02c1.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:38:46 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID              Private  Priority  Status  Summary                                                 Last Updated
Launchpad                1892206         0        None      None    None                                                    2020-08-19 15:01:51 UTC
OpenStack gerrit         746662          0        None      MERGED  Fix pcs restart in composable HA                        2020-10-19 16:35:53 UTC
OpenStack gerrit         746937          0        None      MERGED  Drop bootstrap_host_exec from pacemaker_restart_bundle  2020-10-19 16:35:53 UTC
OpenStack gerrit         746957          0        None      MERGED  Fix HA resource restart when no replicas are running    2020-10-19 16:35:53 UTC
Red Hat Product Errata   RHEA-2020:4284  0        None      None    None                                                    2020-10-28 15:39:12 UTC

Comment 1 Ken Gaillot 2020-08-17 21:47:29 UTC
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1

Obviously if multiple restarts are happening simultaneously, or anything else is causing resources to start or stop, lists #1 and #2 are likely to be affected by the other activity, and the command will wait for the wrong conditions.
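
The race can be illustrated with a toy model of the steps above. This is only a sketch: it makes no Pacemaker calls, the resource names "a" and "b" and the subtract helper are made up for illustration, and the list comparison is an assumption about how "matches" is evaluated rather than the actual implementation.

```shell
#!/bin/bash
# Toy model only: no Pacemaker calls; resource names "a" and "b" are hypothetical.
# Lists are space-separated resource names.

# subtract LIST1 LIST2: print the items of LIST1 that are not in LIST2.
subtract() {
    local item out=""
    for item in $1; do
        case " $2 " in
            *" $item "*) ;;              # present in LIST2, skip it
            *) out="$out $item" ;;
        esac
    done
    echo "${out# }"
}

# The restart of "a" begins while both resources are active (list #1).
list1="a b"

# The restart of "a" sets target-role=Stopped on "a". A concurrent restart of
# "b" has done the same for "b", so the "should be active" snapshot (list #2)
# is empty.
list2=""

# The difference is what the restart of "a" believes it has to wait on.
to_stop=$(subtract "$list1" "$list2")
echo "restart of a waits for: $to_stop"

# The restart of "b" finishes first, so "b" is active again while the restart
# of "a" is still waiting for the active list to match list #2.
active="b"
still_waiting=$(subtract "$active" "$list2")
echo "still outside list #2: $still_waiting"
```

In this model the restart of "a" ends up waiting on "b" as well, and keeps waiting once "b" has been started again by the other restart, which is the "wrong conditions" case described above.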

The easiest fix from the pacemaker perspective is to document the limitation ;)

I would recommend this instead:

   pcs resource disable $RSC --wait
   pcs resource enable $RSC --wait

That will restart the resource, but it waits only until the cluster settles rather than for specific resources to stop and start. The downside is that it will succeed as long as the cluster settles, even if not everything comes back up.
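
For scripted use, the two commands could be wrapped roughly as follows. This is a minimal sketch, not the fix that was merged: restart_resource and the PCS override variable are hypothetical names, and the only pcs invocations used are the two shown above.

```shell
#!/bin/bash
# Sketch only: restart_resource and the PCS variable are hypothetical names.
# PCS can be overridden (for example for a dry run); it defaults to pcs.
PCS="${PCS:-pcs}"

restart_resource() {
    local rsc="$1"
    # Stop the resource and wait for the cluster to settle.
    "$PCS" resource disable "$rsc" --wait || return 1
    # Start it again and wait for the cluster to settle once more.
    "$PCS" resource enable "$rsc" --wait || return 1
}
```

A caller would run restart_resource "$RSC" and check the exit status. Because success only means the cluster settled, a caller that needs certainty would still have to verify afterwards that the resource is actually running. If the disable step fails, the function returns early and the resource may be left disabled.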

We could potentially implement a new option for restarts that behaves like that instead of the current implementation.

Comment 2 Michele Baldessari 2020-08-19 14:45:45 UTC
Thanks a lot Ken!

I think we have had enough successful runs with disable/enable in place of the restarts that we are fairly confident in the new approach. Moving this into our lap so we can track the fix for OSP.

Comment 4 Michele Baldessari 2020-08-25 07:24:04 UTC
*** Bug 1855070 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2020-10-28 15:38:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

