Bug 1868113

Summary: concurrent pcs resource restarts sometimes fail
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Keywords: Triaged
Reporter: Michele Baldessari <michele>
Assignee: Michele Baldessari <michele>
QA Contact: pkomarov
CC: cluster-maint, jpretori, lmiccini, mburns
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170155.29a02c1.el8ost
Type: Bug
Last Closed: 2020-10-28 15:38:46 UTC

Comment 1 Ken Gaillot 2020-08-17 21:47:29 UTC
Sadly, restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1

Obviously if multiple restarts are happening simultaneously, or anything else is causing resources to start or stop, lists #1 and #2 are likely to be affected by the other activity, and the command will wait for the wrong conditions.
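
For illustration, here is a rough sketch of those steps using standalone commands (this is not the actual pcs implementation; $RSC is a placeholder resource name, and the sketch assumes pcs deletes a meta attribute when given an empty value):

   crm_mon -1                                  # snapshot of currently active resources (list #1)
   pcs resource meta $RSC target-role=Stopped  # tell the cluster the resource should stop
   # pcs then works out which resources should now be active (list #2)
   # and polls the cluster state until the active resources match it
   pcs resource meta $RSC target-role=         # delete target-role so the resource may start again
   # pcs polls again until the active resources match list #1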

The easiest fix from the pacemaker perspective is to document the limitation ;)

I would recommend this instead:

   pcs resource disable $RSC --wait
   pcs resource enable $RSC --wait

That will restart the resource, but it waits only until the cluster settles rather than for specific resources to stop and start. The downside is that it will succeed as long as the cluster settles, even if not everything comes back up.
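
One possible way for callers to close that gap (a sketch on the caller's side, not something pcs provides; $RSC is again a placeholder resource name):

   pcs resource disable $RSC --wait
   pcs resource enable $RSC --wait
   # --wait only waits for the cluster to settle, so explicitly check
   # that the resource actually reports as running again
   pcs status resources | grep $RSC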

We could potentially implement a new restart option that behaves like that instead of the current implementation.

Comment 2 Michele Baldessari 2020-08-19 14:45:45 UTC
Thanks a lot Ken!

I think we have had a sufficient number of successful runs with disable/enable in place of the restarts that we are fairly confident in the new approach. Moving this into our lap so we can track the fix for OSP.

Comment 4 Michele Baldessari 2020-08-25 07:24:04 UTC
*** Bug 1855070 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2020-10-28 15:38:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284