Sadly, restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set the resource's target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2; the difference is the set of resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1

Obviously, if multiple restarts are happening simultaneously, or anything else is causing resources to start or stop, lists #1 and #2 are likely to be affected by the other activity, and the command will wait for the wrong conditions.

The easiest fix from the Pacemaker perspective is to document the limitation ;)

I would recommend this instead:

    pcs resource disable $RSC --wait
    pcs resource enable $RSC --wait

That will restart the resource, but it waits only until the cluster settles, rather than waiting for specific resources to stop and start. The downside is that it will succeed as long as the cluster settles, even if not everything comes back up.

We could potentially implement a new option for restarts to behave like that instead of the current implementation.
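
To make the failure mode concrete, here is a rough Python sketch of the list comparison described in the steps above. It is not Pacemaker's code: the function names and the resource names (db, web, vip) are made up for illustration, and it assumes "matches" means exact set equality.

```python
# Toy model of the restart bookkeeping; not Pacemaker's implementation.
# All names here (plan_restart, db, web, vip) are hypothetical.

def plan_restart(active_before, should_be_active_while_stopped):
    """Build the wait conditions from the two snapshots."""
    list1 = frozenset(active_before)                   # list #1
    list2 = frozenset(should_be_active_while_stopped)  # list #2
    return {
        "to_stop": list1 - list2,    # resources expected to stop
        "stopped_when": list2,       # first wait: active == list #2
        "restarted_when": list1,     # final wait: active == list #1
    }

def stop_phase_done(plan, active_now):
    # Assumption: "matches" is treated as exact set equality.
    return frozenset(active_now) == plan["stopped_when"]

def restart_done(plan, active_now):
    return frozenset(active_now) == plan["restarted_when"]

healthy = {"db", "web", "vip"}

# Quiet cluster: restarting db also takes down web, which depends on it.
quiet = plan_restart(healthy, {"vip"})

# Racy cluster: vip was briefly down because of another restart when
# both snapshots were taken, so it is missing from both lists.
racy = plan_restart({"db", "web"}, set())

print(sorted(quiet["to_stop"]))     # resources the restart waits to stop
print(restart_done(quiet, healthy)) # final condition can be met
print(restart_done(racy, healthy))  # final condition is never met
```

In the racy case the cluster ends up fully healthy, yet the final wait condition never holds, because list #1 was captured while vip was down. The command would keep polling until it times out.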
Thanks a lot, Ken! I think we have had enough successful runs with disable/enable in place of the restarts that we are fairly confident in the new approach. Moving this over to us so we can track the fix for OSP.
*** Bug 1855070 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284