Bug 1868113

Summary: concurrent pcs resource restarts sometimes fail
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Keywords: Triaged
Reporter: Michele Baldessari <michele>
Assignee: Michele Baldessari <michele>
QA Contact: pkomarov
CC: cluster-maint, jpretori, lmiccini, mburns
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170155.29a02c1.el8ost
Type: Bug
Last Closed: 2020-10-28 15:38:46 UTC

Comment 1 Ken Gaillot 2020-08-17 21:47:29 UTC
Sadly, restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1

Obviously if multiple restarts are happening simultaneously, or anything else is causing resources to start or stop, lists #1 and #2 are likely to be affected by the other activity, and the command will wait for the wrong conditions.
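
For illustration, here is a rough sketch of those steps using standalone commands (this is not the actual pcs implementation; $RSC is a placeholder resource name, and the sketch assumes pcs deletes a meta attribute when given an empty value):

   crm_mon -1                                  # snapshot of currently active resources (list #1)
   pcs resource meta $RSC target-role=Stopped  # tell the cluster the resource should stop
   # pcs then works out which resources should now be active (list #2)
   # and polls the cluster state until the active resources match it
   pcs resource meta $RSC target-role=         # delete target-role so the resource may start again
   # pcs polls again until the active resources match list #1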

The easiest fix from the pacemaker perspective is to document the limitation ;)

I would recommend this instead:

   pcs resource disable $RSC --wait
   pcs resource enable $RSC --wait

That will restart the resource, but it waits only until the cluster settles rather than for specific resources to stop and start. The downside is that it will succeed as long as the cluster settles, even if not everything comes back up.
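
One possible way for callers to close that gap (a sketch on the caller's side, not something pcs provides; $RSC is again a placeholder resource name):

   pcs resource disable $RSC --wait
   pcs resource enable $RSC --wait
   # --wait only waits for the cluster to settle, so explicitly check
   # that the resource actually reports as running again
   pcs status resources | grep $RSC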

We could potentially implement a new restart option that behaves like that instead of the current implementation.

Comment 2 Michele Baldessari 2020-08-19 14:45:45 UTC
Thanks a lot Ken!

I think we have had a sufficient number of successful runs with disable/enable in place of the restarts that we are fairly confident in the new approach. Moving this into our lap so we can track the fix for OSP.

Comment 4 Michele Baldessari 2020-08-25 07:24:04 UTC
*** Bug 1855070 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2020-10-28 15:38:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284