1624441 – OSP13: pacemaker reports failed status of remote pcmk resource upon fresh overcloud deploy

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-tripleo
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z5
Target Release:	13.0 (Queens)
Assignee:	Michele Baldessari
QA Contact:	Marian Krcmarik
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1661551 1714843

Reported:	2018-08-31 15:12 UTC by Marian Krcmarik
Modified:	2024-10-01 16:10 UTC
CC List:	15 users
Fixed In Version:	puppet-tripleo-8.3.6-14.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1661551
Environment:
Last Closed:	2019-03-14 13:54:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1807906	None	None	None	2018-12-11 10:59:47 UTC
OpenStack gerrit	624349	'None'	MERGED	Fix up ordering of remote authkeys and a couple of pcs commands	2020-12-17 00:41:07 UTC
OpenStack gerrit	624352	'None'	MERGED	Make sure we do not match multiple remotes when waiting for them	2020-12-17 00:41:07 UTC
Red Hat Issue Tracker	OSP-6371	None	None	None	2022-08-10 16:43:59 UTC
Red Hat Product Errata	RHBA-2019:0448	None	None	None	2019-03-14 13:55:04 UTC

Description Marian Krcmarik 2018-08-31 15:12:49 UTC

Description of problem:
If OSP13 is deployed with pacemaker remote nodes (for example Instance HA compute nodes or galera/rabbitmq nodes), pacemaker reports all the pcmk remote resources as Failed. This was not observed on older releases. In the end the resources are Started and working as expected, but the error messages remain in the output of pacemaker status. I am not sure whether there are any other consequences (none observed yet), but the messages could confuse customers.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Always on OSP13

Steps to Reproduce:
1. Deploy OpenStack with remote pacemaker nodes for composable roles such as Database, Messaging or IHA compute

Actual results:
pcs status reports something like:
Failed Actions:
* database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:38:22 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-1 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:40:31 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-1 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:42:39 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-1 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:44:36 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-1 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:46:44 2018', queued=0ms, exec=0ms
* database-1_start_0 on controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:39:20 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-2 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:41:29 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-2 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:43:36 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-2 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:45:34 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-2 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:47:42 2018', queued=0ms, exec=0ms
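The failed actions above can be summarized per resource and node. The following is only a minimal sketch: it assumes the `pcs status` text has already been captured into a string and that the lines follow the layout shown in this report. The function names `parse_failed_actions` and `failed_resources` are hypothetical, not part of pcs or pacemaker.

```python
import re

# Matches the first line of each entry under "Failed Actions:", for example:
# * database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='',
# The layout is assumed from the output quoted in this report.
FAILED_ACTION = re.compile(
    r"^\*\s+(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval>\d+)"
    r"\s+on\s+(?P<node>\S+)\s+'(?P<reason>[^']*)'\s+\((?P<rc>\d+)\):"
    r".*?status=(?P<status>[^,]+),"
)


def parse_failed_actions(status_text):
    """Return one dict per failed action found in the given status text."""
    failures = []
    for line in status_text.splitlines():
        match = FAILED_ACTION.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


def failed_resources(status_text):
    """Return the sorted, de-duplicated names of resources with failed actions."""
    return sorted({f["resource"] for f in parse_failed_actions(status_text)})
```

Once a resource is confirmed to be Started, the stale entries for it can usually be cleared with `pcs resource cleanup <resource>`. That only removes the recorded failure history; it does not address the cause of the timeouts.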

Expected results:
No error messages

Additional info:

Comment 1 Michele Baldessari 2018-09-04 15:33:59 UTC

So there are fundamentally two main areas of investigation here:
A) The case of pacemaker remote on IHA compute nodes
B) The case of pacemaker remote nodes used in composable HA (for example, a database role that uses PacemakerRemote in lieu of Pacemaker)

For case A) I think the reason we sometimes see the timeouts is the following change: https://review.openstack.org/#/c/569565/
To a certain degree some timeouts *are* expected, because pcmk_remote starts on the compute node during the same deployment step that creates the cluster.

In that change we slightly tweaked the ordering of authkey creation / cluster start. While the fix was very much needed (otherwise things *could* fail from time to time), I did observe a bunch of spurious TLS authentication failures during a remote setup, which implies that there is still something amiss.
Things eventually work because pacemaker will reread the key from /etc/pacemaker/authkey after a certain amount of time.
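One way to check whether a key mismatch is behind the TLS authentication failures is to compare a digest of /etc/pacemaker/authkey across the cluster and remote nodes. That a mismatch is the cause is an assumption here; the comment above does not confirm it. The sketch below is illustrative only: `authkey_fingerprint` and `mismatched_nodes` are hypothetical helper names, and collecting the digests from each node (for example over ssh) is left out.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Path taken from the comment above; it is read locally on each node.
AUTHKEY_PATH = "/etc/pacemaker/authkey"


def authkey_fingerprint(path=AUTHKEY_PATH):
    """Return the SHA-256 hex digest of the authkey file at the given path."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_nodes(fingerprints):
    """Given {node_name: digest}, return the nodes whose digest differs
    from the most common digest (assumed here to be the correct key)."""
    if not fingerprints:
        return []
    majority_digest, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(
        node for node, digest in fingerprints.items() if digest != majority_digest
    )
```

Comparing digests rather than the key itself avoids copying the secret between hosts. A node that still differs some time after deployment would be a candidate for further investigation.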


Case B) might well be related to case A), but I have not investigated it sufficiently yet to be able to make any claim.

Comment 18 errata-xmlrpc 2019-03-14 13:54:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0448

Comment 20 Red Hat Bugzilla 2023-09-15 01:27:38 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days