Bug 1624441 - OSP13: pacemaker reports failed status of remote pcmk resource upon fresh overcloud deploy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z5
Target Release: 13.0 (Queens)
Assignee: Michele Baldessari
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On:
Blocks: 1661551 1714843
Reported: 2018-08-31 15:12 UTC by Marian Krcmarik
Modified: 2024-10-01 16:10 UTC
CC List: 15 users

Fixed In Version: puppet-tripleo-8.3.6-14.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1661551
Environment:
Last Closed: 2019-03-14 13:54:51 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1807906 0 None None None 2018-12-11 10:59:47 UTC
OpenStack gerrit 624349 0 None MERGED Fix up ordering of remote authkeys and a couple of pcs commands 2020-12-17 00:41:07 UTC
OpenStack gerrit 624352 0 None MERGED Make sure we do not match multiple remotes when waiting for them 2020-12-17 00:41:07 UTC
Red Hat Issue Tracker OSP-6371 0 None None None 2022-08-10 16:43:59 UTC
Red Hat Product Errata RHBA-2019:0448 0 None None None 2019-03-14 13:55:04 UTC

Description Marian Krcmarik 2018-08-31 15:12:49 UTC
Description of problem:
If OSP13 (not observed on older releases) is deployed with pacemaker remote nodes, e.g. for Instance HA compute nodes or galera/rabbitmq nodes, then pacemaker reports all of the pcmk remote resources as Failed. In the end each resource is Started and working as expected, but the error messages remain in the output of pacemaker status. It is not clear whether there are any other consequences (none observed yet), but the messages could confuse customers.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Always on OSP13

Steps to Reproduce:
1. Deploy OpenStack with remote pacemaker nodes for composable roles such as Database, Messaging or IHA compute (a deploy sketch follows below)
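
For reference, a minimal deploy sketch for the Instance HA case; the environment file path is the one shipped in tripleo-heat-templates, and the trailing placeholder stands for whatever other environment files the deployment needs:

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/compute-instanceha.yaml \
  <additional environment files>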

Actual results:
pcs status reports something like:
Failed Actions:
* database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:38:22 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-1 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:40:31 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-1 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:42:39 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-1 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:44:36 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-1 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:46:44 2018', queued=0ms, exec=0ms
* database-1_start_0 on controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:39:20 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-2 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:41:29 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-2 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:43:36 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-2 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:45:34 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-2 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:47:42 2018', queued=0ms, exec=0ms
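
Note that the remotes do end up Started; only the historical failed start operations linger in the status output. As a workaround sketch (not from the report itself), such stale failure entries can normally be cleared with pcs:

# Clear the recorded failed start op for one remote resource;
# repeat per resource, or run without a name to clean all of them.
pcs resource cleanup database-1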

Expected results:
No error messages in the pcs status output.

Additional info:

Comment 1 Michele Baldessari 2018-09-04 15:33:59 UTC
So there are fundamentally two main areas of investigation here:
A) The case of pacemaker remote on IHA compute nodes
B) The case of pacemaker remote nodes used in composable HA (for example a database role that uses PacemakerRemote in lieu of Pacemaker)

For case A) I think the reason we sometimes see the timeouts (although to a certain degree some *are* expected, because pcmk_remote starts on the compute node during the same deployment step that creates the cluster) is the following change: https://review.openstack.org/#/c/569565/

In that change we slightly tweaked the ordering of authkey creation / cluster start. While the fix was very much needed (otherwise things *could* fail from time to time), I did observe a number of spurious TLS authentication failures during remote setup, which implies that there is still something amiss.
Things eventually work because pacemaker will reread the key from /etc/pacemaker/authkey after a certain amount of time.
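
To make the ordering concrete, a minimal shell sketch of the sequence the fix needs to enforce (illustrative only; the real change is expressed as puppet-tripleo ordering constraints, and the local "authkey" file below just stands in for however the key actually gets distributed):

# The shared key must be in place before pacemaker_remote starts,
# otherwise the cluster's first TLS handshake to the remote fails
# and the <node>_start_0 op times out until the key is reread.
install -d -m 0750 -o hacluster -g haclient /etc/pacemaker
install -m 0640 -o hacluster -g haclient authkey /etc/pacemaker/authkey
systemctl start pacemaker_remote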


Case B) might well be related to case A), but I have not investigated it sufficiently yet to be able to make any claim.
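
Of note among the linked fixes: gerrit 624352 ("Make sure we do not match multiple remotes when waiting for them") addresses the wait logic rather than the root cause. A hypothetical sketch of the loose-match problem it guards against, using made-up resource names:

# While waiting for a remote to report Started, an unanchored grep
# can match more than one resource:
pcs status | grep "database-1"      # also matches database-10, database-11, ...
pcs status | grep -w "database-1"   # word-bounded: matches database-1 only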

Comment 18 errata-xmlrpc 2019-03-14 13:54:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0448

Comment 20 Red Hat Bugzilla 2023-09-15 01:27:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

