Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1624441

Summary: OSP13: pacemaker reports failed status of remote pcmk resource upon fresh overcloud deploy
Product: Red Hat OpenStack Reporter: Marian Krcmarik <mkrcmari>
Component: puppet-tripleo Assignee: Michele Baldessari <michele>
Status: CLOSED ERRATA QA Contact: Marian Krcmarik <mkrcmari>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: bshephar, chjones, dhill, dsanzmor, jjoyce, jmelvin, jschluet, j.thadden, michele, mkrcmari, pkomarov, rajsingh, slinaber, tvignaud, yocha
Target Milestone: z5 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: puppet-tripleo-8.3.6-14.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1661551 (view as bug list) Environment:
Last Closed: 2019-03-14 13:54:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1661551, 1714843    

Description Marian Krcmarik 2018-08-31 15:12:49 UTC
Description of problem:
If OSP13 (not observed on older releases) is deployed with pacemaker remote nodes, e.g. for Instance HA compute nodes or galera/rabbitmq nodes, then pacemaker reports all the pcmk remote resources as Failed. In the end each resource is Started and working as expected, but the error messages remain in the output of pacemaker status. It is not clear whether there are any other consequences (none observed yet), but it can possibly confuse customers.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Always on OSP13

Steps to Reproduce:
1. Deploy Openstack with remote pacemaker nodes for composable roles such as Database, Messaging or IHA compute

Actual results:
pcs status reports something like:
Failed Actions:
* database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:38:22 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-1 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:40:31 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-1 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:42:39 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-1 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:44:36 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-1 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:46:44 2018', queued=0ms, exec=0ms
* database-1_start_0 on controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:39:20 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-2 'unknown error' (1): call=6, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:41:29 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-2 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:43:36 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-2 'unknown error' (1): call=12, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:45:34 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-2 'unknown error' (1): call=15, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:47:42 2018', queued=0ms, exec=0ms
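These entries are stale records of the initial start timeouts rather than live failures; assuming the remotes end up Started (as the report notes), they can be cleared with `pcs resource cleanup`. A minimal sketch, with the resource names taken from the output above (the command is only echoed when pcs is not installed, so the list can be reviewed on a non-cluster host):

```shell
# Clear the stale failure records once the remotes show as Started.
# Run from any full cluster member; names match "Failed Actions" above.
CLEANED=""
for r in database-1 database-2 messaging-0 messaging-1 messaging-2; do
    if command -v pcs >/dev/null 2>&1; then
        pcs resource cleanup "$r"
    else
        echo "would run: pcs resource cleanup $r"
    fi
    CLEANED="$CLEANED $r"
done
```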

Expected results:
No error messages in the pcs status output

Additional info:

Comment 1 Michele Baldessari 2018-09-04 15:33:59 UTC
So there are fundamentally two main areas of investigation here:
A) The case of pacemaker remote on IHA compute nodes 
B) The case of pacemaker remote nodes used in composable HA (for example a database role that uses PacemakerRemote in lieu of Pacemaker)

For case A) I think the reason we sometimes see the timeouts (although to a certain degree some *are* expected, because pcmk_remote starts on the compute node during the same deployment step that creates the cluster) is the following change: https://review.openstack.org/#/c/569565/

In that change we slightly tweaked the ordering of authkey creation / cluster start. While the fix was very much needed (otherwise things *could* fail from time to time), I did observe a bunch of spurious TLS authentication failures during a remote setup, which implies that there is still something amiss.
Things eventually work because pacemaker will reread the key from /etc/pacemaker/authkey after a certain amount of time.
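Since pacemaker re-reads /etc/pacemaker/authkey, one way to confirm the remotes eventually hold the same key as the cluster nodes is to compare checksums. A minimal local sketch of that comparison follows; on a real deployment the two checksums would come from running `sudo md5sum /etc/pacemaker/authkey` on a controller and on the remote node, while here a temporary file stands in for the real key so the logic is runnable:

```shell
# Sketch of the authkey consistency check. On a real deployment:
#   controller$ sudo md5sum /etc/pacemaker/authkey
#   remote$     sudo md5sum /etc/pacemaker/authkey
KEY=$(mktemp)
dd if=/dev/urandom of="$KEY" bs=1k count=4 2>/dev/null   # the authkey is just random bytes
LOCAL_SUM=$(md5sum "$KEY" | awk '{print $1}')
REMOTE_SUM=$(md5sum "$KEY" | awk '{print $1}')           # in reality: obtained via ssh to the remote
if [ "$LOCAL_SUM" = "$REMOTE_SUM" ]; then
    echo "authkey consistent"
else
    echo "authkey mismatch: remote may still fail TLS auth"
fi
rm -f "$KEY"
```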


Case B) might be very related to case A) as well, but I have not investigated it sufficiently yet to be able to make any claims.

Comment 18 errata-xmlrpc 2019-03-14 13:54:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0448

Comment 20 Red Hat Bugzilla 2023-09-15 01:27:38 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 365 days.