Description of problem:
If OSP13 is deployed with pacemaker remote nodes (e.g. Instance HA compute nodes, galera/rabbitmq nodes), pacemaker reports all the pcmk remote resources as Failed. This was not observed on older releases. In the end the resources are Started and working as expected, but the error messages remain in the output of pacemaker status. I am not sure whether there are any other consequences (none observed yet), but it could confuse customers.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Always on OSP13

Steps to Reproduce:
1. Deploy OpenStack with remote pacemaker nodes for composable roles such as Database, Messaging or IHA compute

Actual results:
pcs status reports something like:

Failed Actions:
* database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:38:22 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-1 'unknown error' (1): call=6, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:40:31 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-1 'unknown error' (1): call=9, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:42:39 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-1 'unknown error' (1): call=12, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:44:36 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-1 'unknown error' (1): call=15, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:46:44 2018', queued=0ms, exec=0ms
* database-1_start_0 on controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:39:20 2018', queued=0ms, exec=0ms
* database-2_start_0 on controller-2 'unknown error' (1): call=6, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:41:29 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-2 'unknown error' (1): call=9, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:43:36 2018', queued=0ms, exec=0ms
* messaging-1_start_0 on controller-2 'unknown error' (1): call=12, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:45:34 2018', queued=0ms, exec=0ms
* messaging-2_start_0 on controller-2 'unknown error' (1): call=15, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:47:42 2018', queued=0ms, exec=0ms

Expected results:
No error messages

Additional info:
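To make the pattern in the output above easier to see, here is a minimal Python sketch that parses a "Failed Actions" section and groups the timed-out start operations by resource. The regular expression is written against the entry format shown in this report only; other pacemaker versions may print entries differently. The function names and the SAMPLE text (a shortened copy of the output above) are illustrative assumptions, not part of any pacemaker or pcs tooling.

```python
import re
from collections import defaultdict

# Matches one "Failed Actions" entry in the format shown in this report.
# Assumption: the operation key looks like <resource>_<op>_<interval>.
FAILED_ACTION = re.compile(
    r"\*\s+(?P<resource>\S+?)_(?P<op>start|stop|monitor)_(?P<interval>\d+)"
    r"\s+on\s+(?P<node>\S+)\s+'(?P<reason>[^']*)'\s+\((?P<rc>\d+)\):"
    r".*?status=(?P<status>[^,]+),",
    re.DOTALL,  # tolerate an entry that is wrapped over two lines
)


def parse_failed_actions(text):
    """Return one dict per failed action found in the given status text."""
    return [m.groupdict() for m in FAILED_ACTION.finditer(text)]


def timed_out_starts(text):
    """Map each resource to the nodes where its start operation timed out."""
    by_resource = defaultdict(list)
    for action in parse_failed_actions(text):
        if action["op"] == "start" and action["status"].strip() == "Timed Out":
            by_resource[action["resource"]].append(action["node"])
    return dict(by_resource)


# Shortened copy of the output from this report (hypothetical test input).
SAMPLE = """Failed Actions:
* database-1_start_0 on controller-1 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:38:22 2018', queued=0ms, exec=0ms
* database-1_start_0 on controller-2 'unknown error' (1): call=3, status=Timed Out, exitreason='', last-rc-change='Sat Aug 18 10:39:20 2018', queued=0ms, exec=0ms
* messaging-0_start_0 on controller-2 'unknown error' (1): call=9, status=Timed Out, exitreason='',
    last-rc-change='Sat Aug 18 10:43:36 2018', queued=0ms, exec=0ms
"""

if __name__ == "__main__":
    for resource, nodes in sorted(timed_out_starts(SAMPLE).items()):
        print(f"{resource}: start timed out on {', '.join(nodes)}")
```

In practice the input text would be the captured output of pcs status; the sketch deliberately does not run any command itself.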
So there are fundamentally two main areas of investigation here:
A) The case of pacemaker remote on IHA compute nodes
B) The case of pacemaker remote nodes used in composable HA (for example a database role that uses PacemakerRemote in lieu of Pacemaker)

For case A) I think the reason we sometimes see the timeouts (although to a certain degree some *are* expected, because pcmk_remote starts on the compute node during the same deployment step that creates the cluster) is the following change:
https://review.openstack.org/#/c/569565/

In that change we slightly tweaked the ordering of authkey creation / cluster start. While the fix was very much needed (otherwise things *could* fail from time to time), I did observe a bunch of spurious TLS authentication failures during a remote setup, which implies that there is still something amiss. Things eventually work because pacemaker will reread the key from /etc/pacemaker/authkey after a certain amount of time.

Case B) might be closely related to case A) as well, but I have not investigated it sufficiently yet to be able to make any claim.
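One way to check whether a key mismatch is involved would be to compare a digest of /etc/pacemaker/authkey across the cluster and remote nodes. The following Python sketch only shows the comparison step. It assumes that a missing or differing key is what causes the TLS authentication failures, which this comment has not confirmed, and it leaves out how the digests are collected from each node. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Key file path as named in the comment above.
AUTHKEY_PATH = Path("/etc/pacemaker/authkey")


def authkey_digest(path=AUTHKEY_PATH):
    """Return the SHA-256 hex digest of the key file, or None if it is missing.

    Reading the real key file normally requires elevated privileges;
    a PermissionError is deliberately not caught here.
    """
    try:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()
    except FileNotFoundError:
        return None


def mismatched_nodes(digests, reference_node):
    """Given {node: digest}, list the nodes whose digest differs from the reference."""
    reference = digests[reference_node]
    return sorted(node for node, digest in digests.items() if digest != reference)
```

A node that reports None (key not yet written) or a different digest at the time the remote resource starts would be consistent with the ordering problem described above, although it would not prove it.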
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0448
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days