Bug 1847976

Summary:	galera service has a hardcoded start timeout
Product:	Red Hat OpenStack	Reporter:	Damien Ciabrini <dciabrin>
Component:	puppet-tripleo	Assignee:	Damien Ciabrini <dciabrin>
Status:	CLOSED ERRATA	QA Contact:	David Rosenfeld <drosenfe>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	16.2 (Train)	CC:	dabarzil, jjoyce, jschluet, ljozsa, lmiccini, mbultel, michele, slinaber, tvignaud
Target Milestone:	beta	Keywords:	Triaged
Target Release:	16.2 (Train on RHEL 8.4)
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	puppet-tripleo-11.5.0-1.20200914161840.f716ef5.el8ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-09-15 07:08:44 UTC	Type:	Bug

Description Damien Ciabrini 2020-06-17 13:29:37 UTC

Description of problem:
In an HA control plane, the galera service is allowed at most 300 seconds to start on a node and join an existing galera cluster.

In the vast majority of cases this is more than sufficient given the amount of data an OpenStack cloud typically stores in the galera cluster, but sometimes a longer timeout is desirable.

For instance, on very large clouds or on slow machines, a full DB synchronization may transfer gigabytes of data across the network, and it sometimes takes more than the allowed 300s to finish.
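
As a rough illustration of how a full synchronization can exceed the limit, here is a back-of-the-envelope sketch. The 40 GB database size and the 1 Gbit/s effective link speed are assumed figures chosen for the example, not measurements from any deployment, and the function name is hypothetical.

```python
# Rough estimate of how long a full galera state transfer might take.
# All figures used below are illustrative assumptions, not measured values.

DEFAULT_PROMOTE_TIMEOUT_S = 300  # the hardcoded limit described in this bug


def sst_transfer_seconds(db_size_bytes: int, bandwidth_bits_per_s: int) -> float:
    """Return the time needed to move db_size_bytes over the given link.

    Ignores protocol overhead, disk speed and compression, so a real
    transfer would typically be slower than this estimate.
    """
    return db_size_bytes * 8 / bandwidth_bits_per_s


# Assumed scenario: 40 GB of data over an effective 1 Gbit/s link.
estimate = sst_transfer_seconds(40 * 10**9, 10**9)
exceeds_limit = estimate > DEFAULT_PROMOTE_TIMEOUT_S
```

Under those assumptions the transfer alone needs 320 seconds, so the node would miss the 300s limit before galera finishes joining the cluster.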

We'd need a way to configure that timeout in Director.

Version-Release number of selected component (if applicable):
puppet-tripleo-11.4.1-0.20200402130302.b4678ba.el8ost.noarch

How reproducible:
always

Steps to Reproduce:
1. deploy an HA overcloud
2. observe the promote timeout of the galera resource with "pcs resource show galera-bundle"


Actual results:
Pacemaker will fail to see the galera resource as started if it takes more than 300s to start.

Expected results:
A means of overriding the promote timeout would help overcome failures caused by long-running promotions.

Additional info:

Comment 4 dabarzil 2021-04-04 08:40:12 UTC

Hi Damien,
How can I force a promotion to take more than 300s, or override the promote timeout?

Comment 5 Damien Ciabrini 2021-04-06 13:06:57 UTC

Hey Daniel,

To validate this bz you need to deploy an overcloud with a hiera override somewhere in your templates, e.g. setting the timeout to 900s:

parameter_defaults:
  ExtraConfig:
    tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout: 900

Once the overcloud is deployed, you should see the configured galera promote timeout with:

# sudo pcs resource config galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
>>>              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s) <<<
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)
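
If the check needs to be scripted, the promote timeout can be pulled out of that output with a small parser. This is a minimal sketch, not part of the fix: the function name is hypothetical, and it assumes the timeout is printed in whole seconds with an "s" suffix, as in the output above.

```python
import re
from typing import Optional

# Matches the "promote" operation line and captures its timeout in seconds.
# Assumes the "timeout=<N>s" form shown in the pcs output above.
PROMOTE_RE = re.compile(r"\bpromote\b[^\n]*?\btimeout=(\d+)s")


def promote_timeout_seconds(pcs_output: str) -> Optional[int]:
    """Return the promote timeout found in `pcs resource config` output.

    Returns None when no promote operation with a timeout is present.
    """
    match = PROMOTE_RE.search(pcs_output)
    return int(match.group(1)) if match else None


# Excerpt of the output shown above, used as sample input.
SAMPLE = """\
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s)
              start interval=0s timeout=120s (galera-start-interval-0s)
"""

configured = promote_timeout_seconds(SAMPLE)
```

On a controller, the text passed to the function would be the captured output of "sudo pcs resource config galera", and the result can then be compared against the value set in the hiera override.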

Comment 8 errata-xmlrpc 2021-09-15 07:08:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483