Bug 1847976

Summary: galera service has a hardcoded start timeout
Product: Red Hat OpenStack Reporter: Damien Ciabrini <dciabrin>
Component: puppet-tripleo Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.2 (Train) CC: dabarzil, jjoyce, jschluet, ljozsa, lmiccini, mbultel, michele, slinaber, tvignaud
Target Milestone: beta Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: puppet-tripleo-11.5.0-1.20200914161840.f716ef5.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-15 07:08:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Damien Ciabrini 2020-06-17 13:29:37 UTC
Description of problem:
In an HA control plane, the galera service is allowed at most 300 seconds to start on a node and to join an existing galera cluster.

In the vast majority of cases this is more than sufficient for the amount of data an OpenStack cloud stores in the galera cluster, but sometimes a longer timeout is desirable.

For instance, on very large clouds, or on slow machines, a full DB synchronization may transfer gigabytes of data across the network, and sometimes it takes more than the allowed 300s to finish.
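
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The data size and throughput are hypothetical figures chosen only for illustration, not measurements from any deployment, and sync_seconds is a made-up helper name:

```python
PROMOTE_TIMEOUT_S = 300  # the hardcoded limit described in this report


def sync_seconds(data_gib: float, throughput_mib_s: float) -> float:
    """Estimate the seconds needed to transfer data_gib GiB at throughput_mib_s MiB/s."""
    return data_gib * 1024 / throughput_mib_s


if __name__ == "__main__":
    # Hypothetical figures: a 20 GiB full sync at an effective 50 MiB/s.
    estimate = sync_seconds(data_gib=20, throughput_mib_s=50)
    exceeds = estimate > PROMOTE_TIMEOUT_S
    print(f"estimated {estimate:.0f}s, limit {PROMOTE_TIMEOUT_S}s, exceeds limit: {exceeds}")
```

With figures in that range the transfer alone would already outlast the 300s limit, before counting any time galera spends applying the data.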

We'd need a way to configure that timeout in Director.

Version-Release number of selected component (if applicable):
puppet-tripleo-11.4.1-0.20200402130302.b4678ba.el8ost.noarch

How reproducible:
always

Steps to Reproduce:
1. deploy an HA overcloud
2. observe the promote timeout of the galera resource with "pcs resource show galera-bundle"


Actual results:
Pacemaker will fail to see the galera resource as started if it takes more than 300s to start.

Expected results:
A means of overriding the promote timeout would help overcome failures due to long-running promotions.

Additional info:

Comment 4 dabarzil 2021-04-04 08:40:12 UTC
Hi Damien,
How can I force a promotion to take over 300s, or override the promote timeout?

Comment 5 Damien Ciabrini 2021-04-06 13:06:57 UTC
Hey Daniel,

To validate this bz you need to deploy an overcloud with a hiera override somewhere in your templates, e.g. setting the timeout to 900s:

parameter_defaults:
  ExtraConfig:
    tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout: 900

Once the overcloud is deployed, you should see the configured galera promote timeout with:

# sudo pcs resource config galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
>>>              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s) <<<
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)
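
If the check needs to be automated, the promote timeout can be pulled out of that output with a small script. This is only a sketch: promote_timeout_seconds is a hypothetical helper, it assumes the output format shown above, and it assumes the timeout is expressed in seconds. The sample text is copied from the output in this comment.

```python
import re
from typing import Optional

# Matches the promote operation line of `pcs resource config` output, e.g.
#   promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s)
# Assumption: the timeout is given in seconds (a bare number or an "s" suffix).
_PROMOTE_RE = re.compile(r"\bpromote\b[^\n]*?\btimeout=(\d+)s?\b")


def promote_timeout_seconds(pcs_output: str) -> Optional[int]:
    """Return the promote timeout in seconds, or None if no promote op is found."""
    match = _PROMOTE_RE.search(pcs_output)
    return int(match.group(1)) if match else None


SAMPLE = """\
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s)
              start interval=0s timeout=120s (galera-start-interval-0s)
"""

if __name__ == "__main__":
    print(promote_timeout_seconds(SAMPLE))
```

On a controller, the text would come from the output of "sudo pcs resource config galera" instead of the SAMPLE string, and the result can be compared against the value set in the hiera override.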

Comment 8 errata-xmlrpc 2021-09-15 07:08:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483