Bug 1847976 - galera service has a hardcoded start timeout
Summary: galera service has a hardcoded start timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Damien Ciabrini
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-17 13:29 UTC by Damien Ciabrini
Modified: 2022-08-30 11:59 UTC (History)

Fixed In Version: puppet-tripleo-11.5.0-1.20200914161840.f716ef5.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 07:08:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 736135 0 None MERGED Make promote timeout configurable 2021-02-17 04:42:23 UTC
Red Hat Issue Tracker OSP-1776 0 None None None 2022-08-30 11:59:24 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:09:06 UTC

Description Damien Ciabrini 2020-06-17 13:29:37 UTC
Description of problem:
In an HA control plane, the galera service is allowed at most 300 seconds to start on a node and join an existing galera cluster.

While this is more than sufficient in the vast majority of cases, given the amount of data an OpenStack cloud typically stores in the galera cluster, a longer timeout is sometimes desirable.

For instance, on very large clouds or on slow machines, a full DB synchronization may transfer gigabytes of data across the network and sometimes takes more than the allowed 300s to finish.
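
As a rough illustration of why 300s can be too short, the sketch below estimates the duration of a full synchronization from a dataset size and an effective network throughput. The function name and the figures (40 GiB at 100 MiB/s) are hypothetical and were not measured on any deployment. Only the 300s limit comes from this report.

```python
def transfer_seconds(dataset_gib: float, throughput_mib_per_s: float) -> float:
    """Estimate the seconds needed to copy dataset_gib GiB at throughput_mib_per_s MiB/s."""
    if throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_gib * 1024 / throughput_mib_per_s


TIMEOUT_S = 300  # the limit described in this report

# Hypothetical figures: a 40 GiB database copied at an effective 100 MiB/s.
estimate = transfer_seconds(40, 100)
print(f"estimated transfer: {estimate:.1f}s, exceeds timeout: {estimate > TIMEOUT_S}")
```

The estimate ignores any overhead besides the raw copy, so a real synchronization would likely take longer.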

We'd need a way to configure that timeout in Director.

Version-Release number of selected component (if applicable):
puppet-tripleo-11.4.1-0.20200402130302.b4678ba.el8ost.noarch

How reproducible:
always

Steps to Reproduce:
1. deploy an HA overcloud
2. observe the promote timeout of the galera resource with "pcs resource show galera-bundle"


Actual results:
Pacemaker will fail to see the galera resource as started if it takes more than 300s to start.

Expected results:
A means of overriding the promote timeout would help overcome failures due to long-running promotions.

Additional info:

Comment 4 dabarzil 2021-04-04 08:40:12 UTC
Hi Damien,
How can I force a promotion to take over 300s, or override the promote timeout?

Comment 5 Damien Ciabrini 2021-04-06 13:06:57 UTC
Hey Daniel,

To validate this bz you need to deploy an overcloud with a hiera override somewhere in your templates, e.g. setting the timeout to 900s:

parameter_defaults:
  ExtraConfig:
    tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout: 900
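
If the override needs to be generated for several test values, a minimal sketch such as the following could render the snippet above. The helper name `render_environment` is hypothetical. The hiera key is the one shown above, and the output would still have to be saved into an environment file included in the deployment.

```python
# Hiera key taken from the override shown above.
HIERA_KEY = "tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout"


def render_environment(timeout_s: int) -> str:
    """Return the parameter_defaults snippet setting the galera promote timeout (seconds)."""
    if isinstance(timeout_s, bool) or not isinstance(timeout_s, int) or timeout_s <= 0:
        raise ValueError("timeout must be a positive integer number of seconds")
    return (
        "parameter_defaults:\n"
        "  ExtraConfig:\n"
        f"    {HIERA_KEY}: {timeout_s}\n"
    )


# Render the 900s override used in this comment.
print(render_environment(900), end="")
```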

Once the overcloud is deployed, you should see the configured galera promote timeout with:

# sudo pcs resource config galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
>>>              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s) <<<
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)
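
For an automated check, the promote timeout can be extracted from that output with a small script. This is only a sketch: the function name is hypothetical, and it assumes the operation line has the same shape as above, with the timeout expressed in seconds and an `s` suffix. The sample text is copied from the output above rather than read from a live cluster.

```python
import re
from typing import Optional


def promote_timeout_seconds(pcs_output: str) -> Optional[int]:
    """Return the promote operation timeout in seconds, or None if it is not found."""
    # Match "promote" as an operation name, then its timeout before the "(id)" part.
    pattern = re.compile(r"(?:^|\s)promote\s+[^()]*\btimeout=(\d+)s")
    for line in pcs_output.splitlines():
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None


# Excerpt of the "pcs resource config galera" output shown above.
SAMPLE = """\
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s)
              start interval=0s timeout=120s (galera-start-interval-0s)
"""

print(promote_timeout_seconds(SAMPLE))
```

On a controller, the input could come from capturing the output of the pcs command above instead of the embedded sample.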

Comment 8 errata-xmlrpc 2021-09-15 07:08:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

