1847976 – galera service has a hardcoded start timeout

Summary: galera service has a hardcoded start timeout

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-tripleo
Sub Component:
Version:	16.2 (Train)
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	beta
Target Release:	16.2 (Train on RHEL 8.4)
Assignee:	Damien Ciabrini
QA Contact:	David Rosenfeld
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-06-17 13:29 UTC by Damien Ciabrini
Modified:	2022-08-30 11:59 UTC
CC List:	9 users
Fixed In Version:	puppet-tripleo-11.5.0-1.20200914161840.f716ef5.el8ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-09-15 07:08:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	736135	None	MERGED	Make promote timeout configurable	2021-02-17 04:42:23 UTC
Red Hat Issue Tracker	OSP-1776	None	None	None	2022-08-30 11:59:24 UTC
Red Hat Product Errata	RHEA-2021:3483	None	None	None	2021-09-15 07:09:06 UTC

Description Damien Ciabrini 2020-06-17 13:29:37 UTC

Description of problem:
In an HA control plane, the galera service is allowed at most 300 seconds to start on a node and join an existing galera cluster.

In the vast majority of cases this is more than sufficient, given the amount of data an OpenStack cloud stores in the galera cluster, but sometimes a longer timeout is desirable.

For instance, on very large clouds or on slow machines, a full DB synchronization may transfer gigabytes of data across the network, and it sometimes takes more than the allowed 300s to finish.

We'd need a way to configure that timeout in Director.
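
As a rough illustration of why 300s can be too short, the sketch below estimates how long a full synchronization would take for a given database size and network throughput. The function name, the 40 GiB size and the 100 MiB/s throughput are hypothetical values chosen for the example, not measurements from a real deployment.

```python
# Fixed promote timeout described in this report, in seconds.
PROMOTE_TIMEOUT_S = 300


def full_sync_seconds(db_size_gib: float, throughput_mib_per_s: float) -> float:
    """Estimate the duration of a full DB transfer, ignoring any overhead."""
    return db_size_gib * 1024 / throughput_mib_per_s


# Hypothetical example: a 40 GiB database over a link sustaining 100 MiB/s.
estimate = full_sync_seconds(40, 100)
exceeds_timeout = estimate > PROMOTE_TIMEOUT_S

print(f"estimated sync time: {estimate:.1f}s, exceeds timeout: {exceeds_timeout}")
```

Under these assumed numbers the transfer alone would outlast the 300s limit, before accounting for any startup work done by the service itself.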

Version-Release number of selected component (if applicable):
puppet-tripleo-11.4.1-0.20200402130302.b4678ba.el8ost.noarch

How reproducible:
always

Steps to Reproduce:
1. deploy an HA overcloud
2. observe the promote timeout of the galera resource with "pcs resource show galera-bundle"


Actual results:
Pacemaker will fail to see the galera resource as started if it takes more than 300s to start.

Expected results:
A means of overriding the promote timeout would help overcome failures caused by long-running promotions.

Additional info:

Comment 4 dabarzil 2021-04-04 08:40:12 UTC

Hi Damien,
How can I force a promotion to take longer than 300s, or override the promote timeout?

Comment 5 Damien Ciabrini 2021-04-06 13:06:57 UTC

Hey Daniel,

To validate this bz, you need to deploy an overcloud with a hiera override somewhere in your templates, e.g. setting the timeout to 900s:

parameter_defaults:
  ExtraConfig:
    tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout: 900

Once the overcloud is deployed, you should see the configured galera promote timeout with:

# sudo pcs resource config galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
>>>              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s) <<<
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)
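
For automated validation, the check above could be scripted. The following is a minimal sketch, assuming the output of "pcs resource config galera" has been captured as text and that the timeout is expressed in seconds as in the listing above. The helper name promote_timeout_seconds and the shortened sample string are illustrative only.

```python
import re


def promote_timeout_seconds(pcs_output: str):
    """Return the promote timeout in seconds found in the pcs output, or None."""
    for line in pcs_output.splitlines():
        # Only consider the line describing the promote operation.
        if re.search(r"\bpromote\b", line):
            match = re.search(r"\btimeout=(\d+)s?\b", line)
            if match:
                return int(match.group(1))
    return None


# Shortened sample based on the output shown above.
sample = """\
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s)
              start interval=0s timeout=120s (galera-start-interval-0s)
"""

timeout = promote_timeout_seconds(sample)
print(f"promote timeout: {timeout}s")
```

The returned value can then be compared with the value set through the hiera override (900 in this example).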

Comment 8 errata-xmlrpc 2021-09-15 07:08:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483