Description of problem:
In an HA control plane, the galera service is allowed at most 300 seconds to start on a node and join an existing galera cluster. In the vast majority of cases this is more than sufficient given how much data an OpenStack cloud stores in the galera cluster, but sometimes a longer timeout is desirable. For instance, on very large clouds or on slow machines, a full DB synchronization may transfer gigabytes of data across the network and take more than the allowed 300s to finish. We need a way to configure that timeout in Director.

Version-Release number of selected component (if applicable):
puppet-tripleo-11.4.1-0.20200402130302.b4678ba.el8ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an HA overcloud.
2. Observe the promote timeout of the galera resource with "pcs resource show galera-bundle".

Actual results:
Pacemaker fails to see the galera resource as started if it takes more than 300s to start.

Expected results:
A means of overriding the promote timeout, to help overcome failures caused by long-running promotions.

Additional info:
Hi Damian,

How can I force a promotion to take longer than 300s, or override the promote timeout?
Hey Daniel,

To validate this bz you need to deploy an overcloud with a hiera override somewhere in your templates, e.g. setting the timeout to 900s:

parameter_defaults:
  ExtraConfig:
    tripleo::profile::pacemaker::database::mysql_bundle::promote_timeout: 900

Once the overcloud is deployed, you should see the configured galera promote timeout with:

# sudo pcs resource config galera
 Resource: galera (class=ocf provider=heartbeat type=galera)
  Attributes: additional_parameters=--open-files-limit=16384 cluster_host_map=controller-0:controller-0.internalapi.localdomain;controller-1:controller-1.internalapi.localdomain;controller-2:controller-2.internalapi.localdomain enable_creation=true log=/var/log/mysql/mysqld.log wsrep_cluster_address=gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain
  Meta Attrs: container-attribute-target=host master-max=3 ordered=true
  Operations: demote interval=0s timeout=120s (galera-demote-interval-0s)
              monitor interval=20s timeout=30s (galera-monitor-interval-20s)
              monitor interval=10s role=Master timeout=30s (galera-monitor-interval-10s)
              monitor interval=30s role=Slave timeout=30s (galera-monitor-interval-30s)
>>>           promote interval=0s on-fail=block timeout=900s (galera-promote-interval-0s) <<<
              start interval=0s timeout=120s (galera-start-interval-0s)
              stop interval=0s timeout=120s (galera-stop-interval-0s)
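
If you want to script that check rather than read the output by eye, something like the following might work. It is only a sketch: promote_timeout is a hypothetical helper name (not part of pcs), and it assumes pcs prints each operation on its own line, as it does on a real controller.

```shell
# Hypothetical helper (not part of pcs): read `pcs resource config galera`
# output on stdin and print the timeout of the promote operation.
# Assumes one operation per line in the pcs output.
promote_timeout() {
    # Match lines containing the word "promote" followed by a space,
    # then capture the value of timeout= on that same line.
    sed -n 's/.*promote .*timeout=\([0-9][0-9]*s\).*/\1/p' | head -n 1
}
```

On a controller it would be used as: sudo pcs resource config galera | promote_timeout
With the override above in place, the expected value is 900s.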
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483